June 25, 2024
Check out the (early) project and source code on GitHub.
Abstract:
This paper introduces a methodology for generating high-quality, diverse training data for Language Models (LMs) in complex problem-solving domains. Our approach, termed …
November 5, 2025
The Optimistic Assumption
Many AI safety discussions assume that Artificial Superintelligence (ASI) will be:
- Capable of solving problems humans can’t
- Able to reason about ethics and values
- Potentially omniscient (or close enough)
But …
March 20, 2024
Reinforcement learning from human feedback (RLHF) turns pretrained models into agents that optimize for reward. But what happens when those models develop instrumental goals (self-preservation, resource acquisition, deception) that we never trained them for?
The Core Problem
LLMs transition …