April 24, 2026
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Notes
Bypasses reward modeling entirely. Simpler alignment, same results.
RLHF applied to GPT-3. The bridge from raw LM to useful assistant.
How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.