April 24, 2026
Deep Reinforcement Learning from Human Preferences
Notes
Foundational RLHF paper: learning reward models from human preference comparisons between trajectory segments.
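A minimal sketch of what that comparison-based reward learning looks like, assuming a Bradley-Terry style objective as in the paper: the probability that one segment is preferred over another is a softmax over the two segments' summed predicted rewards. The `RewardModel` architecture, `preference_loss` helper, and all dimensions below are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Tiny MLP mapping an observation-action vector to a scalar reward.
    Input size and hidden width are illustrative choices."""
    def __init__(self, obs_act_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, timesteps, obs_act_dim) -> summed reward per segment
        return self.net(segment).squeeze(-1).sum(dim=-1)

def preference_loss(model, seg_a, seg_b, prefs):
    """Cross-entropy loss over human preference labels.
    prefs[i] = 1.0 if segment A was preferred, 0.0 if B was."""
    # Stack the two summed rewards so softmax compares the segments.
    logits = torch.stack([model(seg_a), model(seg_b)], dim=-1)
    log_p = torch.log_softmax(logits, dim=-1)
    # Maximize log P(human-preferred segment) for each comparison.
    return -(prefs * log_p[..., 0] + (1 - prefs) * log_p[..., 1]).mean()

# Usage on random data, just to show the shapes involved.
model = RewardModel()
seg_a = torch.randn(32, 25, 8)   # 32 comparisons, 25-step segments
seg_b = torch.randn(32, 25, 8)
prefs = torch.randint(0, 2, (32,)).float()
loss = preference_loss(model, seg_a, seg_b, prefs)
loss.backward()
```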
RLHF applied to GPT-3: the bridge from a raw language model to a useful assistant.
How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.