April 24, 2026
Deep Reinforcement Learning from Human Preferences
Notes
Foundational RLHF paper: learning reward models from human preference comparisons between trajectory segments.
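A minimal sketch of what that comparison-based reward learning looks like, assuming a Bradley-Terry style objective as in the paper: the probability that one segment is preferred over another is a softmax over the two segments' summed predicted rewards. The `RewardModel` architecture, `preference_loss` helper, and all dimensions below are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Tiny MLP mapping an observation-action vector to a scalar reward.
    Input size and hidden width are illustrative choices."""
    def __init__(self, obs_act_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, timesteps, obs_act_dim) -> summed reward per segment
        return self.net(segment).squeeze(-1).sum(dim=-1)

def preference_loss(model, seg_a, seg_b, prefs):
    """Cross-entropy loss over human preference labels.
    prefs[i] = 1.0 if segment A was preferred, 0.0 if B was."""
    # Stack the two summed rewards so softmax compares the segments.
    logits = torch.stack([model(seg_a), model(seg_b)], dim=-1)
    log_p = torch.log_softmax(logits, dim=-1)
    # Maximize log P(human-preferred segment) for each comparison.
    return -(prefs * log_p[..., 0] + (1 - prefs) * log_p[..., 1]).mean()

# Usage on random data, just to show the shapes involved.
model = RewardModel()
seg_a = torch.randn(32, 25, 8)   # 32 comparisons, 25-step segments
seg_b = torch.randn(32, 25, 8)
prefs = torch.randint(0, 2, (32,)).float()
loss = preference_loss(model, seg_a, seg_b, prefs)
loss.backward()
```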
RLHF applied to GPT-3: the bridge from a raw language model to a useful assistant.
How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.