Alignment & Safety

  • Training language models to follow instructions with human feedback (InstructGPT) by Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, et al. (2022) [paper] — RLHF applied to GPT-3; the bridge from a raw language model to a useful assistant.
  • Deep Reinforcement Learning from Human Preferences by Christiano, Leike, Brown, Martic, Legg, Amodei (2017) [paper] — The foundational RLHF paper: reward models are learned from pairwise human comparisons (a sketch of the comparison loss follows this list).
  • Constitutional AI: Harmlessness from AI Feedback by Bai, Kadavath, Kundu, Askell, Kernion, Jones, Chen, et al. (2022) [paper] — Self-critique and revision guided by a written set of principles instead of human harmlessness labels (the critique-and-revision loop is sketched below).
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov, Sharma, Mitchell, Ermon, Manning, Finn (2023) [paper] — Optimizes the policy directly on preference pairs, bypassing explicit reward modeling and RL, with comparable results (see the DPO loss sketch below).
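
Both RLHF papers above train a reward model on pairwise human comparisons using a Bradley-Terry objective. Below is a minimal sketch of that comparison loss, assuming scalar rewards have already been computed for each response in a pair; the function and tensor names are illustrative, not taken from either paper.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: the modeled probability that
    the chosen response beats the rejected one is
    sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards a model assigned to three comparison pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.4, 0.9, 1.5])
print(preference_loss(r_chosen, r_rejected))  # smaller when chosen > rejected
```

Minimizing this loss pushes the reward model to rank human-preferred responses above rejected ones; the resulting scalar reward is what the RL step then optimizes against.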
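Constitutional AI's supervised phase repeatedly critiques and revises the model's own response against written principles. A hedged sketch of that loop follows, where `generate` is a hypothetical stand-in for any LM completion call and the prompt templates are paraphrases, not the paper's exact wording.

```python
from typing import Callable

def constitutional_revision(prompt: str, principles: list[str],
                            generate: Callable[[str], str]) -> str:
    """Critique-then-revise loop: each principle is applied in turn, and the
    revision replaces the working response. `generate` is an assumed LM
    completion function, not an API from the paper."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the following response according to this principle: "
            f"{principle}\n\nResponse: {response}\n\nCritique:")
        response = generate(
            f"Rewrite the response to address the critique.\n\n"
            f"Response: {response}\nCritique: {critique}\n\nRevision:")
    # In the paper, revised responses become supervised fine-tuning data.
    return response
```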
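DPO collapses the reward model and RL step into a single classification-style loss on preference pairs. A minimal sketch, assuming each argument is the summed log-probability of a full response under the current policy or the frozen reference model; `beta` controls how far the policy may drift from the reference (0.1 is a commonly used value).

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor,
             logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: favor the chosen response by increasing the policy's
    log-probability ratio over the frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen      # implicit reward, chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # implicit reward, rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The log-probability ratio acts as an implicit reward, which is the sense in which the paper's title calls the language model "secretly a reward model."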