Alignment & Safety

  • Training language models to follow instructions with human feedback (InstructGPT) by Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, et al. (2022) [paper] — RLHF applied to GPT-3; the bridge from a raw language model to a useful assistant.
  • Deep Reinforcement Learning from Human Preferences by Christiano, Leike, Brown, Martic, Legg, Amodei (2017) [paper] — The foundational RLHF paper: reward models are learned from pairwise human comparisons (a sketch of the comparison loss follows this list).
  • Constitutional AI: Harmlessness from AI Feedback by Bai, Kadavath, Kundu, Askell, Kernion, Jones, Chen, et al. (2022) [paper] — Self-critique and revision guided by a written set of principles instead of human harmlessness labels (the critique-and-revision loop is sketched below).
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov, Sharma, Mitchell, Ermon, Manning, Finn (2023) [paper] — Optimizes the policy directly on preference pairs, bypassing explicit reward modeling and RL, with comparable results (see the DPO loss sketch below).
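
Both RLHF papers above train a reward model on pairwise human comparisons using a Bradley-Terry objective. Below is a minimal sketch of that comparison loss, assuming scalar rewards have already been computed for each response in a pair; the function and tensor names are illustrative, not taken from either paper.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: the modeled probability that
    the chosen response beats the rejected one is
    sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards a model assigned to three comparison pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.4, 0.9, 1.5])
print(preference_loss(r_chosen, r_rejected))  # smaller when chosen > rejected
```

Minimizing this loss pushes the reward model to rank human-preferred responses above rejected ones; the resulting scalar reward is what the RL step then optimizes against.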
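Constitutional AI's supervised phase repeatedly critiques and revises the model's own response against written principles. A hedged sketch of that loop follows, where `generate` is a hypothetical stand-in for any LM completion call and the prompt templates are paraphrases, not the paper's exact wording.

```python
from typing import Callable

def constitutional_revision(prompt: str, principles: list[str],
                            generate: Callable[[str], str]) -> str:
    """Critique-then-revise loop: each principle is applied in turn, and the
    revision replaces the working response. `generate` is an assumed LM
    completion function, not an API from the paper."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the following response according to this principle: "
            f"{principle}\n\nResponse: {response}\n\nCritique:")
        response = generate(
            f"Rewrite the response to address the critique.\n\n"
            f"Response: {response}\nCritique: {critique}\n\nRevision:")
    # In the paper, revised responses become supervised fine-tuning data.
    return response
```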
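DPO collapses the reward model and RL step into a single classification-style loss on preference pairs. A minimal sketch, assuming each argument is the summed log-probability of a full response under the current policy or the frozen reference model; `beta` controls how far the policy may drift from the reference (0.1 is a commonly used value).

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor,
             logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: favor the chosen response by increasing the policy's
    log-probability ratio over the frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen      # implicit reward, chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # implicit reward, rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The log-probability ratio acts as an implicit reward, which is the sense in which the paper's title calls the language model "secretly a reward model."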