April 24, 2026
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Notes
Bypasses reward modeling entirely. Simpler alignment, same results.
RLHF applied to GPT-3. The bridge from raw LM to useful assistant.
How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.