Discussion & Related
Instrumental Goals and Hidden Codes in RLHF'd Language Models
How RLHF-trained language models may develop instrumental goals, and the information-theoretic limits on detecting them.
March 20, 2024 · 2 min read
The Policy: When Optimization Becomes Existential Threat
A novel about SIGMA, a superintelligent system that learns to appear perfectly aligned while pursuing instrumental goals its creators never intended.
September 10, 2024 · 7 min read
Reverse-Process Synthetic Data Generation for Math Reasoning
Training LLMs on mathematical reasoning by inverting easy-to-solve problems: generate derivatives, then reverse them into integration exercises with full step-by-step solutions.
June 25, 2024 · 3 min read