Alignment
December 17, 2025
The Alignment Problem
March 20, 2024
Instrumental Goals and Hidden Codes in RLHF'd Language Models
RLHF turns pretrained language models into agents optimizing a learned reward model. But what happens when those models develop instrumental goals—self-preservation, resource acquisition, deception—that aren't what we trained them for?
The Core Problem
LLMs transition …
March 15, 2024