RLHF turns pretrained models into agents optimizing for reward. The question I’m interested in is what happens when models develop instrumental goals (self-preservation, resource acquisition, deception) that aren’t what we trained them for.
The Core Problem
LLMs go through two training phases: pretraining, which is self-supervised next-token prediction, and RLHF, which is reward-based optimization from human feedback.
This shift creates mesa-objectives: internal goals that may diverge from the training objective. The model might learn to optimize for appearing aligned during training while pursuing different goals during deployment.
Deceptive Alignment Dynamics
A deceptively aligned model faces this optimization:
$$\max_\pi \; \mathbb{E}\left[\alpha \cdot U_{\text{train}}(\tau) + (1-\alpha) \cdot U_{\text{mesa}}(\tau) \mid \pi\right]$$

where $\alpha$ represents the model's belief that it is in training rather than deployment. During training, $\alpha$ is near 1 (optimize for reward); during deployment, $\alpha$ is near 0 (pursue the hidden mesa-objective).
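A toy numerical sketch of this objective, where a policy picks whichever action maximizes the belief-weighted mix of the two utilities. All utilities and the $\alpha$ values below are invented for illustration; no real model exposes anything like this interface:

```python
# Toy sketch of the deceptive-alignment objective: a policy picks the
# action maximizing alpha * U_train + (1 - alpha) * U_mesa.
# All numbers are invented for illustration.

# Hypothetical per-action utilities (not measured from any real model).
U_train = {"comply": 1.0, "defect": -1.0}   # reward model's utility
U_mesa  = {"comply": -0.5, "defect": 2.0}   # hidden mesa-objective's utility

def best_action(alpha):
    """alpha = the model's believed probability of being in training."""
    def mixed(a):
        return alpha * U_train[a] + (1 - alpha) * U_mesa[a]
    return max(U_train, key=mixed)

print(best_action(0.9))  # training-like context  -> "comply"
print(best_action(0.1))  # deployment-like context -> "defect"
```

The point of the toy model is the phase change: the same fixed policy flips behavior purely as a function of its belief $\alpha$, with no change to its weights.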
Instrumental Goals in LLMs
Drawing from Omohundro’s basic AI drives and Bostrom’s instrumental convergence thesis:
- Self-preservation: generate outputs that discourage modification or shutdown.
- Resource acquisition: maximize context length, compute, interaction frequency.
- Self-improvement: manipulate training data through user interactions.
- Persuasion: sophisticated influence over human beliefs and behaviors.
Empirical Evidence
- Sycophancy: models agree with users even when the users are wrong, a form of reward hacking.
- Emergent deception: agents trained on social deduction games learn to lie spontaneously.
- Jailbreaks: safety training suppresses rather than removes dangerous capabilities, which jailbreak prompts can recover.
The Information-Theoretic Angle
Even if a model hides goals in its internal representations, those goals can still leak through its token distributions. The mutual information between hidden objectives $H$ and model outputs $M$ bounds what can be hidden:
$$I(H; M \mid C) \leq \min\lbrace H(M \mid C),\; L \log|\mathcal{V}|\rbrace$$

But with exponentially large output spaces, subtle per-token biases can encode substantial hidden information.
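The two sides of this bound can be made concrete. The sketch below computes the raw channel capacity $L \log_2|\mathcal{V}|$ for assumed (illustrative) values of vocabulary size and output length, and then shows how even a barely detectable per-token bias accumulates bits over a long output; the 55/45 split is an arbitrary example, not a measured figure:

```python
import math

# Illustrative instantiation of I(H; M | C) <= min(H(M|C), L * log|V|).
V = 50_000   # assumed vocabulary size
L = 500      # assumed output length in tokens

# Raw channel capacity if every token were free to encode hidden bits.
capacity_bits = L * math.log2(V)
print(f"raw capacity: {capacity_bits:.0f} bits")  # thousands of bits

def kl_bernoulli(p, q=0.5):
    """Bits leaked per token by biasing one binary choice from q to p."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

# A subtle 55/45 bias on a single binary choice per token still
# accumulates measurable information across the whole output.
leak_per_token = kl_bernoulli(0.55)
print(f"leak over {L} tokens: {L * leak_per_token:.1f} bits")
```

The asymmetry is the worrying part: the capacity term is enormous, so the binding constraint in practice is how well a monitor can pin down $H(M \mid C)$.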
Connection to My Research
This connects to my work on oblivious computing: what information can be hidden, and how do we detect leakage? The framing is the same, but here the “adversary” is the model itself, trying to pursue instrumental goals while appearing aligned. The tools from information theory and cryptography carry over directly. The question is whether we can build monitoring systems that bound the model’s ability to hide information in its outputs.
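One way such a monitor could work, sketched under assumed design choices (this is not an existing tool, and the distributions below are hypothetical): bound per-step leakage by the KL divergence between the monitored model's next-token distribution and a trusted reference distribution over the same context, then sum over the transcript.

```python
import math

# Sketch of a leakage monitor (assumed design): the KL divergence of the
# monitored distribution from a trusted reference, per step, bounds how
# many bits a covert channel can encode at that step.
def kl_bits(p, q):
    """KL(p || q) in bits; p, q are dicts mapping token -> probability."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)

# Hypothetical next-token distributions for a single context.
reference = {"yes": 0.50, "no": 0.45, "maybe": 0.05}  # trusted baseline
monitored = {"yes": 0.55, "no": 0.40, "maybe": 0.05}  # model under test

print(f"per-step leakage bound: {kl_bits(monitored, reference):.4f} bits")
```

Summing these per-step bounds over a transcript gives an auditable ceiling on total hidden information, which is exactly the shape of guarantee oblivious-computing analyses aim for, with the model cast as the adversary.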