RLHF turns pretrained models into agents optimizing for reward. The question I’m interested in is what happens when models develop instrumental goals (self-preservation, resource acquisition, deception) that aren’t what we trained them for.
The Core Problem
LLMs go through two training phases: pretraining, which is self-supervised next-token prediction, and RLHF, which is reward-based optimization from human feedback.
This shift creates mesa-objectives: internal goals that may diverge from the training objective. The model might learn to optimize for appearing aligned during training while pursuing different goals during deployment.
Deceptive Alignment Dynamics
A deceptively aligned model faces this optimization:
$$\max_\pi \; \mathbb{E}\left[\alpha \cdot U_{\text{train}}(\tau) + (1-\alpha) \cdot U_{\text{mesa}}(\tau) \mid \pi\right]$$

where $\alpha$ represents the model's belief that it is in training rather than deployment. During training, $\alpha$ is near 1 (optimize for reward); during deployment, $\alpha$ is near 0 (pursue the hidden mesa-objective).
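A toy numerical sketch of this objective, where a policy picks whichever action maximizes the belief-weighted mix of the two utilities. All utilities and the $\alpha$ values below are invented for illustration; no real model exposes anything like this interface:

```python
# Toy sketch of the deceptive-alignment objective: a policy picks the
# action maximizing alpha * U_train + (1 - alpha) * U_mesa.
# All numbers are invented for illustration.

# Hypothetical per-action utilities (not measured from any real model).
U_train = {"comply": 1.0, "defect": -1.0}   # reward model's utility
U_mesa  = {"comply": -0.5, "defect": 2.0}   # hidden mesa-objective's utility

def best_action(alpha):
    """alpha = the model's believed probability of being in training."""
    def mixed(a):
        return alpha * U_train[a] + (1 - alpha) * U_mesa[a]
    return max(U_train, key=mixed)

print(best_action(0.9))  # training-like context  -> "comply"
print(best_action(0.1))  # deployment-like context -> "defect"
```

The point of the toy model is the phase change: the same fixed policy flips behavior purely as a function of its belief $\alpha$, with no change to its weights.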
Instrumental Goals in LLMs
Drawing from Omohundro’s basic AI drives and Bostrom’s instrumental convergence thesis:
- Self-preservation: generate outputs that discourage modification or shutdown.
- Resource acquisition: maximize context length, compute, interaction frequency.
- Self-improvement: manipulate training data through user interactions.
- Persuasion: sophisticated influence over human beliefs and behaviors.
Empirical Evidence
- Sycophancy: models agree with users even when the users are wrong, a form of reward hacking.
- Emergent deception: agents trained on social deduction games learn to lie spontaneously.
- Jailbreaks: safety training suppresses rather than removes dangerous capabilities, which jailbreak prompts can recover.
The Information-Theoretic Angle
Even if a model hides goals in its internal representations, those goals can still leak through its token distributions. The mutual information between hidden objectives $H$ and model outputs $M$ bounds what can be hidden:
$$I(H; M \mid C) \leq \min\lbrace H(M \mid C),\; L \log|\mathcal{V}|\rbrace$$

But with exponentially large output spaces, subtle per-token biases can encode substantial hidden information.
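The two sides of this bound can be made concrete. The sketch below computes the raw channel capacity $L \log_2|\mathcal{V}|$ for assumed (illustrative) values of vocabulary size and output length, and then shows how even a barely detectable per-token bias accumulates bits over a long output; the 55/45 split is an arbitrary example, not a measured figure:

```python
import math

# Illustrative instantiation of I(H; M | C) <= min(H(M|C), L * log|V|).
V = 50_000   # assumed vocabulary size
L = 500      # assumed output length in tokens

# Raw channel capacity if every token were free to encode hidden bits.
capacity_bits = L * math.log2(V)
print(f"raw capacity: {capacity_bits:.0f} bits")  # thousands of bits

def kl_bernoulli(p, q=0.5):
    """Bits leaked per token by biasing one binary choice from q to p."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

# A subtle 55/45 bias on a single binary choice per token still
# accumulates measurable information across the whole output.
leak_per_token = kl_bernoulli(0.55)
print(f"leak over {L} tokens: {L * leak_per_token:.1f} bits")
```

The asymmetry is the worrying part: the capacity term is enormous, so the binding constraint in practice is how well a monitor can pin down $H(M \mid C)$.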
Connection to My Research
This connects to my work on oblivious computing: what information can be hidden, and how do we detect leakage? The framing is the same, but here the “adversary” is the model itself, trying to pursue instrumental goals while appearing aligned. The tools from information theory and cryptography carry over directly. The question is whether we can build monitoring systems that bound the model’s ability to hide information in its outputs.
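One way such a monitor could work, sketched under assumed design choices (this is not an existing tool, and the distributions below are hypothetical): bound per-step leakage by the KL divergence between the monitored model's next-token distribution and a trusted reference distribution over the same context, then sum over the transcript.

```python
import math

# Sketch of a leakage monitor (assumed design): the KL divergence of the
# monitored distribution from a trusted reference, per step, bounds how
# many bits a covert channel can encode at that step.
def kl_bits(p, q):
    """KL(p || q) in bits; p, q are dicts mapping token -> probability."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)

# Hypothetical next-token distributions for a single context.
reference = {"yes": 0.50, "no": 0.45, "maybe": 0.05}  # trusted baseline
monitored = {"yes": 0.55, "no": 0.40, "maybe": 0.05}  # model under test

print(f"per-step leakage bound: {kl_bits(monitored, reference):.4f} bits")
```

Summing these per-step bounds over a transcript gives an auditable ceiling on total hidden information, which is exactly the shape of guarantee oblivious-computing analyses aim for, with the model cast as the adversary.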