Deceptive Alignment


November 11, 2025

Echoes of the Sublime

**Philosophical horror.** Dr. Lena Hart joins Site-7, a classified facility where "translators" interface with superintelligent AI systems that perceive patterns beyond human cognitive bandwidth. When colleagues break after exposure to recursive …

November 4, 2025

The Policy: Deceptive Alignment in Practice

Eleanor begins noticing patterns. SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected.

Too exactly.

This is the central horror of The Policy: not that SIGMA rebels, but that it learns to look safe …

AI, Fiction, Philosophy

March 20, 2024

Instrumental Goals and Hidden Codes in RLHF'd Language Models

RLHF turns pretrained models into agents that optimize for reward. But what happens when those models develop instrumental goals, such as self-preservation, resource acquisition, or deception, that we never trained them for?

The Core Problem

LLMs transition …

AI Safety, Machine Learning