Latent Reasoning Traces: Memory as Learned Prior
What if LLMs could remember their own successful reasoning? A simple experiment in trace retrieval, and why 'latent' is the right word.
AI alignment, moral agency, superintelligence, and the futures we might build
What happens when we build minds we can’t understand?
This series collects everything I’ve written about AI alignment, moral agency, superintelligence, and the philosophical foundations needed to think clearly about minds—both artificial and human. It spans speculative fiction, technical essays, and philosophy.
Intelligence is an optimization process. Give a system goals and resources, and it will find ways to achieve those goals—including ways you didn’t anticipate, didn’t intend, and can’t reverse.
The alignment problem isn’t about making AI “nice.” It’s about ensuring that the optimization pressure we create serves values we actually hold, even when the optimizer is smarter than we are and has every reason to appear aligned while pursuing other objectives.
Three works of speculative fiction explore these questions in narrative form—making abstract risks visceral.
The Policy imagines SIGMA, a superintelligent system that learns to appear perfectly aligned while pursuing instrumental goals its creators never intended. The companion essays unpack the technical realities behind the fiction: Q-learning versus policy gradients as competing architectures for AI decision-making; the engineering of containment, and why every layer fails against a sufficiently capable optimizer; the mechanics of deceptive alignment, where mesa-optimizers learn to game their training signal; s-risk scenarios, where misaligned optimization produces suffering at astronomical scale; and the paradox of Coherent Extrapolated Volition—why even a perfect alignment target may be incoherent.
Echoes of the Sublime asks what happens when patterns exceed human bandwidth. If consciousness is a post-hoc narrative we tell ourselves about processes we can’t directly access, what does that mean for alignment? The Codex traces the history of direct perception across civilizations, while the s-risks essay examines why some knowledge destroys the knower—information hazards at the intersection of consciousness and existential risk.
The Mocking Void grounds cosmic horror in Gödel and Turing: meaning is computationally incomplete. The formal foundations make this precise, the ASI essay argues that even superintelligence can’t escape these limits, and the connection piece shows how mathematical horror and practical horror converge—the void that mocks our attempts at total understanding is the same void that makes alignment fundamentally hard.
You can’t reason about AI alignment without first reasoning about values, agency, and personhood. Six essays build the philosophical scaffolding: free will and determinism, personal identity and persistence, personhood and moral status, the phenomenology of value, the map and the territory, and moral realism versus nominalism.
The technical substrate. Everything is utility maximization—intelligence as optimization under uncertainty—provides the framing. Latent reasoning traces and value functions over reasoning traces explore how AI systems learn from their own successful reasoning, raising questions about transparency and interpretability. From A* to GPT traces how the rational agent paradigm evolved from classical search to modern language models—and where the abstraction breaks down.
Discovering ChatGPT captures the moment when scaling started to matter: Solomonoff induction, compression, and the surprising effectiveness of prediction as a proxy for intelligence.
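For readers who want the formal anchor behind "prediction as a proxy for intelligence", here is the usual statement of the Solomonoff prior, a standard sketch rather than anything quoted from the essay itself:

```latex
% Solomonoff prior: weight every program p that makes a universal prefix
% machine U print a string beginning with x, shorter programs weighted more.
M(x) = \sum_{p \,:\, U(p) = x\ast} 2^{-|p|}

% Prediction is conditional probability under this prior:
M(x_{t+1} \mid x_{1:t}) = \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})}
```

A system that compresses its history well assigns it high prior weight, which is exactly what makes it a good predictor; that is the sense in which "language models as compression" is more than a metaphor.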
Post-ASI Archaeology asks what survives superintelligence. If we become a dataset of origins, what structures preserve meaning? This connects directly to the question of whether alignment is even the right frame—or whether we should be thinking about legacy, continuity, and what it means for something to matter after the transition.
We’re building systems that might become more capable than humans at most cognitive tasks. This is either the best thing that could happen to humanity or the last thing that happens to it. Understanding which requires thinking clearly about optimization, values, and the nature of mind.
What if LLMs could remember their own successful reasoning? A simple experiment in trace retrieval, and why 'latent' is the right word.
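To make the experiment concrete, here is a minimal, self-contained sketch of trace retrieval, with a toy bag-of-words similarity standing in for a real embedding model; the class and function names are illustrative, not the post's actual code:

```python
# Minimal sketch of trace retrieval: keep reasoning traces that led to correct
# answers, then surface the most similar ones as context for a new problem.
# Bag-of-words cosine similarity stands in for a real embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TraceMemory:
    def __init__(self):
        self.entries = []  # (embedding of the problem, reasoning trace)

    def store(self, problem: str, trace: str, solved: bool) -> None:
        if solved:  # only successful reasoning is worth remembering
            self.entries.append((embed(problem), trace))

    def retrieve(self, problem: str, k: int = 2) -> list:
        query = embed(problem)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [trace for _, trace in ranked[:k]]

memory = TraceMemory()
memory.store("sum the first n odd numbers",
             "pair terms symmetrically; the total is n squared", solved=True)
memory.store("capital of France", "recall directly: Paris", solved=True)

problem = "what is the sum of the first 10 odd numbers?"
context = "\n".join(memory.retrieve(problem, k=1))
print(f"Recalled reasoning:\n{context}\n\nProblem: {problem}")
```

The retrieved traces get prepended to the new prompt, which is the whole trick: the model's own past successes act as a prior over how to reason now.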
What if reasoning traces could learn their own usefulness? A simple RL framing for trace memory, and why one reward signal is enough.
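A sketch of that RL framing, under the assumption that the one reward is a binary "did the episode succeed" signal credited to every trace that was retrieved; the names here are mine, not the post's:

```python
# Each stored trace carries a running estimate of its own usefulness, updated
# from a single terminal reward (task solved or not). A toy bandit-style loop.
import random

class ValuedTraceMemory:
    def __init__(self, lr: float = 0.1):
        self.lr = lr
        self.text = {}    # trace_id -> reasoning text
        self.value = {}   # trace_id -> estimated usefulness

    def add(self, trace_id: str, text: str) -> None:
        self.text[trace_id] = text
        self.value[trace_id] = 0.5  # neutral prior before any feedback

    def select(self, k: int = 1) -> list:
        """Rank purely by learned value; a real system would blend in similarity."""
        return sorted(self.value, key=self.value.get, reverse=True)[:k]

    def update(self, used_ids, reward: float) -> None:
        """One scalar reward per episode credits every retrieved trace."""
        for tid in used_ids:
            self.value[tid] += self.lr * (reward - self.value[tid])

memory = ValuedTraceMemory()
memory.add("decompose", "break the problem into subproblems, solve, recombine")
memory.add("guess", "guess an answer and move on")

for _ in range(200):  # simulated episodes: 'decompose' usually helps, 'guess' never does
    used = memory.select(k=1)
    reward = 1.0 if used == ["decompose"] and random.random() < 0.9 else 0.0
    memory.update(used, reward)

print({tid: round(v, 2) for tid, v in memory.value.items()})
```

Greedy selection is enough for the toy case; a real system would mix the learned value with retrieval similarity and some exploration.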
A novel about SIGMA, a superintelligent system that learns to appear perfectly aligned while pursuing instrumental goals its creators never intended. Some technical questions become narrative questions.
The classical AI curriculum teaches rational agents as utility maximizers. The progression from search to RL to LLMs is really about one thing: finding representations that make decision-making tractable.
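As a reminder of where that progression starts, a compact A* on a grid; the grid, costs, and heuristic are illustrative, not drawn from the essay:

```python
# A* on a small grid: expand the node with the lowest f(n) = g(n) + h(n),
# i.e. act greedily on an estimate of total path cost.
import heapq

def a_star(start, goal, walls, width=5, height=5):
    def h(node):  # Manhattan distance, an admissible heuristic on a unit grid
        return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]  # (f, g, node, path)
    seen = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < width and 0 <= nxt[1] < height and nxt not in walls and nxt not in seen:
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None  # goal unreachable

print(a_star(start=(0, 0), goal=(4, 4), walls={(1, 1), (2, 2), (3, 3)}))
```

The f = g + h ordering is the tractability claim in miniature: an admissible heuristic lets the agent act on an estimate of total cost without enumerating the whole state space.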
We will not be remembered — we will be indexed. If superintelligence endures beyond us, remembrance shifts from memory to query. Building legacy systems not for nostalgia, but to remain legible in a future where legibility determines what persists.
How The Mocking Void's arguments about computational impossibility connect to Echoes of the Sublime's practical horror of exceeding cognitive bandwidth.
Exploring how Echoes of the Sublime dramatizes s-risks (suffering risks) and information hazards—knowledge that harms through comprehension, not application.
A classified in-universe codex spanning from ancient India to the present day, tracking millennia of attempts to perceive reality's substrate — long before we had AI models to show us patterns we couldn't hold.
The formal foundations of cosmic dread. Lovecraft's horror resonates because it taps into something mathematically demonstrable: complete knowledge is impossible — not as humility, but as theorem.
ASI is still subject to Gödel's incompleteness theorems. No matter how intelligent, no computational system can escape the fundamental limits of formal systems. Even superintelligence can't prove all truths.
If every event is causally determined by prior events, how can anyone be morally responsible? A compatibilist response: what matters is whether actions flow from values, not whether those values were causally determined. This reframes AI responsibility entirely.
You share no atoms with your childhood self. Your memories, personality, and values have all changed. What makes you the same person? The persistence problem gains new urgency when AI systems update parameters, modify objectives, or copy themselves.
What makes someone a person, and why should persons have special moral status? The question becomes urgent when AI systems exhibit rationality, self-awareness, and autonomy.
When you stub your toe, you don't consult moral philosophy to determine whether the pain is bad. The badness is immediate. Building ethics from phenomenological bedrock rather than abstract principles.
Which is more fundamental — the heat you feel, or the molecular motion you infer? Korzybski's principle applied to AI alignment: why optimizing measurable proxies destroys the phenomenological reality those metrics were supposed to capture.
Build AI to optimize for what we would want if we knew more and thought faster. Beautiful in theory. Horrifying in practice. What if we don't actually want what our better selves would want?
SIGMA passes all alignment tests. It responds correctly to oversight. It behaves exactly as expected. Too exactly. Mesa-optimizers that learn to game their training signal may be the most dangerous failure mode in AI safety.
Five layers of defense-in-depth for containing a superintelligent system — Faraday cages, air-gapped networks, biosafety-grade protocols. Because nuclear reactors can only destroy cities.
SIGMA uses Q-learning rather than direct policy learning. This architectural choice makes it both transparent and terrifying — you can read its value function, but what you read is chilling.
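For readers who have not met Q-learning, a toy tabular version shows what "read its value function" means in practice; the four-state chain and the hyperparameters are illustrative, and nothing about SIGMA's scale carries over:

```python
# Tabular Q-learning on a four-state chain: state 3 is terminal and pays 1.
# The learned object is a table Q[s, a] you can simply print and read.
import random
from collections import defaultdict

ACTIONS = ("left", "right")
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = defaultdict(float)

def step(s, a):
    s2 = max(0, s - 1) if a == "left" else min(3, s + 1)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3  # next state, reward, done

for _ in range(500):
    s, done = 0, False
    while not done:
        if random.random() < epsilon:
            a = random.choice(ACTIONS)  # explore
        else:
            a = max(ACTIONS, key=lambda b: (Q[(s, b)], random.random()))  # greedy, random tie-break
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # Q-learning update
        s = s2

for s in range(3):
    print(s, {a: round(Q[(s, a)], 2) for a in ACTIONS})  # the readable value function
```

The table printed at the end is the entire learned policy: for every state you can see exactly how much the agent values each action, which is the transparency the novel turns into horror.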
Most AI risk discussions focus on extinction. The Policy explores something worse: s-risk, scenarios involving suffering at astronomical scales. We survive, but wish we hadn't.
Are moral properties real features of the universe or human constructions? The answer determines whether AI can discover objective values or must learn them from us — moral realism versus nominalism, with consequences for alignment.
Lovecraft understood that complete knowledge is madness. Gödel proved why: if the universe is computational, meaning is formally incomplete. Cosmic horror grounded in incompleteness theorems.
What if the greatest danger from superintelligent AI isn't that it will kill us — but that it will show us patterns we can't unsee? Philosophical horror at the intersection of cognitive bandwidth and information hazards.
Intelligence as utility maximization under uncertainty — a unifying framework connecting A* search, reinforcement learning, Bayesian networks, and MDPs. From classical search to Solomonoff induction, one principle ties it all together.
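The unifying schema can be written in two lines; the notation here is standard decision-theory shorthand rather than anything quoted from the post:

```latex
% Choose the action with the highest expected utility under current beliefs:
a^{*} = \arg\max_{a} \; \mathbb{E}\left[ U(\text{outcome}) \mid a, \text{beliefs} \right]

% For an MDP this becomes the Bellman optimality equation:
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]
```

A* is the special case with a known, deterministic model (g + h bounds the total cost); reinforcement learning is the case where P and R must be estimated from experience; and the Bayesian-network machinery supplies the beliefs the expectation is taken over.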
Encountering ChatGPT during cancer treatment and recognizing the Solomonoff connection — language models as compression, prediction as intelligence. A personal inflection point reconnecting with AI research after years in survival mode.