
Sequential Prediction

From Solomonoff induction to modern language models—exploring sequential prediction through information theory, Bayesian methods, and neural approaches

8 parts

What comes next?

This question drives one of the most fundamental problems in machine learning and information theory. Given a sequence x₁, x₂, …, xₙ₋₁, how do we predict xₙ? The answer connects Shannon’s information theory to Bayesian inference to the transformer architectures powering modern AI.
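
To make the setup concrete, here is a minimal sketch of the sequential prediction protocol in Python. The Predictor interface and function names are illustrative rather than part of the series' code; the scoring rule, cumulative log loss, is standard and is essentially the code length an arithmetic coder would achieve when driven by the model's predictions.

import math
from typing import Protocol, Sequence

class Predictor(Protocol):
    """Illustrative interface: map a history to a distribution over the next symbol."""
    def predict(self, history: Sequence[int]) -> dict[int, float]: ...
    def update(self, history: Sequence[int], symbol: int) -> None: ...

def cumulative_log_loss(model: Predictor, seq: Sequence[int]) -> float:
    """Sum of -log2 P(x_t | x_1, ..., x_{t-1}): the model's code length for seq, in bits."""
    bits = 0.0
    for t, x in enumerate(seq):
        p = model.predict(seq[:t]).get(x, 0.0)  # probability assigned to what actually came next
        bits += -math.log2(max(p, 1e-12))       # clip so a zero-probability event stays finite
        model.update(seq[:t], x)                # reveal the symbol before moving on
    return bits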

The Journey

This series traces the evolution of sequential prediction from its theoretical foundations to practical implementations:

Foundations (2010-2016): We begin with the formal problem definition and Solomonoff induction—the theoretical ideal that provides the gold standard for prediction. From there, we develop the Bayesian framework that underlies all principled approaches, explore the data generating processes that create sequences, and examine classical n-gram models.

Modern Perspectives (2024-2025): The later posts bring these classical ideas into conversation with modern neural language models. We compare CTW’s sample efficiency with transformers’ flexibility, and conclude with experimental validation of Context Tree Weighting on structured binary sequences.

What You’ll Learn

  • Information-theoretic foundations: The deep connection between prediction and compression
  • Solomonoff induction: Why the theoretically optimal predictor is incomputable
  • Bayesian model averaging: How CTW achieves optimality for tree sources
  • The KT estimator: The mathematical engine behind CTW’s predictions (a minimal version appears just after this list)
  • N-gram models to transformers: The evolution from counting to attention
  • Practical trade-offs: When to use CTW vs. neural approaches
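
For concreteness, the KT (Krichevsky-Trofimov) estimator mentioned above is an add-one-half rule, equivalent to placing a Beta(1/2, 1/2) prior on the bias of a coin. A minimal sketch, with an illustrative function name rather than the series' API:

def kt_predict_one(zeros: int, ones: int) -> float:
    """Krichevsky-Trofimov estimate of P(next bit = 1) given the counts seen so far."""
    return (ones + 0.5) / (zeros + ones + 1.0)

# With no data the estimate is 1/2; after 3 zeros and 7 ones it is 7.5 / 11 ≈ 0.68.
print(kt_predict_one(0, 0), kt_predict_one(zeros=3, ones=7))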

The Code

This series is accompanied by working code:

  • CTW implementation: Pure Python Context Tree Weighting with O(depth) updates
  • Markov generators: Configurable data generating processes for experiments
  • Experiments: Reproduce results showing CTW achieves Bayes-optimal accuracy

For example:
from experiments.ctw import CTW, generate_sequence

# Generate from a 3rd-order Markov process
seq = generate_sequence(10000, seed=42)

# CTW predictor with depth 3
ctw = CTW(max_depth=3)
accuracy = ctw.accuracy(seq)
print(f"Accuracy: {accuracy:.1%}")

Why This Matters

Sequential prediction isn’t just an academic exercise. It’s the core problem underlying:

  • Compression: Better predictions yield shorter codes via arithmetic coding (see the short example after this list)
  • Language models: GPT, Claude, and every other LLM are fundamentally next-token predictors
  • Anomaly detection: Surprising events have low predicted probability
  • Reinforcement learning: World models predict state transitions
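
As a quick illustration of the compression and anomaly-detection points, a symbol predicted with probability p costs -log2(p) bits under arithmetic coding, so confident predictions compress well and improbable events stand out. The helper name below is illustrative:

import math

def surprisal_bits(p: float) -> float:
    """Shannon surprisal: an event predicted with probability p costs -log2(p) bits to encode."""
    return -math.log2(p)

# A well-predicted symbol is cheap to encode; a poorly predicted one is expensive and flags an anomaly.
for p in (0.9, 0.5, 0.01):
    print(f"p = {p:.2f} -> {surprisal_bits(p):.2f} bits")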

Understanding the foundations—from Kolmogorov complexity to Bayesian model averaging—provides deep insight into why modern AI systems work and where their guarantees come from.

The CTW Story

A particular focus of this series is Context Tree Weighting (Willems et al., 1995), an algorithm that:

  • Performs exact Bayesian model averaging over all tree structures (sketched just after this list)
  • Achieves provable optimality for tree sources
  • Uses minimal compute (O(depth) per symbol)
  • Requires no hyperparameter tuning
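
To make the model-averaging claim concrete, here is a deliberately naive sketch of the CTW recursion for binary sequences: every internal node mixes its local KT estimate with the product of its children's weighted probabilities, with weight 1/2 on each hypothesis. This version walks the entire tree for clarity and is exponential in the depth; the real algorithm, including the implementation used in this series, touches only the O(depth) nodes along the current context path. Function names are illustrative, not the series' API.

import math
from collections import defaultdict

def log_kt(zeros: int, ones: int) -> float:
    """Natural log of the KT block probability for a node with the given counts."""
    return (math.lgamma(zeros + 0.5) + math.lgamma(ones + 0.5)
            - 2 * math.lgamma(0.5) - math.lgamma(zeros + ones + 1.0))

def ctw_log_prob(seq: list, depth: int) -> float:
    """Natural-log CTW weighted probability of seq[depth:], conditioned on the first `depth` bits."""
    counts = defaultdict(lambda: [0, 0])  # context tuple -> [#zeros, #ones] seen after it
    for t in range(depth, len(seq)):
        ctx = tuple(seq[t - depth:t])
        for k in range(depth + 1):                # every suffix of the context, root included
            counts[ctx[depth - k:]][seq[t]] += 1

    def weighted(node, d):
        zeros, ones = counts[node]
        log_e = log_kt(zeros, ones)               # hypothesis: the true tree stops at this node
        if d == depth:
            return log_e
        # hypothesis: the true tree splits here, refining the context one symbol further into the past
        log_split = weighted((0,) + node, d + 1) + weighted((1,) + node, d + 1)
        m = max(log_e, log_split)                 # log-sum-exp of the half-and-half mixture
        return math.log(0.5) + m + math.log(math.exp(log_e - m) + math.exp(log_split - m))

    return weighted((), 0)

# Predictive probability of the next bit is a ratio of weighted block probabilities.
bits = [1, 0, 1, 1, 0, 1, 1, 0, 1]
p_one = math.exp(ctw_log_prob(bits + [1], 3) - ctw_log_prob(bits, 3))
print(f"P(next = 1) = {p_one:.3f}")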

Our experiments confirm the theory: when CTW’s depth matches or exceeds the true order of a Markov source, it achieves the theoretical maximum accuracy—the Bayes-optimal prediction rate.
