
Sequential Prediction

From Solomonoff induction to modern language models: sequential prediction through information theory, Bayesian methods, and neural approaches

8 parts

Given a sequence x₁, x₂, …, xₙ₋₁, predict xₙ.

This is one of the most fundamental problems in machine learning. The theoretically optimal solution (Solomonoff induction) is incomputable. Every practical algorithm is an approximation, and every approximation encodes assumptions about what patterns are likely.
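To make the sequential-prediction setup concrete, here is a toy Bayesian predictor (my illustration, not code from the posts): the Laplace rule of succession, which assumes the symbols are i.i.d. bits with an unknown bias and places a uniform prior on that bias. Even this tiny predictor encodes an assumption about which patterns are likely, namely that order does not matter.

```python
def laplace_predict(history):
    """P(next symbol = 1 | history) under a uniform prior on the bias.

    Illustrative sketch: assumes binary symbols drawn i.i.d. with an
    unknown bias theta, with theta ~ Uniform(0, 1).
    """
    ones = sum(history)
    return (ones + 1) / (len(history) + 2)

print(laplace_predict([]))            # no data: 0.5
print(laplace_predict([1, 1, 0, 1]))  # (3 + 1) / (4 + 2), about 0.667
```

With no data it predicts 0.5, and it never assigns probability exactly 0 or 1, which is the hallmark of averaging over hypotheses rather than committing to one.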

These posts trace the problem from its information-theoretic foundations through Context Tree Weighting to modern neural language models.

The Arc

The early posts cover the formal problem definition, Solomonoff induction (the gold standard you cannot actually run), the Bayesian framework underlying all principled approaches, data generating processes, and classical n-gram models.
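As a taste of the classical end of that arc, a character-level n-gram predictor with add-one smoothing fits in a few lines. This is an illustrative toy under my own conventions, not code from the series:

```python
from collections import defaultdict

def ngram_predictor(text, n=3):
    """Build a character-level n-gram predictor with add-one smoothing.

    Returns a function mapping a context string to a distribution
    over the characters seen in the training text.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1
    vocab = sorted(set(text))

    def predict(context):
        context = context[-(n - 1):]          # keep only the last n-1 chars
        c = counts[context]
        total = sum(c.values())
        # Add-one (Laplace) smoothing: unseen continuations get small mass
        return {ch: (c[ch] + 1) / (total + len(vocab)) for ch in vocab}

    return predict

predict = ngram_predictor("abababababab", n=2)
print(predict("a"))  # "a" was always followed by "b", so P("b"|"a") is high
```

The smoothing constant is exactly the kind of baked-in assumption the posts discuss: it asserts that unseen continuations are improbable but never impossible.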

The later posts bring these ideas into conversation with transformers and LLMs. CTW achieves provable optimality for tree sources with minimal compute. Neural models achieve flexibility at the cost of guarantees. The tradeoff is real and worth understanding.

CTW

A particular focus is Context Tree Weighting (Willems et al., 1995). CTW performs exact Bayesian model averaging over all tree structures, achieves provable optimality for tree sources, runs in O(depth) per symbol, and requires no hyperparameter tuning. My experiments confirm the theory: when CTW’s depth matches or exceeds the true Markov order, it hits the Bayes-optimal prediction rate.
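For readers who want the shape of the algorithm before the full posts, here is a compact binary CTW sketch. It is my own simplification (the pre-sequence context is padded with zeros, and prediction rebuilds the tree per query rather than updating incrementally): each node tracks a Krichevsky-Trofimov (KT) estimator and mixes it, with weight 1/2, against the product of its children's weighted probabilities.

```python
import math

class CTWNode:
    """One context node: KT counts plus log-probabilities."""
    def __init__(self):
        self.counts = [0, 0]   # observed zeros and ones in this context
        self.lkt = 0.0         # log KT-estimator probability of data seen here
        self.lw = 0.0          # log CTW weighted probability
        self.children = {}     # keyed by the next-older context bit

def _update(node, context, bit):
    """Update the path for `context` (most recent bit first) with `bit`."""
    if context:
        child = node.children.setdefault(context[0], CTWNode())
        _update(child, context[1:], bit)
    # Sequential KT update: P(bit) = (count + 1/2) / (total + 1)
    node.lkt += math.log((node.counts[bit] + 0.5) / (sum(node.counts) + 1))
    node.counts[bit] += 1
    if not context:            # leaf at maximum depth: no split to average over
        node.lw = node.lkt
    else:                      # mix KT against the children's product, weight 1/2
        lwc = sum(c.lw for c in node.children.values())
        m = max(node.lkt, lwc)
        node.lw = math.log(0.5) + m + math.log(
            math.exp(node.lkt - m) + math.exp(lwc - m))

def ctw_logprob(bits, depth=3):
    """Log-probability CTW assigns to the whole bit sequence."""
    root, padded = CTWNode(), [0] * depth + list(bits)
    for i, b in enumerate(bits):
        _update(root, padded[i:i + depth][::-1], b)
    return root.lw

def predict_next(bits, depth=3):
    """P(next bit = 1 | bits), via the ratio of block probabilities."""
    lp0 = ctw_logprob(list(bits) + [0], depth)
    lp1 = ctw_logprob(list(bits) + [1], depth)
    m = max(lp0, lp1)
    return math.exp(lp1 - m) / (math.exp(lp0 - m) + math.exp(lp1 - m))
```

On an alternating sequence (an order-1 tree source), predict_next quickly concentrates on the pattern's continuation, the Bayes-optimal behavior described above. A real implementation updates the context path incrementally, which is where the O(depth)-per-symbol cost comes from.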

The connection between prediction and compression is not a metaphor. Better predictions yield shorter codes (arithmetic coding). Every LLM is fundamentally a next-token predictor. Understanding the foundations clarifies why these systems work and where they break.
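The link is just arithmetic: under (idealized) arithmetic coding, a symbol assigned probability p costs -log₂ p bits, so total code length equals the predictor's cumulative log loss. A minimal illustration:

```python
import math

def code_length_bits(probs_of_actual):
    """Ideal arithmetic-code length: sum of -log2 P(symbol | past).

    `probs_of_actual` are the probabilities a predictor assigned to the
    symbols that actually occurred, one per position.
    """
    return sum(-math.log2(p) for p in probs_of_actual)

# A uniform binary predictor needs 1 bit per symbol; a predictor that
# assigns 0.9 to each actual symbol needs about 0.152 bits per symbol.
print(code_length_bits([0.5] * 8))  # 8.0 bits
print(code_length_bits([0.9] * 8))  # about 1.216 bits
```

Better prediction is shorter code, with no slack in the correspondence beyond the rounding overhead of a real coder.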
