Introduction to Sequential Prediction
The problem of predicting what comes next, from compression to language models
From Solomonoff induction to modern language models: sequential prediction through information theory, Bayesian methods, and neural approaches
Given a sequence x₁, x₂, …, xₙ₋₁, predict xₙ.
This is one of the most fundamental problems in machine learning. The theoretically optimal solution (Solomonoff induction) is incomputable. Every practical algorithm is an approximation, and every approximation encodes assumptions about what patterns are likely.
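To make "every approximation encodes assumptions" concrete, here is a minimal sketch of one such approximation: a predictor that counts symbol frequencies in the history with add-one (Laplace) smoothing. The function name and alphabet are illustrative, not from any post; the point is only that the smoothing choice itself is an assumption about unseen symbols.

```python
from collections import Counter

def laplace_predict(history, alphabet):
    """Return P(next symbol) via add-one smoothing of history counts.

    A deliberately crude stand-in for the incomputable optimal predictor:
    it assumes symbol frequencies are stable and unseen symbols are possible.
    """
    counts = Counter(history)
    total = len(history) + len(alphabet)
    return {s: (counts[s] + 1) / total for s in alphabet}

probs = laplace_predict("abab", "ab")
# 'a' and 'b' each seen twice, so both get (2 + 1) / (4 + 2) = 0.5
```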
These posts trace the problem from its information-theoretic foundations through Context Tree Weighting to modern neural language models.
The early posts cover the formal problem definition, Solomonoff induction (the gold standard you cannot actually run), the Bayesian framework underlying all principled approaches, data generating processes, and classical n-gram models.
The later posts bring these ideas into conversation with transformers and LLMs. CTW achieves provable optimality for tree sources with minimal compute. Neural models achieve flexibility at the cost of guarantees. The tradeoff is real and worth understanding.
A particular focus is Context Tree Weighting (Willems et al., 1995). CTW performs exact Bayesian model averaging over all tree structures, achieves provable optimality for tree sources, runs in O(depth) per symbol, and requires no hyperparameter tuning. My experiments confirm the theory: when CTW’s depth matches or exceeds the true Markov order, it hits the Bayes-optimal prediction rate.
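The building block beneath CTW's mixture is the Krichevsky–Trofimov (KT) estimator, which each context node uses for its local predictions: after observing a zeros and b ones, it assigns P(next = 1) = (b + ½) / (a + b + 1). A minimal sketch (the full CTW recursion then mixes these node estimates with their children's, which is omitted here):

```python
def kt_prob_one(zeros, ones):
    """KT estimate of P(next bit = 1) given counts in one context."""
    return (ones + 0.5) / (zeros + ones + 1.0)

# With no observations the estimate is uniform:
p0 = kt_prob_one(0, 0)      # 0.5
# After one zero and three ones it leans toward 1:
p1 = kt_prob_one(1, 3)      # 3.5 / 5 = 0.7
```

The +½ pseudo-counts are what give CTW its clean redundancy bounds: the KT estimator's cumulative loss against any fixed coin is provably small, and the tree mixture inherits that guarantee.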
The connection between prediction and compression is not a metaphor. Better predictions yield shorter codes (arithmetic coding). Every LLM is fundamentally a next-token predictor. Understanding the foundations clarifies why these systems work and where they break.
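The prediction–compression identity can be stated in one line: under ideal arithmetic coding, a sequence costs Σₜ −log₂ p(xₜ) bits, where p(xₜ) is the probability the predictor assigned to the symbol that actually occurred. A small sketch (the specific probabilities are made up for illustration):

```python
import math

def code_length_bits(probs):
    """Ideal arithmetic-coding cost: sum of -log2 p(x_t) over the sequence.

    `probs` holds the probability the model assigned to each observed symbol.
    """
    return sum(-math.log2(p) for p in probs)

# A predictor assigning 0.9 to every observed symbol beats a uniform
# binary predictor (exactly 1 bit per symbol) by a wide margin:
sharp = code_length_bits([0.9] * 10)    # ~1.52 bits total
uniform = code_length_bits([0.5] * 10)  # exactly 10 bits
```

Better predictions yield shorter codes with no slack in the argument, which is why log-loss (cross-entropy) is simultaneously the natural training objective for language models and a compression rate.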
The posts, in order:

1. The problem of predicting what comes next, from compression to language models.
2. The optimal predictor is incomputable. What we can learn from it anyway.
3. Model averaging over hypotheses: the principled way to handle uncertainty in prediction.
4. Markov processes and tree sources: understanding where sequences come from.
5. The classical approach to sequence prediction: counting and smoothing.
6. The bias-data trade-off in sequential prediction: when to use CTW, n-grams, or neural language models.
7. The evolution of neural sequence prediction, and how it connects to classical methods.
8. Validating Context Tree Weighting through experiments, including a bug that changed everything.