What will the next symbol be?
This deceptively simple question lies at the heart of some of the most important problems in computer science and statistics. Given a sequence of observations—bits in a file, words in a sentence, prices in a market—can we predict what comes next?
Why Sequential Prediction Matters
The problem of sequential prediction underlies a remarkable range of applications:
Data Compression: Shannon’s foundational insight connects prediction directly to compression. If you can predict the next symbol with probability p, you only need -log₂(p) bits to encode it. Better predictions → shorter codes. This is why modern compression algorithms like PAQ and CMIX use sophisticated predictors internally.
Language Modeling: Every large language model—GPT, Claude, LLaMA—is fundamentally a next-token predictor. The transformer architecture learns to estimate P(next token | previous tokens), then uses this to generate text, answer questions, and reason about problems.
Time Series Forecasting: Stock prices, weather patterns, sensor readings, patient vitals—all involve predicting future values from historical sequences.
Anomaly Detection: If we have a good model of “normal” sequences, then low-probability events signal anomalies. Network intrusion detection, fraud detection, and quality control all use this principle.
Reinforcement Learning: World models in model-based RL predict how states transition given actions. Better predictions enable better planning.
The Formal Problem
Let’s be precise. We have:
- An alphabet A (could be binary {0, 1}, could be a vocabulary of 50,000 words)
- A sequence x₁, x₂, …, xₙ₋₁ of symbols from A
- The task: output a probability distribution P(xₙ | x₁, …, xₙ₋₁)
For binary sequences, this simplifies to predicting a single probability:
p = P(xₙ = 1 | x₁, ..., xₙ₋₁)
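As a concrete starting point, here is about the simplest possible estimate of that probability: Laplace's add-one rule, which counts the 1s seen so far and adds one pseudo-count per symbol. This is a toy sketch, not any of the algorithms discussed later, and the name `predict_p1` is purely illustrative.

```python
def predict_p1(history):
    """Laplace (add-one) estimate of P(next bit = 1 | history).

    With c ones among n observed bits, returns (c + 1) / (n + 2): exactly
    0.5 before any data, approaching the empirical frequency as n grows.
    """
    n = len(history)
    c = sum(history)              # number of 1s observed so far
    return (c + 1) / (n + 2)

print(predict_p1([]))             # 0.5, no evidence yet
print(predict_p1([1, 1, 0, 1]))   # 0.666..., mostly 1s so far
```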
Notice we’re not just making a point prediction (“the next bit will be 1”). We’re expressing our uncertainty as a full probability distribution. This matters because:
- We can be calibrated: among the cases where we say P = 0.7, the outcome should actually occur about 70% of the time (a quick way to check this is sketched after this list)
- We can encode optimally using the predicted probabilities (e.g., with arithmetic coding)
- We can make decisions that account for risk
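The calibration point can be checked empirically: group past predictions by their stated probability and compare each group's nominal probability with the observed frequency of 1s. A minimal sketch, with an illustrative function name and a coarse round-to-one-decimal grouping:

```python
from collections import defaultdict

def calibration_table(predictions, outcomes):
    """Group (prediction, outcome) pairs by predicted probability (rounded to
    one decimal) and report the observed frequency of 1s in each group."""
    buckets = defaultdict(list)
    for p, x in zip(predictions, outcomes):
        buckets[round(p, 1)].append(x)
    # a well-calibrated predictor has bucket key roughly equal to the observed frequency
    return {key: sum(xs) / len(xs) for key, xs in sorted(buckets.items())}

print(calibration_table([0.7, 0.72, 0.68, 0.2], [1, 1, 0, 0]))
# {0.2: 0.0, 0.7: 0.6666666666666666}
```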
Learning Paradigms
Online Learning
In online learning, we make predictions one at a time:
- Observe x₁, …, xₙ₋₁
- Output prediction P(xₙ | history)
- Observe true xₙ
- Update our model
- Repeat
This is the paradigm in which many classical algorithms, such as Context Tree Weighting (CTW), operate. The model never sees the full sequence upfront; it learns and predicts simultaneously, with no separate "training phase."
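A minimal sketch of that loop for binary data, with running symbol counts standing in for a real model such as CTW (all names here are illustrative):

```python
def online_predict(bits):
    """Run the observe / predict / reveal / update loop over a binary sequence.
    Returns the probability assigned to '1' before each bit was seen."""
    ones, total = 0, 0                 # the entire "model": running counts
    predictions = []
    for x in bits:
        p = (ones + 1) / (total + 2)   # predict from the history so far
        predictions.append(p)
        ones += x                      # the true bit is revealed ...
        total += 1                     # ... and the model is updated
    return predictions

print([round(p, 2) for p in online_predict([0, 1, 1, 1, 0, 1])])
# [0.5, 0.33, 0.5, 0.6, 0.67, 0.57]
```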
Batch Learning
In batch learning, we have access to training data upfront:
- Train on sequence(s) D_train
- Evaluate on held-out D_test
Most neural language models use batch learning, processing large corpora before deployment.
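For contrast, the same toy model in batch mode: the estimate is fit once on D_train, then held fixed while scoring D_test with the log loss defined in the next section (function names are illustrative):

```python
import math

def fit(train_bits):
    """Batch phase: estimate P(bit = 1) once from the training sequence."""
    return (sum(train_bits) + 1) / (len(train_bits) + 2)

def mean_log_loss(p, test_bits):
    """Average -log2 P(x) over held-out bits, with the estimate held fixed."""
    return sum(-math.log2(p if x == 1 else 1 - p) for x in test_bits) / len(test_bits)

p_hat = fit([1, 1, 0, 1, 1, 0, 1, 1])                  # D_train
print(round(mean_log_loss(p_hat, [1, 0, 1, 1]), 3))    # D_test: about 0.82 bits/symbol
```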
Measuring Performance
How do we evaluate a predictor?
Log Loss (Cross-Entropy)
The most principled measure is log loss:
L(P, x) = -log₂ P(x | history)
This measures “surprise”—how many bits we need to encode the actual outcome given our prediction. Key properties:
- Directly related to compression: optimal code length = -log₂ P(x)
- Proper scoring rule: expected loss is minimized when P equals the true distribution
- Heavily penalizes confident wrong predictions
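A direct computation makes the "bits of surprise" reading, and the heavy penalty for confident mistakes, concrete:

```python
import math

def log_loss(p, x):
    """Bits needed to encode the binary outcome x when we predicted P(x = 1) = p."""
    return -math.log2(p if x == 1 else 1 - p)

print(round(log_loss(0.5, 1), 3))    # 1.0   bit : fair-coin uncertainty
print(round(log_loss(0.9, 1), 3))    # 0.152 bits: confident and right
print(round(log_loss(0.9, 0), 3))    # 3.322 bits: confident and wrong
print(round(log_loss(0.99, 0), 3))   # 6.644 bits: very confident and very wrong
```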
Prediction Accuracy
For binary prediction, we can threshold and measure accuracy:
correct = 1 if (p >= 0.5 and x == 1) or (p < 0.5 and x == 0) else 0
Less informative than log loss, but more interpretable in many contexts.
Regret
Regret measures how much worse we do, over the whole sequence, than the best predictor in some comparison class chosen in hindsight:
Regret = Σ L(our predictor) - Σ L(best in class)
For CTW, the regret relative to the best tree model is bounded by O(k log n), where k is the number of leaves of that tree and n is the sequence length. This sublinear growth means CTW's per-symbol loss approaches that of the best tree model.
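As a toy illustration of the definition (of regret itself, not of the CTW bound), the sketch below measures the add-one counter from earlier against the best constant Bernoulli predictor chosen in hindsight; the comparison class and the function name are arbitrary choices:

```python
import math

def loss(p, x):
    return -math.log2(p if x == 1 else 1 - p)

def regret_vs_best_constant(bits):
    """Cumulative log loss of the online add-one counter minus the cumulative
    log loss of the best fixed Bernoulli predictor chosen in hindsight."""
    ones, total, online_loss = 0, 0, 0.0
    for x in bits:
        p = (ones + 1) / (total + 2)          # online prediction before seeing x
        online_loss += loss(p, x)
        ones += x
        total += 1
    best_q = ones / total                     # hindsight optimum: empirical frequency of 1s
    hindsight_loss = sum(loss(best_q, x) for x in bits)
    return online_loss - hindsight_loss

print(round(regret_vs_best_constant([1, 0, 1, 1, 0, 1, 1, 1]), 3))   # about 1.49 bits
```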
The Approaches
Different algorithms make different trade-offs:
| Approach | Key Idea | Strengths |
|---|---|---|
| Solomonoff Induction | Universal prior over computable sequences | Theoretically optimal |
| CTW | Bayesian averaging over tree structures | Sample efficient, strong guarantees |
| N-grams | Count-based conditional probabilities | Simple, interpretable, fast |
| Neural LMs | Learn representations from data | Flexible, scalable, powerful |
Solomonoff induction is the theoretical ideal—incomputable but optimal. CTW is a practical approximation that's provably near-optimal for an important subclass (tree sources). N-grams are the classical baseline. Neural models dominate modern applications but lack comparable theoretical guarantees.
The Bias-Variance Trade-off
Every predictor faces a fundamental tension:
- Simple models (short context): High bias, low variance. They can’t capture complex patterns but don’t need much data.
- Complex models (long context): Low bias, high variance. They can capture any pattern but may overfit.
CTW addresses this elegantly through Bayesian model averaging. Rather than choosing a single model complexity, it maintains a weighted mixture over all tree structures. Simple models get higher prior weight, but if the data supports complexity, the posterior shifts accordingly.
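A stripped-down sketch of that mechanism, mixing just two models (an order-0 counter and an order-1 counter that conditions on the previous bit) with weights proportional to prior times sequential likelihood; CTW does this recursively over all tree structures, so the class and function names below are purely illustrative:

```python
import math

class CountModel:
    """Add-one counter conditioned on the last `context_len` bits (0 = no context)."""
    def __init__(self, context_len):
        self.k = context_len
        self.counts = {}                              # context tuple -> [zeros, ones]

    def _ctx(self, history):
        return tuple(history[-self.k:]) if self.k else ()

    def predict(self, history):
        c0, c1 = self.counts.get(self._ctx(history), (0, 0))
        return (c1 + 1) / (c0 + c1 + 2)               # P(next = 1 | context)

    def update(self, history, x):
        self.counts.setdefault(self._ctx(history), [0, 0])[x] += 1

def mixture_run(bits):
    """Bayesian average of a context-0 and a context-1 model: each model's
    weight is its prior times the probability it has assigned to the data."""
    models = [CountModel(0), CountModel(1)]
    log_w = [math.log(0.5), math.log(0.5)]            # uniform prior over the two models
    history, mixture_preds = [], []
    for x in bits:
        probs = [m.predict(history) for m in models]
        z = sum(math.exp(w) for w in log_w)
        posterior = [math.exp(w) / z for w in log_w]
        mixture_preds.append(sum(w * p for w, p in zip(posterior, probs)))
        for i, (m, p) in enumerate(zip(models, probs)):
            log_w[i] += math.log(p if x == 1 else 1 - p)   # reweight by predictive fit
            m.update(history, x)
        history.append(x)
    z = sum(math.exp(w) for w in log_w)
    return mixture_preds, [math.exp(w) / z for w in log_w]

_, posterior = mixture_run([0, 1] * 10)               # alternating bits reward the context model
print(round(posterior[1], 4))                         # nearly all weight ends up on context-1
```

On a fair-coin sequence the simpler context-0 model would tend to keep more of the weight, since the context model pays a likelihood price for its extra parameters; the complexity trade-off resolves itself automatically.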
Neural language models take a different approach: throw massive amounts of data at a very flexible model. The scale of modern pretraining (billions of tokens) makes the variance problem manageable.
What’s Next
This series will trace the path from theoretical foundations to practical algorithms:
- Solomonoff Induction: The incomputable ideal—what perfect prediction would look like
- Bayesian Prediction: The framework underlying all principled approaches
- Data Generating Processes: Understanding the sources that produce sequences
- N-gram Models: The classical approach and its limitations
- Comparing Methods: Trade-offs between CTW, n-grams, and neural models
- Neural Language Models: From RNNs to transformers
- Experimental Results: Validating theory with practice
The journey connects algorithmic information theory from the 1960s to the transformer architectures of today. Understanding these foundations helps explain both why modern AI systems work and where their limitations come from.
Key Takeaways
- Sequential prediction is fundamental to compression, language modeling, and learning
- We predict distributions, not just point estimates
- Log loss (cross-entropy) is the principled measure, connected to optimal coding
- Different approaches trade off sample efficiency against flexibility
- Bayesian model averaging handles the complexity trade-off elegantly