What will the next symbol be?
This deceptively simple question lies at the heart of some of the most important problems in computer science and statistics. Given a sequence of observations—bits in a file, words in a sentence, prices in a market—can we predict what comes next?
Why Sequential Prediction Matters
The problem of sequential prediction underlies a remarkable range of applications:
Data Compression: Shannon’s foundational insight connects prediction directly to compression. If you can predict the next symbol with probability p, you only need -log₂(p) bits to encode it. Better predictions → shorter codes. This is why modern compression algorithms like PAQ and CMIX use sophisticated predictors internally.
Language Modeling: Every large language model—GPT, Claude, LLaMA—is fundamentally a next-token predictor. The transformer architecture learns to estimate P(next token | previous tokens), then uses this to generate text, answer questions, and reason about problems.
Time Series Forecasting: Stock prices, weather patterns, sensor readings, patient vitals—all involve predicting future values from historical sequences.
Anomaly Detection: If we have a good model of “normal” sequences, then low-probability events signal anomalies. Network intrusion detection, fraud detection, and quality control all use this principle.
Reinforcement Learning: World models in model-based RL predict how states transition given actions. Better predictions enable better planning.
The Formal Problem
Let’s be precise. We have:
- An alphabet A (could be binary {0, 1}, could be a vocabulary of 50,000 words)
- A sequence x₁, x₂, …, xₙ₋₁ of symbols from A
- The task: output a probability distribution P(xₙ | x₁, …, xₙ₋₁)
For binary sequences, this simplifies to predicting a single probability:
p = P(xₙ = 1 | x₁, ..., xₙ₋₁)
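As a concrete starting point, here is about the simplest possible estimate of that probability: Laplace's add-one rule, which counts the 1s seen so far and adds one pseudo-count per symbol. This is a toy sketch, not any of the algorithms discussed later, and the name `predict_p1` is purely illustrative.

```python
def predict_p1(history):
    """Laplace (add-one) estimate of P(next bit = 1 | history).

    With c ones among n observed bits, returns (c + 1) / (n + 2): exactly
    0.5 before any data, approaching the empirical frequency as n grows.
    """
    n = len(history)
    c = sum(history)              # number of 1s observed so far
    return (c + 1) / (n + 2)

print(predict_p1([]))             # 0.5, no evidence yet
print(predict_p1([1, 1, 0, 1]))   # 0.666..., mostly 1s so far
```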
Notice we’re not just making a point prediction (“the next bit will be 1”). We’re expressing our uncertainty as a full probability distribution. This matters because:
- We can be calibrated: among the cases where we say P = 0.7, the outcome should actually occur about 70% of the time (a quick way to check this is sketched after this list)
- We can encode optimally using the predicted probabilities (e.g., with arithmetic coding)
- We can make decisions that account for risk
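The calibration point can be checked empirically: group past predictions by their stated probability and compare each group's nominal probability with the observed frequency of 1s. A minimal sketch, with an illustrative function name and a coarse round-to-one-decimal grouping:

```python
from collections import defaultdict

def calibration_table(predictions, outcomes):
    """Group (prediction, outcome) pairs by predicted probability (rounded to
    one decimal) and report the observed frequency of 1s in each group."""
    buckets = defaultdict(list)
    for p, x in zip(predictions, outcomes):
        buckets[round(p, 1)].append(x)
    # a well-calibrated predictor has bucket key roughly equal to the observed frequency
    return {key: sum(xs) / len(xs) for key, xs in sorted(buckets.items())}

print(calibration_table([0.7, 0.72, 0.68, 0.2], [1, 1, 0, 0]))
# {0.2: 0.0, 0.7: 0.6666666666666666}
```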
Learning Paradigms
Online Learning
In online learning, we make predictions one at a time:
- Observe x₁, …, xₙ₋₁
- Output prediction P(xₙ | history)
- Observe true xₙ
- Update our model
- Repeat
This is the paradigm in which many classical algorithms, such as Context Tree Weighting (CTW), operate. The model never sees the full sequence upfront; it learns and predicts simultaneously, with no separate "training phase."
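A minimal sketch of that loop for binary data, with running symbol counts standing in for a real model such as CTW (all names here are illustrative):

```python
def online_predict(bits):
    """Run the observe / predict / reveal / update loop over a binary sequence.
    Returns the probability assigned to '1' before each bit was seen."""
    ones, total = 0, 0                 # the entire "model": running counts
    predictions = []
    for x in bits:
        p = (ones + 1) / (total + 2)   # predict from the history so far
        predictions.append(p)
        ones += x                      # the true bit is revealed ...
        total += 1                     # ... and the model is updated
    return predictions

print([round(p, 2) for p in online_predict([0, 1, 1, 1, 0, 1])])
# [0.5, 0.33, 0.5, 0.6, 0.67, 0.57]
```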
Batch Learning
In batch learning, we have access to training data upfront:
- Train on sequence(s) D_train
- Evaluate on held-out D_test
Most neural language models use batch learning, processing large corpora before deployment.
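For contrast, the same toy model in batch mode: the estimate is fit once on D_train, then held fixed while scoring D_test with the log loss defined in the next section (function names are illustrative):

```python
import math

def fit(train_bits):
    """Batch phase: estimate P(bit = 1) once from the training sequence."""
    return (sum(train_bits) + 1) / (len(train_bits) + 2)

def mean_log_loss(p, test_bits):
    """Average -log2 P(x) over held-out bits, with the estimate held fixed."""
    return sum(-math.log2(p if x == 1 else 1 - p) for x in test_bits) / len(test_bits)

p_hat = fit([1, 1, 0, 1, 1, 0, 1, 1])                  # D_train
print(round(mean_log_loss(p_hat, [1, 0, 1, 1]), 3))    # D_test: about 0.82 bits/symbol
```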
Measuring Performance
How do we evaluate a predictor?
Log Loss (Cross-Entropy)
The most principled measure is log loss:
L(P, x) = -log₂ P(x | history)
This measures “surprise”—how many bits we need to encode the actual outcome given our prediction. Key properties:
- Directly related to compression: optimal code length = -log₂ P(x)
- Proper scoring rule: expected loss is minimized when P equals the true distribution
- Heavily penalizes confident wrong predictions
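A direct computation makes the "bits of surprise" reading, and the heavy penalty for confident mistakes, concrete:

```python
import math

def log_loss(p, x):
    """Bits needed to encode the binary outcome x when we predicted P(x = 1) = p."""
    return -math.log2(p if x == 1 else 1 - p)

print(round(log_loss(0.5, 1), 3))    # 1.0   bit : fair-coin uncertainty
print(round(log_loss(0.9, 1), 3))    # 0.152 bits: confident and right
print(round(log_loss(0.9, 0), 3))    # 3.322 bits: confident and wrong
print(round(log_loss(0.99, 0), 3))   # 6.644 bits: very confident and very wrong
```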
Prediction Accuracy
For binary prediction, we can threshold and measure accuracy:
correct = 1 if (p >= 0.5 and x == 1) or (p < 0.5 and x == 0) else 0
Less informative than log loss, but more interpretable in many contexts.
Regret
Regret measures how much worse we do, over the whole sequence, than the best predictor in some comparison class chosen in hindsight:
Regret = Σ L(our predictor) - Σ L(best in class)
For CTW, the regret relative to the best tree model is bounded by O(k log n), where k is the number of leaves of that tree and n is the sequence length. This sublinear growth means CTW's per-symbol loss approaches that of the best tree model.
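As a toy illustration of the definition (of regret itself, not of the CTW bound), the sketch below measures the add-one counter from earlier against the best constant Bernoulli predictor chosen in hindsight; the comparison class and the function name are arbitrary choices:

```python
import math

def loss(p, x):
    return -math.log2(p if x == 1 else 1 - p)

def regret_vs_best_constant(bits):
    """Cumulative log loss of the online add-one counter minus the cumulative
    log loss of the best fixed Bernoulli predictor chosen in hindsight."""
    ones, total, online_loss = 0, 0, 0.0
    for x in bits:
        p = (ones + 1) / (total + 2)          # online prediction before seeing x
        online_loss += loss(p, x)
        ones += x
        total += 1
    best_q = ones / total                     # hindsight optimum: empirical frequency of 1s
    hindsight_loss = sum(loss(best_q, x) for x in bits)
    return online_loss - hindsight_loss

print(round(regret_vs_best_constant([1, 0, 1, 1, 0, 1, 1, 1]), 3))   # about 1.49 bits
```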
The Approaches
Different algorithms make different trade-offs:
| Approach | Key Idea | Strengths |
|---|---|---|
| Solomonoff Induction | Universal prior over computable sequences | Theoretically optimal |
| CTW | Bayesian averaging over tree structures | Sample efficient, strong guarantees |
| N-grams | Count-based conditional probabilities | Simple, interpretable, fast |
| Neural LMs | Learn representations from data | Flexible, scalable, powerful |
Solomonoff induction is the theoretical ideal—incomputable but optimal. CTW is a practical approximation that's provably near-optimal for an important subclass (tree sources). N-grams are the classical baseline. Neural models dominate modern applications but lack comparable theoretical guarantees.
The Bias-Variance Trade-off
Every predictor faces a fundamental tension:
- Simple models (short context): High bias, low variance. They can’t capture complex patterns but don’t need much data.
- Complex models (long context): Low bias, high variance. They can capture any pattern but may overfit.
CTW addresses this elegantly through Bayesian model averaging. Rather than choosing a single model complexity, it maintains a weighted mixture over all tree structures. Simple models get higher prior weight, but if the data supports complexity, the posterior shifts accordingly.
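A stripped-down sketch of that mechanism, mixing just two models (an order-0 counter and an order-1 counter that conditions on the previous bit) with weights proportional to prior times sequential likelihood; CTW does this recursively over all tree structures, so the class and function names below are purely illustrative:

```python
import math

class CountModel:
    """Add-one counter conditioned on the last `context_len` bits (0 = no context)."""
    def __init__(self, context_len):
        self.k = context_len
        self.counts = {}                              # context tuple -> [zeros, ones]

    def _ctx(self, history):
        return tuple(history[-self.k:]) if self.k else ()

    def predict(self, history):
        c0, c1 = self.counts.get(self._ctx(history), (0, 0))
        return (c1 + 1) / (c0 + c1 + 2)               # P(next = 1 | context)

    def update(self, history, x):
        self.counts.setdefault(self._ctx(history), [0, 0])[x] += 1

def mixture_run(bits):
    """Bayesian average of a context-0 and a context-1 model: each model's
    weight is its prior times the probability it has assigned to the data."""
    models = [CountModel(0), CountModel(1)]
    log_w = [math.log(0.5), math.log(0.5)]            # uniform prior over the two models
    history, mixture_preds = [], []
    for x in bits:
        probs = [m.predict(history) for m in models]
        z = sum(math.exp(w) for w in log_w)
        posterior = [math.exp(w) / z for w in log_w]
        mixture_preds.append(sum(w * p for w, p in zip(posterior, probs)))
        for i, (m, p) in enumerate(zip(models, probs)):
            log_w[i] += math.log(p if x == 1 else 1 - p)   # reweight by predictive fit
            m.update(history, x)
        history.append(x)
    z = sum(math.exp(w) for w in log_w)
    return mixture_preds, [math.exp(w) / z for w in log_w]

_, posterior = mixture_run([0, 1] * 10)               # alternating bits reward the context model
print(round(posterior[1], 4))                         # nearly all weight ends up on context-1
```

On a fair-coin sequence the simpler context-0 model would tend to keep more of the weight, since the context model pays a likelihood price for its extra parameters; the complexity trade-off resolves itself automatically.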
Neural language models take a different approach: throw massive amounts of data at a very flexible model. The scale of modern pretraining (billions of tokens) makes the variance problem manageable.
What’s Next
This series will trace the path from theoretical foundations to practical algorithms:
- Solomonoff Induction: The incomputable ideal—what perfect prediction would look like
- Bayesian Prediction: The framework underlying all principled approaches
- Data Generating Processes: Understanding the sources that produce sequences
- N-gram Models: The classical approach and its limitations
- Comparing Methods: Trade-offs between CTW, n-grams, and neural models
- Neural Language Models: From RNNs to transformers
- Experimental Results: Validating theory with practice
The journey connects algorithmic information theory from the 1960s to the transformer architectures of today. Understanding these foundations helps explain both why modern AI systems work and where their limitations come from.
Key Takeaways
- Sequential prediction is fundamental to compression, language modeling, and learning
- We predict distributions, not just point estimates
- Log loss (cross-entropy) is the principled measure, connected to optimal coding
- Different approaches trade off sample efficiency against flexibility
- Bayesian model averaging handles the complexity trade-off elegantly