Introduction to Sequential Prediction

What will the next symbol be?

This question sounds trivial. It is not. Given a sequence of observations (bits in a file, words in a sentence, prices in a market), can we predict what comes next? Most of the interesting problems in computer science and statistics reduce to some version of this.

Why Sequential Prediction Matters

Data Compression. Shannon showed that prediction and compression are the same problem. If you predict the next symbol with probability p, you need -log2(p) bits to encode it. Better predictions give shorter codes. This is why modern compression algorithms like PAQ and CMIX have sophisticated predictors at their core.
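This prediction-to-code-length correspondence is easy to see numerically. The sketch below (the helper name is illustrative) computes the Shannon code length implied by a predicted probability:

```python
import math

# Per Shannon: a symbol predicted with probability p costs
# -log2(p) bits to encode under an optimal code.
def code_length_bits(p):
    return -math.log2(p)

# A confident, correct prediction is cheap to encode...
print(code_length_bits(0.99))   # ≈ 0.0145 bits
# ...while a surprising symbol is expensive.
print(code_length_bits(0.01))   # ≈ 6.64 bits
```

The better the predictor, the higher the probability it assigns to what actually occurs, and the fewer bits the arithmetic coder needs.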

Language Modeling. Every language model is a next-token predictor. The architecture learns to estimate P(next token | previous tokens), then uses that to generate text, answer questions, and reason. That’s what GPT, LLaMA, and the rest are doing.

Time Series. Stock prices, weather, sensor readings, patient vitals. All the same structure: predict future values from past sequences.

Anomaly Detection. If you have a good model of “normal” sequences, low-probability events signal anomalies. Network intrusion detection, fraud detection, and quality control all use this.

Reinforcement Learning. World models in model-based RL predict state transitions given actions. Better predictions mean better planning.

The Formal Problem

Let me be precise. We have:

  • An alphabet A (could be binary {0, 1}, could be a vocabulary of 50,000 words)
  • A sequence x1, x2, …, xn-1 of symbols from A
  • The task: output a probability distribution P(xn | x1, …, xn-1)

For binary sequences, this simplifies to predicting a single probability:

p = P(xn = 1 | x1, ..., xn-1)

Notice we are not just making a point prediction (“the next bit will be 1”). We are expressing our uncertainty as a full probability distribution. This matters because:

  1. We can be calibrated. Among the predictions where we say P=0.7, the event should occur 70% of the time.
  2. We can encode optimally using the predicted probabilities.
  3. We can make decisions that account for risk.
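For binary sequences, one of the simplest well-founded estimators of this probability is the Krichevsky-Trofimov (KT) rule, which CTW uses at each node of its tree. A minimal sketch (function name is illustrative):

```python
# The Krichevsky-Trofimov estimator: after seeing `ones` 1s and
# `zeros` 0s, predict P(next = 1) = (ones + 1/2) / (n + 1).
def kt_predict(ones, zeros):
    return (ones + 0.5) / (ones + zeros + 1.0)

ones = zeros = 0
for bit in [1, 1, 0, 1, 1, 1]:
    p = kt_predict(ones, zeros)  # probability assigned to "1" before seeing bit
    if bit == 1:
        ones += 1
    else:
        zeros += 1
```

Note that before any data the estimator outputs 0.5, and it never assigns probability exactly 0 or 1, so no outcome is infinitely surprising.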

Learning Paradigms

Online Learning

In online learning, we make predictions one at a time:

  1. Observe x1, …, xn-1
  2. Output prediction P(xn | history)
  3. Observe true xn
  4. Update the model
  5. Repeat

Classical algorithms like Context Tree Weighting (CTW) work this way. The model never sees the full sequence upfront. It learns and predicts simultaneously. There is no separate training phase.
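The five-step protocol above can be sketched in a few lines. Here a simple Laplace-smoothed frequency predictor stands in for CTW (names and the choice of predictor are illustrative):

```python
import math

# The online protocol: predict, observe, pay log loss, update.
def online_loop(sequence):
    ones, total, log_loss = 0, 0, 0.0
    for x in sequence:
        # 1-2. Predict P(next = 1) from history (Laplace-smoothed counts).
        p = (ones + 1) / (total + 2)
        # 3. Observe the true symbol and pay the log loss in bits.
        log_loss += -math.log2(p if x == 1 else 1 - p)
        # 4. Update the model, then repeat.
        ones += x
        total += 1
    return log_loss
```

Notice there is no train/test split: every symbol is first predicted, then learned from.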

Batch Learning

In batch learning, we have training data upfront:

  1. Train on sequence(s) D_train
  2. Evaluate on held-out D_test

Most neural language models use batch learning, processing large corpora before deployment.
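The batch protocol can be sketched with the simplest possible model, a smoothed unigram over characters (illustrative only; real language models condition on context):

```python
import math
from collections import Counter

# Batch protocol: fit on D_train, then measure log loss on held-out D_test.
def train(corpus):
    counts = Counter(corpus)
    total = len(corpus)
    vocab_size = len(set(corpus))
    # Laplace smoothing so unseen test symbols get nonzero probability.
    return lambda s: (counts[s] + 1) / (total + vocab_size)

d_train = "abababab"
d_test = "abba"
model = train(d_train)
test_loss = sum(-math.log2(model(s)) for s in d_test) / len(d_test)
```

The model is frozen at evaluation time, in contrast to the online setting above.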

Measuring Performance

How do we evaluate a predictor?

Log Loss (Cross-Entropy)

The most principled measure is log loss:

L(P, x) = -log2 P(x | history)

This measures surprise: how many bits we need to encode the actual outcome given our prediction. Key properties:

  • Directly related to compression: optimal code length = -log2 P(x)
  • Proper scoring rule: minimized when P equals the true distribution
  • Heavily penalizes confident wrong predictions
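The proper-scoring-rule property is worth verifying numerically: if the true probability of a 1 is q, the expected log loss is minimized by predicting exactly p = q (a small illustrative check, not a proof):

```python
import math

# Expected log loss when we predict p but the true P(x = 1) is q:
# q * (-log2 p) + (1 - q) * (-log2(1 - p)).
def expected_log_loss(p, q):
    return -(q * math.log2(p) + (1 - q) * math.log2(1 - p))

q = 0.7
losses = {p: expected_log_loss(p, q) for p in (0.5, 0.6, 0.7, 0.8, 0.9)}
best = min(losses, key=losses.get)  # the minimum sits at p = q = 0.7
```

No other scoring rule with this property is as directly tied to code length, which is why log loss dominates in both compression and language modeling.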

Prediction Accuracy

For binary prediction, we can threshold and measure accuracy:

correct = 1 if (p >= 0.5 and x == 1) or (p < 0.5 and x == 0) else 0

Less informative than log loss, but more interpretable.

Regret

Regret measures how much worse we do compared to the best predictor in some reference class:

Regret = sum L(our predictor) - sum L(best in class)

For CTW, regret relative to the best tree model is bounded by O(k log n) where k is the number of leaves. This sublinear growth means CTW’s per-symbol performance approaches optimal.
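Regret is easy to compute against a simple reference class. The sketch below (illustrative, not the CTW bound itself) compares an online Laplace predictor with the best fixed probability chosen in hindsight:

```python
import math

# Regret of an online Laplace predictor vs. the best constant
# predictor selected after seeing the whole sequence.
def regret(sequence):
    ones, total, online_loss = 0, 0, 0.0
    for x in sequence:
        p = (ones + 1) / (total + 2)  # Laplace rule
        online_loss += -math.log2(p if x == 1 else 1 - p)
        ones += x
        total += 1
    # Best fixed probability in hindsight: the empirical frequency of 1s.
    q = ones / total
    best_loss = sum(-math.log2(q if x == 1 else 1 - q) for x in sequence)
    return online_loss - best_loss
```

For this predictor the regret grows roughly logarithmically in the sequence length, so the per-symbol gap vanishes, which is the same qualitative behavior as CTW's O(k log n) bound.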

The Approaches

Different algorithms make different trade-offs:

| Approach | Key Idea | Strengths |
| --- | --- | --- |
| Solomonoff Induction | Universal prior over computable sequences | Theoretically optimal |
| CTW | Bayesian averaging over tree structures | Sample efficient, strong guarantees |
| N-grams | Count-based conditional probabilities | Simple, interpretable, fast |
| Neural LMs | Learn representations from data | Flexible, scalable, powerful |

Solomonoff induction is the theoretical ideal. It is incomputable but optimal. CTW is a practical approximation that is provably optimal for an important subclass. N-grams are the classical baseline. Neural models dominate modern applications but lack theoretical guarantees.

The Bias-Variance Trade-off

Every predictor faces a fundamental tension:

  • Simple models (short context): High bias, low variance. They cannot capture complex patterns but they do not need much data.
  • Complex models (long context): Low bias, high variance. They can capture any pattern but may overfit.

CTW handles this through Bayesian model averaging. Instead of choosing a single model complexity, it maintains a weighted mixture over all tree structures. Simple models get higher prior weight, but if the data supports complexity, the posterior shifts accordingly.
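The mechanics of Bayesian model averaging fit in a few lines. Here two toy models of a binary stream are mixed (the models and names are hypothetical; CTW mixes over tree structures in exactly this way):

```python
# Bayesian model averaging over two toy models of a binary stream:
# model A always predicts P(1) = 0.5; model B predicts P(1) = 0.9.
def mixture_predict(sequence, prior=(0.5, 0.5)):
    w_a, w_b = prior
    for x in sequence:
        p_a, p_b = 0.5, 0.9
        # The mixture's prediction is the weight-averaged probability.
        p = w_a * p_a + w_b * p_b
        # Posterior update: reweight each model by its likelihood of x.
        like_a = p_a if x == 1 else 1 - p_a
        like_b = p_b if x == 1 else 1 - p_b
        w_a, w_b = w_a * like_a, w_b * like_b
        z = w_a + w_b
        w_a, w_b = w_a / z, w_b / z
    return w_a, w_b

# After a run of 1s, the posterior concentrates on model B.
w_a, w_b = mixture_predict([1] * 20)
```

No model is ever discarded; models that predict poorly simply have their weight driven toward zero by the data.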

Neural language models take a different approach: throw massive amounts of data at a very flexible model. The scale of pretraining (billions of tokens) makes the variance problem manageable.

What’s Next

This series traces the path from theoretical foundations to practical algorithms:

  1. Solomonoff Induction: The incomputable ideal, what perfect prediction would look like
  2. Bayesian Prediction: The framework underlying all principled approaches
  3. Data Generating Processes: Understanding the sources that produce sequences
  4. N-gram Models: The classical approach and its limitations
  5. Comparing Methods: Trade-offs between CTW, n-grams, and neural models
  6. Neural Language Models: From RNNs to transformers
  7. Experimental Results: Validating theory with practice

The thread connecting these topics runs from information theory in the 1960s to current transformer architectures. Understanding these foundations helps explain both why modern AI systems work and where they break down.

Key Takeaways

  • Sequential prediction is fundamental to compression, language modeling, and learning
  • We predict distributions, not point estimates
  • Log loss (cross-entropy) is the principled measure, connected to optimal coding
  • Different approaches trade off sample efficiency against flexibility
  • Bayesian model averaging handles the complexity trade-off without committing to a single model