In the previous post, we saw that Solomonoff induction is optimal but incomputable. Its power comes from maintaining a distribution over all possible programs rather than committing to any single hypothesis.
This is Bayesian model averaging in its purest form. In this post, we develop the Bayesian framework for sequential prediction and introduce the Krichevsky-Trofimov (KT) estimator—the mathematical engine powering Context Tree Weighting.
The Core Idea: Embrace Uncertainty
The key insight of Bayesian prediction: don’t pick a single model—average over all of them.
Model Selection (the wrong way)
Traditional approach:
- Evaluate candidate models on data
- Pick the best one (by AIC, BIC, cross-validation, etc.)
- Make predictions using only the selected model
Problems:
- Ignores uncertainty about which model is correct
- Overconfident predictions
- Winner-take-all loses information from runner-up models
Model Averaging (the right way)
Bayesian approach:
- Assign prior probabilities to models
- Update to posterior probabilities given data
- Average predictions across all models, weighted by posterior
Benefits:
- Accounts for model uncertainty
- Better calibrated predictions
- More robust to model misspecification
- No need to “select” a model
The Bayesian Recipe
Prior Distribution
We start with a prior P(θ) over model parameters (or model structures):
- Expresses beliefs before seeing data
- Should encode reasonable inductive biases
- In CTW: implicit prior over tree structures (0.5 weight to “stop here” vs “go deeper”)
Likelihood
The likelihood P(data | θ) measures how well each model explains the observations:
P(x₁, x₂, ..., xₙ | θ) = ∏ᵢ P(xᵢ | x₁, ..., xᵢ₋₁, θ)
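The chain rule above can be checked numerically. Here is a minimal sketch assuming an i.i.d. Bernoulli model with parameter θ (the function name and test sequence are illustrative):

```python
import math

def bernoulli_likelihood(bits, theta):
    """Sequence likelihood as a product of one-step conditionals.

    For an i.i.d. Bernoulli model, P(x_i | x_1..x_{i-1}, theta) does not
    depend on the past, so the product collapses to
    theta^(#ones) * (1 - theta)^(#zeros).
    """
    p = 1.0
    for bit in bits:
        p *= theta if bit == 1 else (1.0 - theta)
    return p

bits = [1, 0, 1, 1, 0]
theta = 0.6
# Chain-rule product equals the closed form theta^b * (1-theta)^a:
assert math.isclose(bernoulli_likelihood(bits, theta), theta**3 * (1 - theta)**2)
```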
Posterior Distribution
After observing data D, we update via Bayes’ rule:
P(θ | D) = P(D | θ) × P(θ) / P(D)
The posterior concentrates on models that both:
- Had reasonable prior probability
- Explain the data well
Posterior Predictive Distribution
For prediction, we average over all models:
P(xₙ | x₁...xₙ₋₁) = ∫ P(xₙ | θ) × P(θ | x₁...xₙ₋₁) dθ
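For a Bernoulli likelihood with a Beta prior this integral has a closed form, but approximating it on a grid makes the averaging explicit: every value of θ votes, weighted by its posterior. A sketch, assuming a uniform Beta(1, 1) prior (the grid resolution and counts are illustrative choices):

```python
# Posterior predictive P(next bit = 1) by averaging P(1 | theta) = theta
# over the posterior on a grid of theta values, with a flat Beta(1,1) prior.
N = 10_000
thetas = [(i + 0.5) / N for i in range(N)]  # midpoint grid on (0, 1)

a, b = 3, 1  # observed: 3 zeros, 1 one
# Unnormalized posterior: likelihood * prior (the prior is flat here).
post = [t**b * (1 - t)**a for t in thetas]
Z = sum(post)

# Each "model" theta votes its prediction, weighted by its posterior mass.
p_next_one = sum(t * w for t, w in zip(thetas, post)) / Z

# Closed form for Beta(1,1): (b + 1) / (a + b + 2) = 2/6
print(p_next_one)  # ~0.3333
```

The closed form in the comment is the Laplace rule that appears below; the grid average recovers it to several decimal places.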
This is the Bayesian predictive distribution—each model votes, weighted by its posterior probability.
The KT Estimator: Bayesian Bernoulli Prediction
The Krichevsky-Trofimov (KT) estimator provides optimal Bayesian prediction for a sequence of coin flips.
Setup
We observe a sequence of bits with:
- a zeros observed
- b ones observed
Question: what’s P(next bit = 1)?
Maximum Likelihood Estimate (MLE)
The obvious answer:
P(1) = b / (a + b)
Problem: after a run of only zeros, the MLE assigns P(1) = 0. A single subsequent 1 then receives zero probability, which means infinite log loss (and, under arithmetic coding, an unbounded codeword).
Laplace Smoothing
Add-one smoothing (equivalent to Beta(1,1) = uniform prior):
P(1) = (b + 1) / (a + b + 2)
Better, but not optimal.
KT Estimator
Uses Beta(0.5, 0.5)—the Jeffreys prior:
P(1) = (b + 0.5) / (a + b + 1)
Why 0.5? The Jeffreys prior is:
- Uninformative in the invariance sense (derived from the Fisher information, invariant under reparameterization)
- Minimax optimal (best worst-case redundancy)
- Theoretically justified for sequential prediction
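The three estimators can be compared directly after an all-zeros run. A small sketch (the n = 0 convention for the MLE is an illustrative choice):

```python
def mle(a, b):
    # Undefined at n = 0; we return 0.5 by convention for illustration.
    return b / (a + b) if a + b else 0.5

def laplace(a, b):
    return (b + 1) / (a + b + 2)

def kt(a, b):
    return (b + 0.5) / (a + b + 1)

# After observing 10 zeros and no ones:
a, b = 10, 0
print(mle(a, b))      # 0.0  -- a subsequent 1 would incur infinite log loss
print(laplace(a, b))  # 1/12 ≈ 0.0833
print(kt(a, b))       # 0.5/11 ≈ 0.0455
```

KT hedges less aggressively than Laplace here: the Jeffreys prior concentrates mass near 0 and 1, so strong evidence for "mostly zeros" pulls the prediction down faster.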
Sequential Probability
The KT probability of a sequence with a zeros and b ones:
Pₑ(a, b) = Γ(a + 0.5) × Γ(b + 0.5) / (Γ(0.5)² × Γ(a + b + 1))
where Γ(0.5)² = π; this normalization gives Pₑ(0, 0) = 1.
This can be computed sequentially:
Pₑ(a+1, b) = Pₑ(a, b) × (a + 0.5) / (a + b + 1)
Pₑ(a, b+1) = Pₑ(a, b) × (b + 0.5) / (a + b + 1)
No need to restart from scratch—just multiply by a simple factor.
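The incremental updates can be carried in log space to avoid underflow, and checked against the closed form (math.lgamma computes log Γ; the normalizer Γ(0.5)² = π makes Pₑ(0, 0) = 1). A sketch:

```python
import math

class KT:
    """Sequential KT probability of the bits seen so far."""
    def __init__(self):
        self.a = 0          # zeros seen
        self.b = 0          # ones seen
        self.log_p = 0.0    # log P_e of the sequence so far

    def predict_one(self):
        return (self.b + 0.5) / (self.a + self.b + 1)

    def update(self, bit):
        p = self.predict_one() if bit == 1 else 1.0 - self.predict_one()
        self.log_p += math.log(p)
        if bit == 1:
            self.b += 1
        else:
            self.a += 1

def kt_closed_form(a, b):
    # log of Gamma(a+0.5) * Gamma(b+0.5) / (Gamma(0.5)^2 * Gamma(a+b+1))
    return (math.lgamma(a + 0.5) + math.lgamma(b + 0.5)
            - 2 * math.lgamma(0.5) - math.lgamma(a + b + 1))

est = KT()
for bit in [0, 0, 1, 0, 1, 1, 0]:
    est.update(bit)
assert math.isclose(est.log_p, kt_closed_form(est.a, est.b))
```

One multiply (one add in log space) per symbol, exactly as the recurrences promise.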
CTW’s Bayesian Mixture
Context Tree Weighting uses the KT estimator at each node, then performs Bayesian averaging over tree structures.
At each node with local KT estimate Pₑ and children with weighted estimates P_w(child_0) and P_w(child_1):
P_w = 0.5 × Pₑ + 0.5 × P_w(child_0) × P_w(child_1)
This recursion:
- Assigns 0.5 prior weight to “stop here” (use local KT estimate)
- Assigns 0.5 prior weight to “go deeper” (use children’s predictions)
- Implicitly averages over all possible tree structures
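The recursion can be sketched on a tiny tree of fixed KT counts. This is a simplified illustration (static counts rather than sequential updates through the tree), not a full CTW implementation:

```python
import math

def kt_prob(a, b):
    """KT probability of a sequence with a zeros and b ones."""
    return math.exp(math.lgamma(a + 0.5) + math.lgamma(b + 0.5)
                    - 2 * math.lgamma(0.5) - math.lgamma(a + b + 1))

def weighted_prob(node):
    """CTW mixture: P_w = 0.5 * P_e + 0.5 * P_w(child_0) * P_w(child_1)."""
    a, b, children = node
    p_e = kt_prob(a, b)
    if children is None:          # leaf at maximum depth: no mixing
        return p_e
    c0, c1 = children
    return 0.5 * p_e + 0.5 * weighted_prob(c0) * weighted_prob(c1)

# Depth-1 tree: the root saw (4 zeros, 4 ones), but conditioned on the
# previous bit the counts split cleanly: context 0 gave only zeros,
# context 1 gave only ones.
leaf0 = (4, 0, None)
leaf1 = (0, 4, None)
root = (4, 4, (leaf0, leaf1))

# The split explains the data far better than the root's i.i.d. model,
# so the mixture exceeds the root's own KT probability.
print(weighted_prob(root) > kt_prob(4, 4))  # True
```

When the data are genuinely i.i.d., the children's counts look like the root's and the 0.5 weight on "stop here" keeps the mixture close to the shallow model instead.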
Example: Depth-2 Tree
For a depth-2 CTW, the root’s P_w averages over five tree structures:
- Root as a leaf: use only the root’s KT estimate (prior weight 1/2)
- Root split, both children leaves (weight 1/8)
- Root split, one child split further, the other a leaf (weight 1/8, two such patterns)
- Full depth-2 tree (weight 1/8)
The weights come from a factor of 1/2 per “stop” or “split” decision (leaves at maximum depth cost nothing), and they sum to 1. The number of tree structures grows doubly exponentially with depth—N(d) = 1 + N(d−1)², giving 1, 2, 5, 26, 677, ...—yet all are implicitly weighted by the recursion!
Why Bayesian Works
Automatic Complexity Control
- Simple models (shallow trees) are preferred a priori via the 0.5 split
- But if data supports complexity, posterior shifts to deeper trees
- No need to hand-tune regularization parameters
Calibrated Uncertainty
- Predictions reflect actual epistemic uncertainty
- Important for decision-making, active learning, exploration
- Better than overconfident point estimates
Theoretical Guarantees
For tree sources, CTW achieves:
- Redundancy bounded by O(k log n) for k-leaf optimal tree
- Converges to true distribution at optimal rate
- Minimax optimal within its model class
Bayesian vs. Frequentist
| Aspect | Bayesian (CTW) | Frequentist (MLE) |
|---|---|---|
| Parameters | Distribution over values | Point estimate |
| Model selection | Average over models | Choose one |
| Small samples | Handles gracefully | Often overfits |
| Computation | Sometimes tractable | Usually simpler |
| Guarantees | Posterior consistency | Asymptotic normality |
For sequential prediction with limited data, the Bayesian approach dominates.
Practical Implications
When to Use Bayesian Model Averaging
- Limited data: Uncertainty matters more
- Multiple plausible models: No clear winner
- Calibration required: Need honest uncertainty
- Online learning: Sequential updates natural
When Simpler Approaches Suffice
- Massive data: Posterior concentrates anyway
- Single clearly best model: Averaging unnecessary
- Computation constrained: MLE is faster
- Interpretability: Single model easier to explain
The Hierarchy of Estimation
Full Bayesian (maintain full posterior)
↓
MAP (maximum a posteriori point estimate)
↓
MLE (maximum likelihood, flat prior)
↓
Heuristics (ad-hoc smoothing)
CTW operates at the top level—full Bayesian averaging over tree structures with KT estimation at leaves.
Key Takeaways
- Model averaging beats model selection when uncertainty matters
- The KT estimator provides optimal sequential Bernoulli prediction
- CTW combines KT estimation with Bayesian averaging over trees
- The 0.5/0.5 split implements an implicit Occam’s razor prior
- Bayesian methods excel with limited data and online learning
What’s Next
The next post examines the data generating processes that produce sequences. Understanding Markov processes and tree sources helps us see why CTW’s model class is both tractable and powerful.
Further Reading
- Gelman, A., et al. (2013). Bayesian Data Analysis.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Chapter 3.
- Rissanen, J. (1984). “Universal coding, information, prediction, and estimation.”
- Willems, F. M. J., et al. (1995). “The context-tree weighting method: basic properties.”