In the previous post, we saw that Solomonoff induction is optimal but incomputable. Its power comes from maintaining a distribution over all possible programs rather than committing to any single hypothesis.
This is Bayesian model averaging in its purest form. In this post, we develop the Bayesian framework for sequential prediction and introduce the Krichevsky-Trofimov (KT) estimator—the mathematical engine powering Context Tree Weighting.
The Core Idea: Embrace Uncertainty
The key insight of Bayesian prediction: don’t pick a single model—average over all of them.
Model Selection (the wrong way)
Traditional approach:
- Evaluate candidate models on data
- Pick the best one (by AIC, BIC, cross-validation, etc.)
- Make predictions using only the selected model
Problems:
- Ignores uncertainty about which model is correct
- Overconfident predictions
- Winner-take-all loses information from runner-up models
Model Averaging (the right way)
Bayesian approach:
- Assign prior probabilities to models
- Update to posterior probabilities given data
- Average predictions across all models, weighted by posterior
Benefits:
- Accounts for model uncertainty
- Better calibrated predictions
- More robust to model misspecification
- No need to “select” a model
The Bayesian Recipe
Prior Distribution
We start with a prior P(θ) over model parameters (or model structures):
- Expresses beliefs before seeing data
- Should encode reasonable inductive biases
- In CTW: implicit prior over tree structures (0.5 weight to “stop here” vs “go deeper”)
Likelihood
The likelihood P(data | θ) measures how well each model explains the observations:
P(x₁, x₂, ..., xₙ | θ) = ∏ᵢ P(xᵢ | x₁, ..., xᵢ₋₁, θ)
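The chain rule above can be checked numerically. Here is a minimal sketch assuming an i.i.d. Bernoulli model with parameter θ (the function name and test sequence are illustrative):

```python
import math

def bernoulli_likelihood(bits, theta):
    """Sequence likelihood as a product of one-step conditionals.

    For an i.i.d. Bernoulli model, P(x_i | x_1..x_{i-1}, theta) does not
    depend on the past, so the product collapses to
    theta^(#ones) * (1 - theta)^(#zeros).
    """
    p = 1.0
    for bit in bits:
        p *= theta if bit == 1 else (1.0 - theta)
    return p

bits = [1, 0, 1, 1, 0]
theta = 0.6
# Chain-rule product equals the closed form theta^b * (1-theta)^a:
assert math.isclose(bernoulli_likelihood(bits, theta), theta**3 * (1 - theta)**2)
```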
Posterior Distribution
After observing data D, we update via Bayes’ rule:
P(θ | D) = P(D | θ) × P(θ) / P(D)
The posterior concentrates on models that both:
- Had reasonable prior probability
- Explain the data well
Posterior Predictive Distribution
For prediction, we average over all models:
P(xₙ | x₁...xₙ₋₁) = ∫ P(xₙ | θ) × P(θ | x₁...xₙ₋₁) dθ
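For a Bernoulli likelihood with a Beta prior this integral has a closed form, but approximating it on a grid makes the averaging explicit: every value of θ votes, weighted by its posterior. A sketch, assuming a uniform Beta(1, 1) prior (the grid resolution and counts are illustrative choices):

```python
# Posterior predictive P(next bit = 1) by averaging P(1 | theta) = theta
# over the posterior on a grid of theta values, with a flat Beta(1,1) prior.
N = 10_000
thetas = [(i + 0.5) / N for i in range(N)]  # midpoint grid on (0, 1)

a, b = 3, 1  # observed: 3 zeros, 1 one
# Unnormalized posterior: likelihood * prior (the prior is flat here).
post = [t**b * (1 - t)**a for t in thetas]
Z = sum(post)

# Each "model" theta votes its prediction, weighted by its posterior mass.
p_next_one = sum(t * w for t, w in zip(thetas, post)) / Z

# Closed form for Beta(1,1): (b + 1) / (a + b + 2) = 2/6
print(p_next_one)  # ~0.3333
```

The closed form in the comment is the Laplace rule that appears below; the grid average recovers it to several decimal places.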
This is the Bayesian predictive distribution—each model votes, weighted by its posterior probability.
The KT Estimator: Bayesian Bernoulli Prediction
The Krichevsky-Trofimov (KT) estimator provides optimal Bayesian prediction for a sequence of coin flips.
Setup
We observe a sequence of bits with:
- a zeros observed
- b ones observed
Question: what’s P(next bit = 1)?
Maximum Likelihood Estimate (MLE)
The obvious answer:
P(1) = b / (a + b)
Problem: after a run of only zeros, the MLE assigns P(1) = 0. A single subsequent 1 then receives zero probability, which means infinite log loss (and, under arithmetic coding, an unbounded codeword).
Laplace Smoothing
Add-one smoothing (equivalent to Beta(1,1) = uniform prior):
P(1) = (b + 1) / (a + b + 2)
Better, but not optimal.
KT Estimator
Uses Beta(0.5, 0.5)—the Jeffreys prior:
P(1) = (b + 0.5) / (a + b + 1)
Why 0.5? The Jeffreys prior is:
- Uninformative in the invariance sense (derived from the Fisher information, invariant under reparameterization)
- Minimax optimal (best worst-case redundancy)
- Theoretically justified for sequential prediction
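The three estimators can be compared directly after an all-zeros run. A small sketch (the n = 0 convention for the MLE is an illustrative choice):

```python
def mle(a, b):
    # Undefined at n = 0; we return 0.5 by convention for illustration.
    return b / (a + b) if a + b else 0.5

def laplace(a, b):
    return (b + 1) / (a + b + 2)

def kt(a, b):
    return (b + 0.5) / (a + b + 1)

# After observing 10 zeros and no ones:
a, b = 10, 0
print(mle(a, b))      # 0.0  -- a subsequent 1 would incur infinite log loss
print(laplace(a, b))  # 1/12 ≈ 0.0833
print(kt(a, b))       # 0.5/11 ≈ 0.0455
```

KT hedges less aggressively than Laplace here: the Jeffreys prior concentrates mass near 0 and 1, so strong evidence for "mostly zeros" pulls the prediction down faster.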
Sequential Probability
The KT probability of a sequence with a zeros and b ones:
Pₑ(a, b) = Γ(a + 0.5) × Γ(b + 0.5) / (Γ(0.5)² × Γ(a + b + 1))
where Γ(0.5)² = π; this normalization gives Pₑ(0, 0) = 1.
This can be computed sequentially:
Pₑ(a+1, b) = Pₑ(a, b) × (a + 0.5) / (a + b + 1)
Pₑ(a, b+1) = Pₑ(a, b) × (b + 0.5) / (a + b + 1)
No need to restart from scratch—just multiply by a simple factor.
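The incremental updates can be carried in log space to avoid underflow, and checked against the closed form (math.lgamma computes log Γ; the normalizer Γ(0.5)² = π makes Pₑ(0, 0) = 1). A sketch:

```python
import math

class KT:
    """Sequential KT probability of the bits seen so far."""
    def __init__(self):
        self.a = 0          # zeros seen
        self.b = 0          # ones seen
        self.log_p = 0.0    # log P_e of the sequence so far

    def predict_one(self):
        return (self.b + 0.5) / (self.a + self.b + 1)

    def update(self, bit):
        p = self.predict_one() if bit == 1 else 1.0 - self.predict_one()
        self.log_p += math.log(p)
        if bit == 1:
            self.b += 1
        else:
            self.a += 1

def kt_closed_form(a, b):
    # log of Gamma(a+0.5) * Gamma(b+0.5) / (Gamma(0.5)^2 * Gamma(a+b+1))
    return (math.lgamma(a + 0.5) + math.lgamma(b + 0.5)
            - 2 * math.lgamma(0.5) - math.lgamma(a + b + 1))

est = KT()
for bit in [0, 0, 1, 0, 1, 1, 0]:
    est.update(bit)
assert math.isclose(est.log_p, kt_closed_form(est.a, est.b))
```

One multiply (one add in log space) per symbol, exactly as the recurrences promise.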
CTW’s Bayesian Mixture
Context Tree Weighting uses the KT estimator at each node, then performs Bayesian averaging over tree structures.
At each node with local KT estimate Pₑ and children with weighted estimates P_w(child_0) and P_w(child_1):
P_w = 0.5 × Pₑ + 0.5 × P_w(child_0) × P_w(child_1)
This recursion:
- Assigns 0.5 prior weight to “stop here” (use local KT estimate)
- Assigns 0.5 prior weight to “go deeper” (use children’s predictions)
- Implicitly averages over all possible tree structures
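The recursion can be sketched on a tiny tree of fixed KT counts. This is a simplified illustration (static counts rather than sequential updates through the tree), not a full CTW implementation:

```python
import math

def kt_prob(a, b):
    """KT probability of a sequence with a zeros and b ones."""
    return math.exp(math.lgamma(a + 0.5) + math.lgamma(b + 0.5)
                    - 2 * math.lgamma(0.5) - math.lgamma(a + b + 1))

def weighted_prob(node):
    """CTW mixture: P_w = 0.5 * P_e + 0.5 * P_w(child_0) * P_w(child_1)."""
    a, b, children = node
    p_e = kt_prob(a, b)
    if children is None:          # leaf at maximum depth: no mixing
        return p_e
    c0, c1 = children
    return 0.5 * p_e + 0.5 * weighted_prob(c0) * weighted_prob(c1)

# Depth-1 tree: the root saw (4 zeros, 4 ones), but conditioned on the
# previous bit the counts split cleanly: context 0 gave only zeros,
# context 1 gave only ones.
leaf0 = (4, 0, None)
leaf1 = (0, 4, None)
root = (4, 4, (leaf0, leaf1))

# The split explains the data far better than the root's i.i.d. model,
# so the mixture exceeds the root's own KT probability.
print(weighted_prob(root) > kt_prob(4, 4))  # True
```

When the data are genuinely i.i.d., the children's counts look like the root's and the 0.5 weight on "stop here" keeps the mixture close to the shallow model instead.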
Example: Depth-2 Tree
For a depth-2 CTW, the root’s P_w averages over five tree structures:
- Root as a leaf: use only the root’s KT estimate (prior weight 1/2)
- Root split, both children leaves (weight 1/8)
- Root split, one child split further, the other a leaf (weight 1/8, two such patterns)
- Full depth-2 tree (weight 1/8)
The weights come from a factor of 1/2 per “stop” or “split” decision (leaves at maximum depth cost nothing), and they sum to 1. The number of tree structures grows doubly exponentially with depth—N(d) = 1 + N(d−1)², giving 1, 2, 5, 26, 677, ...—yet all are implicitly weighted by the recursion!
Why Bayesian Works
Automatic Complexity Control
- Simple models (shallow trees) are preferred a priori via the 0.5 split
- But if data supports complexity, posterior shifts to deeper trees
- No need to hand-tune regularization parameters
Calibrated Uncertainty
- Predictions reflect actual epistemic uncertainty
- Important for decision-making, active learning, exploration
- Better than overconfident point estimates
Theoretical Guarantees
For tree sources, CTW achieves:
- Redundancy bounded by O(k log n) for k-leaf optimal tree
- Converges to true distribution at optimal rate
- Minimax optimal within its model class
Bayesian vs. Frequentist
| Aspect | Bayesian (CTW) | Frequentist (MLE) |
|---|---|---|
| Parameters | Distribution over values | Point estimate |
| Model selection | Average over models | Choose one |
| Small samples | Handles gracefully | Often overfits |
| Computation | Sometimes tractable | Usually simpler |
| Guarantees | Posterior consistency | Asymptotic normality |
For sequential prediction with limited data, the Bayesian approach dominates.
Practical Implications
When to Use Bayesian Model Averaging
- Limited data: Uncertainty matters more
- Multiple plausible models: No clear winner
- Calibration required: Need honest uncertainty
- Online learning: Sequential updates natural
When Simpler Approaches Suffice
- Massive data: Posterior concentrates anyway
- Single clearly best model: Averaging unnecessary
- Computation constrained: MLE is faster
- Interpretability: Single model easier to explain
The Hierarchy of Estimation
Full Bayesian (maintain full posterior)
↓
MAP (maximum a posteriori point estimate)
↓
MLE (maximum likelihood, flat prior)
↓
Heuristics (ad-hoc smoothing)
CTW operates at the top level—full Bayesian averaging over tree structures with KT estimation at leaves.
Key Takeaways
- Model averaging beats model selection when uncertainty matters
- The KT estimator provides optimal sequential Bernoulli prediction
- CTW combines KT estimation with Bayesian averaging over trees
- The 0.5/0.5 split implements an implicit Occam’s razor prior
- Bayesian methods excel with limited data and online learning
What’s Next
The next post examines the data generating processes that produce sequences. Understanding Markov processes and tree sources helps us see why CTW’s model class is both tractable and powerful.
Further Reading
- Gelman, A., et al. (2013). Bayesian Data Analysis.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Chapter 3.
- Rissanen, J. (1984). “Universal coding, information, prediction, and estimation.”
- Willems, F. M. J., et al. (1995). “The context-tree weighting method: basic properties.”