Throughout this series, we’ve developed the theory of sequential prediction—from Solomonoff’s incomputable ideal to CTW’s practical Bayesian approach. Now we validate theory with experiments.
This post presents experimental results showing that CTW achieves Bayes-optimal accuracy whenever its depth meets or exceeds the true Markov order. But the path to this finding included a bug that initially suggested the opposite: a reminder that even simple experiments can mislead.
The Experimental Setup
Data Generating Process
We use sparse Markov processes where most contexts are uninformative:
3rd-order Markov process:
| Context | P(next=1) | Description |
|---|---|---|
| 000 | 0.99 | High certainty → 1 |
| 100 | 0.99 | High certainty → 1 |
| 010 | 0.025 | High certainty → 0 |
| 101 | 0.025 | High certainty → 0 |
| 011 | 0.25 | Moderate → 0 |
| 110 | 0.25 | Moderate → 0 |
| 001 | 0.5 | Uniform (random) |
| 111 | 0.5 | Uniform (random) |
This process has:
- 4 highly predictable contexts (~99% or ~97.5% certainty)
- 2 moderately predictable contexts (~75% certainty)
- 2 completely random contexts (50/50)
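For concreteness, here is a minimal sampler for this process (an illustrative sketch, not the repository's generator; the function name and the hard-coded table constant are my own):

```python
import random

# Transition table from above: P(next bit = 1 | previous three bits).
SPARSE_ORDER3 = {
    "000": 0.99, "100": 0.99,    # high certainty -> 1
    "010": 0.025, "101": 0.025,  # high certainty -> 0
    "011": 0.25, "110": 0.25,    # moderate -> 0
    "001": 0.5, "111": 0.5,      # uniform (random)
}

def sample_sequence(length, probs=SPARSE_ORDER3, seed=0):
    """Sample a binary sequence from the 3rd-order sparse Markov process."""
    rng = random.Random(seed)
    context = "000"  # arbitrary initial context
    bits = []
    for _ in range(length):
        bit = 1 if rng.random() < probs[context] else 0
        bits.append(bit)
        context = context[1:] + str(bit)  # slide the 3-bit window
    return bits
```

Calling `sample_sequence(50000)` yields a training sequence of the length used in the configuration below.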
Theoretical Maximum Accuracy
An oracle predictor that knows the true transition probabilities achieves:
```python
def theoretical_max(transition_probs, stationary_dist):
    """Predict the more likely outcome for each context."""
    return sum(
        stationary_dist[ctx] * max(p, 1 - p)
        for ctx, p in transition_probs.items()
    )
```
For our process, this is approximately 95%—the ceiling for any predictor.
Configuration
- Training length: 50,000 bits
- Test length: 10,000 bits
- Seeds: 5 random seeds (averaged)
- DGP orders: 1, 2, 3, 4, 5, 6
- CTW depths: 1, 2, 3, 4, 5, 6, 8, 10
The Bug: A Cautionary Tale
Initial (Incorrect) Results
An early draft of this work reported surprising findings:
| Max Depth | Accuracy (%) |
|---|---|
| 1 | 87.2 |
| 2 | 96.8 (peak) |
| 3 | 93.8 |
| 4 | 89.4 |
| 8 | 76.2 |
| 12 | 66.4 |
This suggested that accuracy decreased when CTW depth exceeded the true order. The draft hypothesized:
- Data sparsity at deeper contexts degraded KT estimates
- Overfitting to rare deep patterns
- Bayesian dilution as mixture flattened
The Problem
These results were wrong. A bug in the evaluation code caused accuracy to degrade artificially at higher depths.
The Fix
After careful debugging:
- Corrected the context indexing in the test harness
- Fixed how predictions were accumulated across depths
- Verified KT estimator updates were correctly propagated
Corrected Results
With the bug fixed, the true behavior emerged:
| DGP Order | Theoretical Max | D=1 | D=2 | D=3 | D=4 | D=5 | D=6 |
|---|---|---|---|---|---|---|---|
| 1 | ~95% | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 2 | ~95% | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 3 | ~95% | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| 4 | ~95% | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ |
| 5 | ~95% | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| 6 | ~95% | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
✓ = Achieves theoretical maximum ✗ = Below theoretical maximum
Key Finding: No Overfitting Penalty
When CTW depth ≥ DGP order, CTW achieves Bayes-optimal accuracy.
The corrected experiments confirm:
- Depth < Order: Accuracy limited because CTW can’t represent the true model
- Depth = Order: Achieves theoretical maximum
- Depth > Order: Also achieves theoretical maximum (no degradation!)
The third point is crucial: increasing depth beyond the true order does not hurt performance.
Why This Makes Sense
CTW’s Bayesian model averaging naturally handles excessive depth:
The Prior Penalizes Depth
At each node, CTW assigns:
- 0.5 weight to “stop here” (use local KT estimate)
- 0.5 weight to “go deeper” (use children)
This compounds: a depth-D path has prior weight ~0.5^D. Deep trees are automatically downweighted.
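The halving is easiest to see in the weighted-probability recursion itself. Here is a minimal sketch (illustrative only, not the repository's implementation; the `Node` class and `kt_prob` helper are assumptions made for this example):

```python
class Node:
    """One context-tree node: bit counts for its context plus optional children."""
    def __init__(self):
        self.counts = [0, 0]           # zeros and ones observed in this context
        self.children = [None, None]   # subtrees extending the context by 0 or 1

def kt_prob(counts):
    """KT block probability of the bits observed at a node."""
    n0, n1 = counts
    prob, a, b = 1.0, 0, 0
    # Multiply the sequential KT predictions (count + 1/2) / (total + 1).
    for _ in range(n0):
        prob *= (a + 0.5) / (a + b + 1)
        a += 1
    for _ in range(n1):
        prob *= (b + 0.5) / (a + b + 1)
        b += 1
    return prob

def weighted_prob(node):
    """CTW mixture: half 'stop here' (local KT), half 'go deeper' (children)."""
    if node is None:
        return 1.0
    local = kt_prob(node.counts)
    if node.children == [None, None]:  # leaf (e.g. at max depth): pure KT
        return local
    deeper = weighted_prob(node.children[0]) * weighted_prob(node.children[1])
    return 0.5 * local + 0.5 * deeper
```

Each additional level of splitting costs another factor of 0.5, so a deeper subtree only dominates the mixture once its evidence more than compensates for that prior cost.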
Sparse Data at Depth is Handled
When deep contexts have few samples:
- KT estimator gives high uncertainty
- Bayesian mixture shifts weight toward shallower estimates
- Predictions remain calibrated
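For intuition (a toy calculation rather than repository code), the KT predictive probability after observing n0 zeros and n1 ones is (n1 + 1/2) / (n0 + n1 + 1), so a sparsely observed deep context stays much closer to 1/2 than its raw empirical frequency:

```python
def kt_predict_one(n0, n1):
    """KT predictive probability that the next bit is 1, given counts (n0, n1)."""
    return (n1 + 0.5) / (n0 + n1 + 1)

print(kt_predict_one(0, 1))    # 0.75  -- one observed '1', far from a 1.0 frequency estimate
print(kt_predict_one(0, 3))    # 0.875 -- still hedged after three bits
print(kt_predict_one(5, 495))  # ~0.989 -- confident only with plenty of data
```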
Model Averaging is Robust
Even with depth set too high:
- The true model is in the mixture
- Posterior concentrates on the right tree structure
- Excess complexity is averaged away, not memorized
Practical Implications
Depth Selection
- When true order is known: Set depth = order (or slightly higher)
- When unknown: Set depth conservatively high
- No penalty for overestimating: better to err on the side of extra depth
Data Requirements
Higher depths require more data to populate deep contexts, but CTW handles sparse contexts gracefully. Rule of thumb: you need roughly 10× as many samples as there are contexts at the target depth.
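Applied to binary contexts, the rule of thumb works out as follows (a back-of-the-envelope sketch using the ~10× factor above):

```python
# Rough sample-size guide from the ~10x-per-context rule of thumb.
for depth in (3, 6, 8, 10):
    contexts = 2 ** depth  # number of binary contexts at this depth
    print(f"depth {depth}: {contexts} contexts -> ~{10 * contexts} samples")
```

Even depth 10 implies only about 10,000 samples, comfortably below the 50,000-bit training sequences used here.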
Confidence in Theory
The experiments validate CTW’s theoretical properties:
- Bayesian model averaging works as advertised
- The 0.5/0.5 prior provides appropriate regularization
- Optimality for tree sources holds empirically
Lessons Learned
1. Trust but Verify
The initial “surprising” result seemed plausible—deep models overfitting is a common story. But:
- It contradicted well-established theory
- The magnitude of degradation was suspicious
- Independent implementation checks were skipped
2. Theory as a Guide
When experiments contradict established theory, the experiment is often wrong. Theory provides expectations; deviations deserve scrutiny.
3. Reproducibility Matters
The bug was found when trying to reproduce results with a different implementation. If we’d only run the original code, the error might have persisted.
4. Simple Experiments Can Fail
This wasn’t a complex distributed system. It was a few hundred lines of Python evaluating a well-understood algorithm. Bugs hide everywhere.
Running the Experiments
To reproduce:
```python
from experiments.ctw_multiorder import run_multiorder_experiment

results = run_multiorder_experiment(
    dgp_orders=[1, 2, 3, 4, 5, 6],
    ctw_depths=[1, 2, 3, 4, 5, 6, 8, 10],
    markov_type="sparse",
    train_length=50000,
    test_length=10000,
    num_seeds=5,
)
```
Or via command line:
```bash
# Quick sanity check
python -m experiments.ctw.scripts.ctw_multiorder --quick

# Full experiment
python -m experiments.ctw.scripts.ctw_multiorder
```
Summary of Results
| Claim | Status |
|---|---|
| CTW achieves theoretical max when depth ≥ order | Confirmed |
| CTW degrades with excessive depth | Refuted (was a bug) |
| Bayesian model averaging provides robustness | Confirmed |
| CTW is optimal for tree sources | Confirmed (empirically) |
Series Conclusion
This series traced sequential prediction from Solomonoff’s incomputable ideal to practical algorithms:
- The problem: Predict the next symbol given history
- The ideal: Solomonoff induction—optimal but incomputable
- The framework: Bayesian model averaging over hypotheses
- The model class: Markov processes and tree sources
- The classical approach: N-grams and their limitations
- The trade-offs: Sample efficiency vs. flexibility
- The modern approach: Neural language models at scale
- The validation: CTW achieves its theoretical guarantees
CTW represents a sweet spot: provably optimal within a rich model class, computationally efficient, and interpretable. For the right problems—binary sequences, limited data, need for guarantees—it remains the right choice even in the age of LLMs.
Further Reading
- Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. J. (1995). “The context-tree weighting method: basic properties.” *IEEE Transactions on Information Theory*, 41(3), 653–664.
- The code repository contains the full CTW implementation and experimental scripts.