
CTW Experimental Results: Theory Meets Practice

Throughout this series, we’ve developed the theory of sequential prediction—from Solomonoff’s incomputable ideal to CTW’s practical Bayesian approach. Now we validate theory with experiments.

This post presents experimental results showing that CTW achieves Bayes-optimal accuracy whenever its depth matches or exceeds the true Markov order. The path to this finding, however, included a bug that initially suggested the opposite: a reminder that even simple experiments can mislead.

The Experimental Setup

Data Generating Process

We use sparse Markov processes where most contexts are uninformative:

3rd-order Markov process:
Context  P(next=1)  Description
───────  ─────────  ──────────────────
000      0.99       High certainty → 1
100      0.99       High certainty → 1
010      0.025      High certainty → 0
101      0.025      High certainty → 0
011      0.25       Moderate → 0
110      0.25       Moderate → 0
001      0.5        Uniform (random)
111      0.5        Uniform (random)

This process has:

  • 4 highly predictable contexts (~99% or ~97.5% certainty)
  • 2 moderately predictable contexts (~75% certainty)
  • 2 completely random contexts (50/50)

Theoretical Maximum Accuracy

An oracle predictor that knows the true transition probabilities achieves:

def theoretical_max(transition_probs, stationary_dist):
    """Predict the more likely outcome for each context."""
    return sum(
        stationary_dist[ctx] * max(p, 1-p)
        for ctx, p in transition_probs.items()
    )

For our process, this is approximately 95%—the ceiling for any predictor.
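
The stationary distribution is not written down explicitly above, but it can be estimated by simulating the process itself. Here is a minimal sketch; the estimate_stationary_dist helper is illustrative, not part of the repository, and it reuses theoretical_max from the snippet above.

import random

# Transition table of the 3rd-order sparse process defined above.
TRANSITION_PROBS = {
    "000": 0.99, "100": 0.99, "010": 0.025, "101": 0.025,
    "011": 0.25, "110": 0.25, "001": 0.5, "111": 0.5,
}

def estimate_stationary_dist(transition_probs, n_steps=200_000, seed=0):
    """Monte Carlo estimate of how often each 3-bit context occurs (illustrative helper)."""
    rng = random.Random(seed)
    ctx = "000"  # arbitrary start; burn-in is negligible over this many steps
    counts = {c: 0 for c in transition_probs}
    for _ in range(n_steps):
        counts[ctx] += 1
        bit = "1" if rng.random() < transition_probs[ctx] else "0"
        ctx = ctx[1:] + bit  # slide the 3-bit context window
    return {c: n / n_steps for c, n in counts.items()}

# Plug the estimate into theoretical_max (defined earlier in this post).
stationary = estimate_stationary_dist(TRANSITION_PROBS)
print(theoretical_max(TRANSITION_PROBS, stationary))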

Configuration

  • Training length: 50,000 bits
  • Test length: 10,000 bits
  • Seeds: 5 random seeds (averaged)
  • DGP orders: 1, 2, 3, 4, 5, 6
  • CTW depths: 1, 2, 3, 4, 5, 6, 8, 10
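
For each (DGP order, CTW depth) pair, one run trains on the first 50,000 bits and then measures one-step-ahead accuracy on the next 10,000, averaged over the 5 seeds. A rough sketch of that evaluation loop, assuming a CTW model object with update and predict_one methods (hypothetical names, not the repository's actual API):

def evaluate_run(model, bits, train_length=50_000, test_length=10_000):
    """Train sequentially on a prefix, then score next-bit accuracy on held-out bits.
    `model` is assumed to expose update(bit) and predict_one(); these names are illustrative."""
    train = bits[:train_length]
    test = bits[train_length:train_length + test_length]
    for bit in train:
        model.update(bit)                   # sequential Bayesian update of the tree
    correct = 0
    for bit in test:
        pred = 1 if model.predict_one() >= 0.5 else 0   # predict the likelier bit
        correct += int(pred == bit)
        model.update(bit)                   # keep adapting during the test pass
    return correct / len(test)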

The Bug: A Cautionary Tale

Initial (Incorrect) Results

An early draft of this work reported surprising findings:

Max Depth  Accuracy (%)
─────────  ────────────
1          87.2
2          96.8 (peak)
3          93.8
4          89.4
8          76.2
12         66.4

This suggested that accuracy decreased when CTW depth exceeded the true order. The draft hypothesized:

  • Data sparsity at deeper contexts degraded KT estimates
  • Overfitting to rare deep patterns
  • Bayesian dilution as mixture flattened

The Problem

These results were wrong. A bug in the evaluation code caused accuracy to degrade artificially at higher depths.

The Fix

After careful debugging:

  1. Corrected the context indexing in the test harness
  2. Fixed how predictions were accumulated across depths
  3. Verified KT estimator updates were correctly propagated

Corrected Results

With the bug fixed, the true behavior emerged:

DGP Order  Theoretical Max  D=1  D=2  D=3  D=4  D=5  D=6
─────────  ───────────────  ───  ───  ───  ───  ───  ───
1          ~95%             ✓    ✓    ✓    ✓    ✓    ✓
2          ~95%             ✗    ✓    ✓    ✓    ✓    ✓
3          ~95%             ✗    ✗    ✓    ✓    ✓    ✓
4          ~95%             ✗    ✗    ✗    ✓    ✓    ✓
5          ~95%             ✗    ✗    ✗    ✗    ✓    ✓
6          ~95%             ✗    ✗    ✗    ✗    ✗    ✓

✓ = achieves theoretical maximum   ✗ = below theoretical maximum

Key Finding: No Overfitting Penalty

When CTW depth ≥ DGP order, CTW achieves Bayes-optimal accuracy.

The corrected experiments confirm:

  1. Depth < Order: Accuracy limited because CTW can’t represent the true model
  2. Depth = Order: Achieves theoretical maximum
  3. Depth > Order: Also achieves theoretical maximum (no degradation!)

The third point is crucial: increasing depth beyond the true order does not hurt performance.

Why This Makes Sense

CTW’s Bayesian model averaging naturally handles excessive depth:

The Prior Penalizes Depth

At each node, CTW assigns:

  • 0.5 weight to “stop here” (use local KT estimate)
  • 0.5 weight to “go deeper” (use children)

This compounds: a depth-D path has prior weight ~0.5^D. Deep trees are automatically downweighted.
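
Concretely, the mixture at each node is a recursion over the two hypotheses, "stop" and "split". Here is a minimal, simplified sketch of that computation over an explicit context tree; it is not the repository's incremental implementation, which updates these quantities one symbol at a time.

from math import lgamma, log, exp

def kt_log_prob(zeros, ones):
    """Log KT probability of a sequence containing the given counts of 0s and 1s."""
    return (lgamma(zeros + 0.5) + lgamma(ones + 0.5)
            - 2 * lgamma(0.5) - lgamma(zeros + ones + 1.0))

def weighted_log_prob(node, depth):
    """CTW mixture at one node: 0.5 * 'stop here' + 0.5 * 'defer to the children'."""
    local = kt_log_prob(node["zeros"], node["ones"])
    if depth == 0 or "children" not in node:
        return local
    deeper = sum(weighted_log_prob(child, depth - 1) for child in node["children"])
    m = max(local, deeper)
    return m + log(0.5 * exp(local - m) + 0.5 * exp(deeper - m))  # log(0.5*e^a + 0.5*e^b)

# Tiny example: a root that saw 3 zeros and 5 ones, split by the previous bit.
tree = {"zeros": 3, "ones": 5, "children": [
    {"zeros": 2, "ones": 1},   # contexts ending in 0
    {"zeros": 1, "ones": 4},   # contexts ending in 1
]}
print(weighted_log_prob(tree, depth=1))

Each extra level of splitting costs one more factor of 0.5, which is exactly the ~0.5^D prior penalty described above.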

Sparse Data at Depth is Handled

When deep contexts have few samples:

  • KT estimator gives high uncertainty
  • Bayesian mixture shifts weight toward shallower estimates
  • Predictions remain calibrated
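
To see why sparse contexts stay cautious, recall the KT predictive rule: with a zeros and b ones observed in a context, the estimated probability of a 1 is (b + 1/2)/(a + b + 1), which stays near 1/2 until the context has real evidence. A two-line sketch:

def kt_predict_one(zeros, ones):
    """KT predictive probability that the next bit is 1, given counts so far."""
    return (ones + 0.5) / (zeros + ones + 1.0)

print(kt_predict_one(0, 1))    # 0.75 after a single observed 1: still uncertain
print(kt_predict_one(0, 100))  # ~0.995 once the context has real support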

Model Averaging is Robust

Even with depth set too high:

  • The true model is in the mixture
  • Posterior concentrates on the right tree structure
  • Excess complexity is averaged away, not memorized

Practical Implications

Depth Selection

  • When true order is known: Set depth = order (or slightly higher)
  • When unknown: Set depth conservatively high
  • No penalty for overestimating: Better to err toward deeper

Data Requirements

Higher depths require more data to populate deep contexts, but CTW handles sparse contexts gracefully. Rule of thumb: use roughly 10× as many training samples as there are contexts at the target depth.
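
For binary data there are 2^D contexts at depth D, so the rule of thumb works out as below. This is a back-of-the-envelope helper, not part of the experiment code.

def suggested_train_length(depth, samples_per_context=10):
    """Rule-of-thumb training length: ~10 samples per context at the target depth."""
    return samples_per_context * 2 ** depth

print(suggested_train_length(3))    # 80 bits for a depth-3 tree
print(suggested_train_length(10))   # 10,240 bits; the 50,000-bit runs above are comfortable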

Confidence in Theory

The experiments validate CTW’s theoretical properties:

  • Bayesian model averaging works as advertised
  • The 0.5/0.5 prior provides appropriate regularization
  • Optimality for tree sources holds empirically

Lessons Learned

1. Trust but Verify

The initial “surprising” result seemed plausible—deep models overfitting is a common story. But:

  • It contradicted well-established theory
  • The magnitude of degradation was suspicious
  • Independent implementation checks were skipped

2. Theory as a Guide

When experiments contradict established theory, the experiment is often wrong. Theory provides expectations; deviations deserve scrutiny.

3. Reproducibility Matters

The bug was found when trying to reproduce results with a different implementation. If we’d only run the original code, the error might have persisted.

4. Simple Experiments Can Fail

This wasn’t a complex distributed system. It was a few hundred lines of Python evaluating a well-understood algorithm. Bugs hide everywhere.

Running the Experiments

To reproduce:

from experiments.ctw_multiorder import run_multiorder_experiment

results = run_multiorder_experiment(
    dgp_orders=[1, 2, 3, 4, 5, 6],
    ctw_depths=[1, 2, 3, 4, 5, 6, 8, 10],
    markov_type="sparse",
    train_length=50000,
    test_length=10000,
    num_seeds=5,
)

Or via command line:

# Quick sanity check
python -m experiments.ctw.scripts.ctw_multiorder --quick

# Full experiment
python -m experiments.ctw.scripts.ctw_multiorder

Summary of Results

Claim                                            Status
─────────────────────────────────────────────    ───────────────────────
CTW achieves theoretical max when depth ≥ order  Confirmed
CTW degrades with excessive depth                Refuted (was a bug)
Bayesian model averaging provides robustness     Confirmed
CTW is optimal for tree sources                  Confirmed (empirically)

Series Conclusion

This series traced sequential prediction from Solomonoff’s incomputable ideal to practical algorithms:

  1. The problem: Predict the next symbol given history
  2. The ideal: Solomonoff induction—optimal but incomputable
  3. The framework: Bayesian model averaging over hypotheses
  4. The model class: Markov processes and tree sources
  5. The classical approach: N-grams and their limitations
  6. The trade-offs: Sample efficiency vs. flexibility
  7. The modern approach: Neural language models at scale
  8. The validation: CTW achieves its theoretical guarantees

CTW represents a sweet spot: provably optimal within a rich model class, computationally efficient, and interpretable. For the right problems—binary sequences, limited data, need for guarantees—it remains the right choice even in the age of LLMs.

Further Reading

  • Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. J. (1995). “The context-tree weighting method: basic properties.” IEEE Transactions on Information Theory, 41(3), 653–664.
  • The code repository contains the full CTW implementation and experimental scripts.
