We’ve now covered the theoretical foundations (Solomonoff, Bayesian prediction), the model classes (Markov processes, tree sources), and classical approaches (n-grams). In this post, we step back to compare the major approaches to sequential prediction head-to-head.
The goal isn’t to declare a winner—each method excels in different regimes. Understanding when to use each is the real insight.
Summary Comparison
| Aspect | CTW | N-gram | Neural LM |
|---|---|---|---|
| Sample efficiency | Excellent | Good | Poor |
| Scalability | Limited | Good | Excellent |
| Inductive bias | Tree structures | Local context | Learned |
| Interpretability | High | High | Low |
| Theoretical guarantees | Strong | Moderate | Weak |
| Compute requirements | Minimal | Low | Massive |
| Context modeling | Adaptive depth | Fixed length | Flexible |
| Vocabulary handling | Binary optimal | Any size | Any size |
Sample Efficiency
How much data does each method need?
CTW: Excellent
CTW learns from hundreds to thousands of examples. Why?
- Strong inductive bias: Tree structures are a rich but constrained class
- Bayesian prior: Automatically regularizes against overfitting
- Efficient parameter estimation: the Krichevsky-Trofimov (KT) estimator is statistically optimal (see the sketch below)
For a binary 3rd-order Markov process (2³ = 8 contexts), CTW achieves near-optimal prediction accuracy from roughly 1,000 samples.
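To make the KT estimator concrete, here is a minimal sketch (the function name and example counts are mine, not from any particular library): after seeing `a` zeros and `b` ones in a context, KT assigns the next 1 probability (b + 1/2) / (a + b + 1).

```python
def kt_probability(zeros: int, ones: int, next_bit: int) -> float:
    """Krichevsky-Trofimov estimate of P(next_bit) given the counts seen so far
    in one context: each symbol gets a "virtual" half-count, which keeps the
    estimate well-behaved even when the counts are tiny."""
    total = zeros + ones + 1.0
    return (ones + 0.5) / total if next_bit == 1 else (zeros + 0.5) / total

# After 7 zeros and 3 ones in some context, the KT estimate for the next bit being 1:
print(kt_probability(7, 3, 1))  # 0.318... = (3 + 0.5) / 11
```

CTW runs one such estimator at every node of the context tree and mixes their predictions with Bayesian weights, which is where the sample efficiency comes from.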
N-grams: Good
Moderate data requirements:
- Count statistics converge reasonably fast
- Smoothing handles sparse data (a minimal example follows this list)
- Higher orders need proportionally more data
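As a concrete illustration of the smoothing point above, here is a minimal bigram model with add-one (Laplace) smoothing; the toy corpus and function names are mine:

```python
from collections import Counter

def train_bigram_counts(tokens):
    """Collect unigram (context) and bigram counts from a token sequence."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def smoothed_prob(word, context, unigrams, bigrams, vocab_size):
    """Add-one (Laplace) smoothed P(word | context): unseen pairs get a small
    nonzero probability instead of zero."""
    return (bigrams[(context, word)] + 1) / (unigrams[context] + vocab_size)

tokens = "the cat sat on the mat the cat slept".split()
unigrams, bigrams = train_bigram_counts(tokens)
V = len(unigrams)
print(smoothed_prob("cat", "the", unigrams, bigrams, V))  # (2 + 1) / (3 + 6) = 0.333...
print(smoothed_prob("dog", "the", unigrams, bigrams, V))  # unseen, but still > 0
```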
Neural LMs: Poor
Neural models are notoriously data-hungry:
- Modern LLMs train on hundreds of billions of tokens
- Must learn all structure from data
- Few built-in inductive biases for sequence structure beyond what the architecture provides
Transfer learning (pretraining + fine-tuning) helps but doesn’t eliminate the fundamental requirement.
Scalability
How do the methods handle large problems?
CTW: Limited
CTW faces inherent scaling challenges:
- 2^D possible contexts at depth D (exponential growth; see the back-of-envelope sketch after this list)
- Natively binary; extensions to larger alphabets exist but are less elegant
- Memory grows with sequence diversity
Works well for binary sequences up to moderate depth (~10-15).
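A quick back-of-envelope calculation illustrates the exponential growth noted above. The 32-bytes-per-node figure is an assumption for illustration, and real implementations allocate only the contexts that actually occur, so this is a worst case:

```python
def ctw_tree_worst_case(depth: int, bytes_per_node: int = 32):
    """Worst-case size of a full binary context tree of the given depth:
    2**(depth + 1) - 1 nodes, each holding counts and a mixture weight."""
    nodes = 2 ** (depth + 1) - 1
    return nodes, nodes * bytes_per_node / 1e6  # node count, megabytes

for d in (10, 15, 20, 30):
    nodes, mb = ctw_tree_worst_case(d)
    print(f"depth {d:>2}: {nodes:>13,} nodes, ~{mb:,.1f} MB worst case")
```

Up to depth 15 the worst case is a few megabytes; by depth 30 it is tens of gigabytes, which is why depth is the binding constraint.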
N-grams: Good
Reasonable scaling properties:
- O(|V|^N) worst-case space for vocabulary V and order N; in practice only observed n-grams are stored
- Efficient implementations use tries and hashing (see the sketch after this list)
- Practical up to ~5-gram for large vocabularies
- Web-scale n-gram models exist (e.g., Google’s Web 1T 5-gram corpus)
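A minimal sketch of the trie idea from the list above (the class and toy corpus are mine): storing n-grams as paths in a trie means prefixes are shared, so lower-order counts come for free.

```python
class NgramTrie:
    """Store n-gram counts as paths in a trie; shared prefixes are stored once."""

    def __init__(self):
        self.children = {}  # token -> NgramTrie
        self.count = 0      # times the path from the root to this node was seen

    def add(self, ngram):
        node = self
        for token in ngram:
            node = node.children.setdefault(token, NgramTrie())
            node.count += 1  # every prefix of the n-gram is counted as well

    def count_of(self, ngram):
        node = self
        for token in ngram:
            if token not in node.children:
                return 0
            node = node.children[token]
        return node.count

trie = NgramTrie()
tokens = "the cat sat on the mat the cat slept".split()
for i in range(len(tokens) - 2):
    trie.add(tokens[i:i + 3])          # insert every trigram

print(trie.count_of(["the"]))          # 3 trigrams start with "the"
print(trie.count_of(["the", "cat"]))   # 2 of them continue with "cat"
```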
Neural LMs: Excellent
Designed for scale:
- Billions of parameters
- GPU/TPU parallelization
- Handle long contexts (though attention cost grows as O(n²) in sequence length; see the sketch below)
- Vocabulary of 50,000+ tokens
The entire field has optimized for scaling neural models.
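The O(n²) point above is easy to see in code. Here is a minimal single-head self-attention sketch with NumPy; random matrices stand in for learned weights, so this is an illustration, not a full transformer layer:

```python
import numpy as np

def self_attention(x, seed=0):
    """Single-head self-attention over n vectors of width d. The n x n score
    matrix is where the quadratic cost in sequence length comes from."""
    n, d = x.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)                   # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # shape (n, d)

x = np.random.default_rng(1).normal(size=(128, 16))
print(self_attention(x).shape)  # (128, 16), via an intermediate 128 x 128 score matrix
```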
Inductive Biases
What assumptions does each method make?
CTW: Tree Structure Bias
CTW assumes the source is well-modeled by a tree:
- Perfect for Markov processes
- May miss patterns outside tree structure
- Implicit Occam’s razor (prefers simpler trees)
This is both a strength (sample efficiency) and limitation (can’t capture non-tree patterns).
N-grams: Fixed Context Length
N-grams assume all contexts of the same length are equally relevant:
- Works when assumption holds
- Can’t adapt context length to data
- Smoothing adds bias toward shorter contexts
Neural LMs: Learned Biases
Neural models learn their biases from data:
- Attention patterns emerge from training
- Positional encodings provide some structure
- Very flexible, but the biases must be learned from data
The transformer architecture encodes some bias (attention, residual connections) but much less than CTW or n-grams.
Theoretical Guarantees
What can we prove about each method?
CTW: Strong Guarantees
For tree sources, CTW is provably optimal:
- Redundancy bounded by O(k log n) for a k-leaf tree source (spelled out below)
- Converges to true distribution at minimax optimal rate
- Acts as a Bayesian mixture over all bounded-depth tree sources, analogous to Solomonoff induction restricted to that class
These aren’t just asymptotic results—they hold with finite data.
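Spelled out, the redundancy bound above takes roughly the following form (the exact constants vary across presentations; this is the shape of the standard CTW bound, not a precise statement):

$$
-\log_2 P_{\mathrm{CTW}}(x_{1:n}) \;\le\; -\log_2 P_{S,\theta}(x_{1:n}) \;+\; \frac{k}{2}\log_2\frac{n}{k} \;+\; O(k),
$$

where $P_{S,\theta}$ is any tree source with $k$ leaves and parameters $\theta$. In words: CTW's code length exceeds that of the best tree source by only about $\tfrac{k}{2}\log_2 n$ bits.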
N-grams: Moderate Guarantees
Some theory exists:
- MLE is consistent (converges to true parameters)
- Cross-entropy bounds with smoothing
- No optimality guarantees for specific source classes
The guarantees are weaker than CTW’s.
Neural LMs: Weak Guarantees
Theory lags practice:
- Universal approximation (in principle, with enough capacity)
- No sample complexity bounds in realistic settings
- Generalization is mostly empirical
We use neural LMs because they work, not because we can prove they should.
Interpretability
Can we understand what the model learned?
CTW: High
The context tree is directly interpretable:
- Which contexts matter? Those with different statistics
- How confident? Counts at each node
- Model averaging weights show depth preferences
This transparency aids debugging and building trust.
N-grams: High
Direct access to probability tables:
- Which n-grams are common?
- What’s P(word | context)?
- Clear model of what’s being captured
Neural LMs: Low
Black boxes:
- Hidden representations hard to interpret
- Attention patterns provide some insight
- Active research area (mechanistic interpretability)
We’re improving, but neural LMs remain opaque.
When to Use Each
Use CTW When:
- Data is limited (< 100K samples)
- Binary or small alphabet sequences
- Strong guarantees needed (safety-critical applications)
- Interpretability required
- Compute is constrained
- Source is likely tree-structured (Markov processes)
Examples: Binary file compression, network protocol analysis, genomic sequences, embedded systems.
Use N-grams When:
- Baseline model needed (comparison point)
- Fast training/inference required
- Moderate data available
- Simple deployment
- Interpretability desired
Examples: Quick prototyping, spell checking, simple autocomplete, compression (PPM).
Use Neural LMs When:
- Massive training data available
- State-of-the-art performance needed
- Complex patterns expected (natural language, code)
- Long-range dependencies important
- Can afford compute costs
Examples: Chat assistants, machine translation, code generation, content creation.
The Bias-Data Trade-off
The fundamental relationship:
- Strong bias (CTW) + little data → good performance
- Weak bias (neural) + little data → overfitting
- Weak bias (neural) + massive data → great performance
- Strong bias (CTW) + massive data → performance limited by the model class
CTW trades flexibility for sample efficiency; neural LMs make the opposite trade.
The Compression-Prediction Duality
All these models can be viewed as compressors:
- CTW: Near-optimal compression for tree sources
- N-grams: Underlie PPM compression family
- Neural LMs: Achieve state-of-the-art compression rates
Better prediction = better compression = lower cross-entropy. This equivalence is more than an analogy: by Shannon’s source coding theorem, a symbol assigned probability p costs an optimal code about −log₂ p bits, so minimizing code length and minimizing cross-entropy are the same objective.
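A minimal sketch of the duality (assuming an ideal arithmetic coder; the probabilities are made up for illustration): a model that assigned probability p to the symbol that actually occurred pays about −log₂ p bits for it, so total code length is just the model's cumulative log loss.

```python
import math

def ideal_code_length_bits(probs_of_observed_symbols):
    """Total bits an ideal coder needs when each observed symbol was assigned
    the given model probability; a real arithmetic coder gets within a few bits."""
    return sum(-math.log2(p) for p in probs_of_observed_symbols)

vague_model = [0.5, 0.5, 0.5, 0.5]    # 1 bit per symbol
sharp_model = [0.9, 0.8, 0.9, 0.95]   # confident and correct

print(ideal_code_length_bits(vague_model))  # 4.0 bits
print(ideal_code_length_bits(sharp_model))  # ~0.70 bits
```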
Future Directions
The boundaries between approaches are blurring:
- Hybrid models: Combine CTW’s efficiency with neural flexibility
- Bayesian neural networks: Add uncertainty to neural predictions
- Neural architecture search: Learn inductive biases
- Continual learning: Adapt online like CTW
- Theoretical understanding: Better bounds for neural models
Key Takeaways
- No universally best approach—the right choice depends on context
- CTW wins on sample efficiency and guarantees
- Neural LMs win on scalability and flexibility
- N-grams remain useful as baselines and for interpretability
- Understanding trade-offs is more valuable than declaring winners
What’s Next
The next post dives deeper into neural language models—from RNNs to transformers to modern LLMs. We’ll see how the architecture evolved to address sequential prediction at scale.