
Comparing Prediction Methods: CTW vs. N-grams vs. Neural LMs

We’ve now covered the theoretical foundations (Solomonoff, Bayesian prediction), the model classes (Markov processes, tree sources), and classical approaches (n-grams). In this post, we step back to compare the major approaches to sequential prediction head-to-head.

The goal isn’t to declare a winner—each method excels in different regimes. Understanding when to use each is the real insight.

Summary Comparison

| Aspect | CTW | N-gram | Neural LM |
| --- | --- | --- | --- |
| Sample efficiency | Excellent | Good | Poor |
| Scalability | Limited | Good | Excellent |
| Inductive bias | Tree structures | Local context | Learned |
| Interpretability | High | High | Low |
| Theoretical guarantees | Strong | Moderate | Weak |
| Compute requirements | Minimal | Low | Massive |
| Context modeling | Adaptive depth | Fixed length | Flexible |
| Vocabulary handling | Binary optimal | Any size | Any size |

Sample Efficiency

How much data does each method need?

CTW: Excellent

CTW makes accurate predictions after only hundreds to thousands of symbols. Why?

  • Strong inductive bias: Tree structures are a rich but constrained class
  • Bayesian prior: Automatically regularizes against overfitting
  • Efficient parameter estimation: the KT estimator is statistically optimal (sketched below)

For a 3rd-order Markov process with 8 contexts, CTW achieves near-optimal accuracy from ~1000 samples.
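
To make this concrete, here is a minimal sketch of the KT estimator itself (illustrative Python; the function name is mine, and this shows the estimator in isolation rather than a full CTW implementation). After seeing a zeros and b ones in a context, it predicts a one with probability (b + 1/2) / (a + b + 1).

```python
def kt_probability(zeros: int, ones: int, next_bit: int) -> float:
    """Krichevsky-Trofimov estimate of P(next_bit) given the counts seen
    so far in one context. Each symbol gets a 'virtual' half-count."""
    total = zeros + ones
    if next_bit == 1:
        return (ones + 0.5) / (total + 1.0)
    return (zeros + 0.5) / (total + 1.0)


# After 3 zeros and 7 ones in a context, P(next bit = 1) = 7.5 / 11 ≈ 0.68.
print(kt_probability(zeros=3, ones=7, next_bit=1))
```

The half-counts correspond to a symmetric Beta(1/2, 1/2) prior, which is what keeps the estimate sensible when a context has only been seen a handful of times.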

N-grams: Good

Moderate data requirements:

  • Count statistics converge reasonably fast
  • Smoothing handles sparse data (see the sketch after this list)
  • Higher orders need proportionally more data
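
As a minimal sketch of what "count statistics plus smoothing" looks like in code (the function name and the add-k scheme are illustrative choices, not a production recipe):

```python
from collections import Counter

def ngram_probability(corpus, context, word, k=0.5):
    """Add-k smoothed estimate of P(word | context) from raw counts."""
    n = len(context) + 1
    ngrams = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
    prefixes = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 2))
    vocab = len(set(corpus))  # smoothing spreads k pseudo-counts over the vocabulary
    return (ngrams[context + (word,)] + k) / (prefixes[context] + k * vocab)


corpus = "the cat sat on the mat the cat ran".split()
print(ngram_probability(corpus, context=("the",), word="cat"))  # ≈ 0.42
```

Real systems use better smoothing (Kneser-Ney, backoff), but the shape of the computation is the same: count, normalize, and reserve some probability mass for unseen events.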

Neural LMs: Poor

Neural models are notoriously data-hungry:

  • Modern LLMs train on hundreds of billions of tokens
  • Must learn all structure from data
  • Few built-in inductive biases for sequential structure

Transfer learning (pretraining + fine-tuning) helps but doesn’t eliminate the fundamental requirement.

Scalability

How do the methods handle large problems?

CTW: Limited

CTW faces inherent scaling challenges:

  • 2^D possible contexts at depth D (exponential)
  • Designed for binary alphabets; extensions to larger alphabets exist but are less elegant
  • Memory grows with sequence diversity

Works well for binary sequences up to moderate depth (~10-15).
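
To put rough numbers on that depth limit (simple arithmetic, illustrative only): a complete binary context tree of depth D has 2^(D+1) - 1 nodes, although lazy implementations only allocate nodes for contexts that actually occur.

```python
# Nodes in a complete binary context tree of depth D: 2^(D+1) - 1.
for depth in (10, 15, 20, 30):
    nodes = 2 ** (depth + 1) - 1
    print(f"depth {depth:2d}: {nodes:,} nodes")

# depth 10:         2,047 nodes
# depth 15:        65,535 nodes
# depth 20:     2,097,151 nodes
# depth 30: 2,147,483,647 nodes
```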

N-grams: Good

Reasonable scaling properties:

  • O(|V|^N) possible n-grams for vocabulary size |V| and order N
  • Efficient implementations use tries and hashing
  • Practical up to ~5-grams for large vocabularies (see the count below)
  • Web-scale n-gram models exist (e.g., Google’s Web 1T 5-gram corpus)
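
A back-of-the-envelope count (illustrative numbers only) shows why sparse data structures are non-negotiable:

```python
# Possible distinct n-grams for a vocabulary of 50,000 word types.
vocab_size = 50_000
for order in range(2, 6):
    print(f"{order}-grams: {vocab_size ** order:.2e} possible entries")

# 5-grams: ~3.1e+23 possible entries -- astronomically more than any corpus
# contains, so practical models store only observed n-grams in a trie or
# hash table and back off for everything else.
```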

Neural LMs: Excellent

Designed for scale:

  • Billions of parameters
  • GPU/TPU parallelization
  • Handle long contexts (though attention is O(n²))
  • Vocabularies of 50,000+ tokens

The entire field has optimized for scaling neural models.

Inductive Biases

What assumptions does each method make?

CTW: Tree Structure Bias

CTW assumes the source is well-modeled by a tree:

  • Perfect for Markov processes
  • May miss patterns outside tree structure
  • Implicit Occam’s razor (prefers simpler trees)

This is both a strength (sample efficiency) and a limitation (it can’t capture patterns that don’t fit a tree).

N-grams: Fixed Context Length

N-grams assume the last N-1 symbols fully determine the next symbol’s distribution, regardless of anything earlier:

  • Works when assumption holds
  • Can’t adapt context length to data
  • Smoothing adds bias toward shorter contexts

Neural LMs: Learned Biases

Neural models learn their biases from data:

  • Attention patterns emerge from training
  • Positional encodings provide some structure
  • Very flexible, but the biases themselves must be learned from data

The transformer architecture encodes some bias (attention, residual connections) but much less than CTW or n-grams.
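
For a sense of where the learned bias lives, here is a minimal numpy sketch of single-head scaled dot-product attention (the names and random projection matrices are stand-ins for learned parameters; real transformers add masking, multiple heads, and positional information):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (n, n): the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ v                                # every output mixes all inputs


rng = np.random.default_rng(0)
n, d = 8, 16                                          # toy sequence length and width
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)         # (8, 16)
```

The (n, n) score matrix is also where the O(n²) cost mentioned in the scalability section comes from.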

Theoretical Guarantees

What can we prove about each method?

CTW: Strong Guarantees

For tree sources, CTW is provably optimal:

  • Redundancy bounded by O(k log n) for a k-leaf tree source (see the bound below)
  • Converges to the true distribution at the minimax-optimal rate
  • Equivalent to Bayesian (Solomonoff-style) induction restricted to the class of tree sources

These aren’t just asymptotic results—they hold with finite data.
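
To make the O(k log n) claim concrete, here is one commonly quoted form of the bound for a binary sequence of length n and a tree source with |S| leaves (exact constants vary between presentations):

```latex
\rho_{\mathrm{CTW}}(x^n)
  \;\le\;
  \underbrace{(2|S| - 1)}_{\text{describing the tree}}
  \;+\;
  \underbrace{|S|\left(\tfrac{1}{2}\log_2\tfrac{n}{|S|} + 1\right)}_{\text{learning the leaf parameters}}
  \;=\; O\!\left(|S| \log n\right)
```

The first term is the price of not knowing which tree generated the data; the second is the standard (1/2) log n cost per estimated parameter.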

N-grams: Moderate Guarantees

Some theory exists:

  • MLE is consistent (counts converge to the true parameters)
  • Cross-entropy bounds exist for smoothed estimators
  • No optimality guarantees for particular source classes

The guarantees are weaker than CTW’s.

Neural LMs: Weak Guarantees

Theory lags practice:

  • Universal approximation (in principle, with enough capacity)
  • No sample complexity bounds in realistic settings
  • Generalization is mostly empirical

We use neural LMs because they work, not because we can prove they should.

Interpretability

Can we understand what the model learned?

CTW: High

The context tree is directly interpretable:

  • Which contexts matter? The nodes whose statistics differ from their parent’s
  • How confident is the model? Read the counts stored at each node
  • The mixture weights show which tree depths the data favors

This transparency aids debugging and building trust.

N-grams: High

Direct access to probability tables:

  • Which n-grams are common?
  • What’s P(word | context)?
  • Clear model of what’s being captured

Neural LMs: Low

Black boxes:

  • Hidden representations hard to interpret
  • Attention patterns provide some insight
  • Active research area (mechanistic interpretability)

We’re improving, but neural LMs remain opaque.

When to Use Each

Use CTW When:

  1. Data is limited (< 100K samples)
  2. Binary or small alphabet sequences
  3. Strong guarantees needed (safety-critical applications)
  4. Interpretability required
  5. Compute is constrained
  6. Source is likely tree-structured (Markov processes)

Examples: Binary file compression, network protocol analysis, genomic sequences, embedded systems.

Use N-grams When:

  1. Baseline model needed (comparison point)
  2. Fast training/inference required
  3. Moderate data available
  4. Simple deployment
  5. Interpretability desired

Examples: Quick prototyping, spell checking, simple autocomplete, compression (PPM).

Use Neural LMs When:

  1. Massive training data available
  2. State-of-the-art performance needed
  3. Complex patterns expected (natural language, code)
  4. Long-range dependencies important
  5. Can afford compute costs

Examples: Chat assistants, machine translation, code generation, content creation.

The Bias-Data Trade-off

The fundamental relationship:

  • Strong bias (CTW) + little data → good performance
  • Weak bias (neural) + little data → overfitting
  • Weak bias (neural) + massive data → great performance
  • Strong bias (CTW) + massive data → limited by the model class

CTW trades flexibility for sample efficiency; neural LMs make the opposite trade.

The Compression-Prediction Duality

All these models can be viewed as compressors:

  • CTW: Near-optimal compression for tree sources
  • N-grams: Underlie the PPM family of compressors
  • Neural LMs: Achieve state-of-the-art compression rates

Better prediction, better compression, and lower cross-entropy are the same thing measured in bits. This is more than an analogy: Shannon’s source coding theorem makes the correspondence exact.
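
Concretely (a minimal sketch, not from the post; the function name is mine): a model that assigns probability p to each observed symbol can, via arithmetic coding, spend about -log2(p) bits on it, so the total code length is exactly the model's log loss.

```python
import math

def code_length_bits(probs_of_observed):
    """Ideal arithmetic-coding length = total log loss of the model's predictions."""
    return sum(-math.log2(p) for p in probs_of_observed)


# A model that gives each observed symbol probability 0.9 pays ~0.152 bits/symbol;
# a clueless 50/50 model pays exactly 1 bit/symbol.
print(code_length_bits([0.9] * 100) / 100)  # ≈ 0.152
print(code_length_bits([0.5] * 100) / 100)  # = 1.0
```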

Future Directions

The boundaries between approaches are blurring:

  1. Hybrid models: Combine CTW’s efficiency with neural flexibility
  2. Bayesian neural networks: Add uncertainty to neural predictions
  3. Neural architecture search: Learn inductive biases
  4. Continual learning: Adapt online like CTW
  5. Theoretical understanding: Better bounds for neural models

Key Takeaways

  • No universally best approach—the right choice depends on context
  • CTW wins on sample efficiency and guarantees
  • Neural LMs win on scalability and flexibility
  • N-grams remain useful as baselines and for interpretability
  • Understanding trade-offs is more valuable than declaring winners

What’s Next

The next post dives deeper into neural language models—from RNNs to transformers to modern LLMs. We’ll see how the architecture evolved to address sequential prediction at scale.
