
Comparing Prediction Methods: CTW vs. N-grams vs. Neural LMs

Previous posts in this series covered the theoretical foundations (Solomonoff, Bayesian prediction), the model classes (Markov processes, tree sources), and classical approaches (n-grams). Now I want to step back and compare the major approaches to sequential prediction head-to-head.

I’m not going to declare a winner. Each method is best in a different regime. The useful insight is knowing when to reach for which tool.

Summary

Aspect                   CTW               N-gram          Neural LM
Sample efficiency        Excellent         Good            Poor
Scalability              Limited           Good            Excellent
Inductive bias           Tree structures   Local context   Learned
Interpretability         High              High            Low
Theoretical guarantees   Strong            Moderate        Weak
Compute requirements     Minimal           Low             Massive
Context modeling         Adaptive depth    Fixed length    Flexible
Vocabulary handling      Binary optimal    Any size        Any size

Sample efficiency

CTW learns from hundreds to thousands of examples. It has a strong inductive bias (tree structures are rich but constrained), a Bayesian prior that regularizes automatically, and the KT estimator, which is statistically optimal. For a 3rd-order Markov process with 8 contexts, CTW reaches near-optimal accuracy from about 1000 samples.
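The KT estimator mentioned above is simple enough to show directly. A minimal sketch (the function name is mine, not from any particular CTW library): each symbol gets a pseudocount of 1/2, which is what gives the estimator its minimax-optimal redundancy for Bernoulli sources.

```python
def kt_probability(zeros: int, ones: int, next_bit: int) -> float:
    """Krichevsky-Trofimov estimate of the next bit's probability.

    Each symbol gets a pseudocount of 1/2, so the estimate is never
    zero and the regret against the best fixed Bernoulli source grows
    only logarithmically in the sequence length.
    """
    count = ones if next_bit == 1 else zeros
    return (count + 0.5) / (zeros + ones + 1.0)

# After seeing 3 zeros and 1 one, estimate P(next = 1):
p = kt_probability(zeros=3, ones=1, next_bit=1)  # (1 + 0.5) / 5 = 0.3
```

CTW runs this estimator at every node of the context tree and Bayes-mixes the results, which is why it needs so few samples per context.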

N-grams need moderate data. Count statistics converge reasonably fast, smoothing handles sparsity, and higher orders need proportionally more observations.
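To make the smoothing point concrete, here is a minimal add-alpha (Laplace) bigram estimator; the function name and toy corpus are illustrative, not from the post's earlier code.

```python
from collections import Counter

def bigram_prob(tokens, vocab_size, context, word, alpha=1.0):
    """P(word | context) under add-alpha (Laplace) smoothing.

    Counts converge quickly with data, and the pseudocount alpha keeps
    unseen bigrams from receiving probability zero.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return (bigrams[(context, word)] + alpha) / (contexts[context] + alpha * vocab_size)

tokens = "the cat sat on the mat".split()
# Seen bigram: ("the", "cat") occurred once, "the" occurred twice as a context.
p_seen = bigram_prob(tokens, vocab_size=5, context="the", word="cat")   # (1+1)/(2+5)
# Unseen bigram still gets nonzero mass:
p_unseen = bigram_prob(tokens, vocab_size=5, context="the", word="sat")  # (0+1)/(2+5)
```

Higher orders work the same way but need proportionally more observations before the counts dominate the pseudocounts.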

Neural language models are data-hungry. Modern LLMs train on hundreds of billions of tokens because they must learn all structure from data, with essentially no built-in inductive biases for sequence structure. Transfer learning helps but doesn’t eliminate the fundamental requirement.

Scalability

CTW hits walls. The context tree has 2^D possible contexts at depth D (exponential growth), pure binary encoding is the sweet spot, and memory grows with sequence diversity. It works well for binary sequences up to moderate depth, maybe 10-15.

N-grams scale reasonably. O(|V|^N) space for vocabulary V and order N, efficient implementations with tries and hashing, practical up to about 5-gram for large vocabularies. Google built web-scale n-gram models, so the engineering is well understood.
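The trie trick mentioned above is worth a sketch (a nested-dict toy version, not production engineering): shared prefixes are stored once, so space scales with the number of n-grams actually observed rather than the |V|^N worst case.

```python
def build_ngram_trie(tokens, order):
    """Store n-gram counts in a prefix trie of nested dicts.

    Each path from the root spells out an n-gram prefix; "count" at a
    node is how often that prefix occurred. Shared prefixes share
    storage, which is why tries beat flat |V|**N tables in practice.
    """
    root = {"count": 0, "children": {}}
    for i in range(len(tokens) - order + 1):
        node = root
        for tok in tokens[i:i + order]:
            node = node["children"].setdefault(tok, {"count": 0, "children": {}})
            node["count"] += 1
    return root

trie = build_ngram_trie("the cat sat on the mat".split(), order=2)
the_count = trie["children"]["the"]["count"]                    # "the" as a bigram prefix
the_cat = trie["children"]["the"]["children"]["cat"]["count"]   # the bigram ("the", "cat")
```

Real systems add hashing, quantized counts, and pruning on top, but the prefix-sharing idea is the core of it.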

Neural LMs were designed for scale. Billions of parameters, GPU/TPU parallelization, vocabulary of 50,000+ tokens. The entire field has optimized for making neural models bigger.

Inductive biases

CTW assumes the source is well-modeled by a tree. This is both its strength (sample efficiency, provable guarantees) and its limitation (it can miss patterns outside tree structure). There’s an implicit Occam’s razor that prefers simpler trees.

N-grams assume all contexts of the same length are equally relevant. They can't adapt context length to the data. Smoothing adds a bias toward shorter contexts.

Neural LMs learn their biases from data. Attention patterns emerge during training, positional encodings provide some structure, but the model is very flexible and needs data to specialize. The transformer architecture encodes some bias (attention, residual connections) but much less than CTW or n-grams.

Theoretical guarantees

For tree sources, CTW is provably optimal. Redundancy bounded by O(k log n) for a k-leaf tree, convergence to the true distribution at minimax optimal rate, equivalent to Solomonoff induction restricted to trees. These hold with finite data, not just asymptotically.

N-grams have some theory: MLE is consistent, there are cross-entropy bounds with smoothing, but no optimality results for specific source classes.

Neural LMs have weak theory. Universal approximation holds in principle with enough capacity, but there are no sample complexity bounds in realistic settings. We use neural LMs because they work, not because we can prove they should.

Interpretability

The CTW context tree is directly readable. You can see which contexts matter, how confident the model is at each node, and how the mixture weights reveal which depths the model prefers.

N-grams give direct access to probability tables: which n-grams are common, what’s P(word | context). Clear model of what’s being captured.

Neural LMs are opaque. Hidden representations are hard to interpret. Attention patterns provide some insight, and mechanistic interpretability is an active research area, but these models remain largely black boxes.

When to use each

CTW is the right tool when: data is limited (under 100K samples), sequences are binary or small-alphabet, you need provable guarantees (safety-critical applications), interpretability matters, compute is constrained, or the source is likely tree-structured. Think binary file compression, network protocol analysis, genomic sequences, embedded systems.

N-grams are the right tool when: you need a baseline, fast training and inference are important, moderate data is available, deployment must be simple, or you want interpretable results. Think prototyping, spell checking, simple autocomplete, PPM compression.

Neural LMs are the right tool when: massive training data is available, you need state-of-the-art performance, complex patterns are expected (natural language, code), long-range dependencies matter, and you can afford the compute. Think chat assistants, machine translation, code generation.

The bias-data trade-off

This is the fundamental relationship:

Strong bias (CTW) + Little data    -> Good performance
Weak bias (Neural) + Little data   -> Overfitting
Weak bias (Neural) + Massive data  -> Great performance
Strong bias (CTW) + Massive data   -> Limited by model class

CTW trades flexibility for sample efficiency. Neural LMs make the opposite trade. Neither trade is universally correct.

The compression-prediction duality

All of these models can be viewed as compressors. CTW gives near-optimal compression for tree sources. N-grams underlie the PPM compression family. Neural LMs achieve state-of-the-art compression rates. Better prediction equals better compression equals lower cross-entropy. Shannon showed these are mathematically identical, not just analogous.
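The duality is literal arithmetic: the ideal code length for a sequence is the sum of -log2 of the probabilities the model assigned to each observed symbol. A minimal sketch (function name is mine):

```python
import math

def code_length_bits(probs):
    """Total ideal code length, in bits, for a sequence of symbols.

    `probs` holds the probability the model assigned to each symbol as
    it was observed. An arithmetic coder can approach this length, so
    better prediction (higher probabilities) directly means shorter
    codes and lower cross-entropy.
    """
    return -sum(math.log2(p) for p in probs)

# A model that predicts each symbol with probability 0.5 spends
# exactly 1 bit per symbol:
bits = code_length_bits([0.5] * 8)
```

This is why compression benchmarks double as language-model benchmarks: the same number is being measured from two directions.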

What’s next

The next post gets into neural language models specifically, from RNNs to transformers to modern LLMs, and how the architecture evolved to handle sequential prediction at scale.
