
Comparing Prediction Methods: CTW vs. N-grams vs. Neural LMs

Previous posts in this series covered the theoretical foundations (Solomonoff, Bayesian prediction), the model classes (Markov processes, tree sources), and classical approaches (n-grams). Now I want to step back and compare the major approaches to sequential prediction head-to-head.

I’m not going to declare a winner. Each method is best in a different regime. The useful insight is knowing when to reach for which tool.

Summary

Aspect                   CTW               N-gram          Neural LM
Sample efficiency        Excellent         Good            Poor
Scalability              Limited           Good            Excellent
Inductive bias           Tree structures   Local context   Learned
Interpretability         High              High            Low
Theoretical guarantees   Strong            Moderate        Weak
Compute requirements     Minimal           Low             Massive
Context modeling         Adaptive depth    Fixed length    Flexible
Vocabulary handling      Binary optimal    Any size        Any size

Sample efficiency

CTW learns from hundreds to thousands of examples. It has a strong inductive bias (tree structures are rich but constrained), a Bayesian prior that regularizes automatically, and the KT estimator, which is statistically optimal. For a 3rd-order Markov process with 8 contexts, CTW reaches near-optimal accuracy from about 1000 samples.
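The KT estimator mentioned above is simple enough to show directly. A minimal sketch (the function name is mine, not from any particular CTW library): each symbol gets a pseudocount of 1/2, which is what gives the estimator its minimax-optimal redundancy for Bernoulli sources.

```python
def kt_probability(zeros: int, ones: int, next_bit: int) -> float:
    """Krichevsky-Trofimov estimate of the next bit's probability.

    Each symbol gets a pseudocount of 1/2, so the estimate is never
    zero and the regret against the best fixed Bernoulli source grows
    only logarithmically in the sequence length.
    """
    count = ones if next_bit == 1 else zeros
    return (count + 0.5) / (zeros + ones + 1.0)

# After seeing 3 zeros and 1 one, estimate P(next = 1):
p = kt_probability(zeros=3, ones=1, next_bit=1)  # (1 + 0.5) / 5 = 0.3
```

CTW runs this estimator at every node of the context tree and Bayes-mixes the results, which is why it needs so few samples per context.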

N-grams need moderate data. Count statistics converge reasonably fast, smoothing handles sparsity, and higher orders need proportionally more observations.
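To make the smoothing point concrete, here is a minimal add-alpha (Laplace) bigram estimator; the function name and toy corpus are illustrative, not from the post's earlier code.

```python
from collections import Counter

def bigram_prob(tokens, vocab_size, context, word, alpha=1.0):
    """P(word | context) under add-alpha (Laplace) smoothing.

    Counts converge quickly with data, and the pseudocount alpha keeps
    unseen bigrams from receiving probability zero.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return (bigrams[(context, word)] + alpha) / (contexts[context] + alpha * vocab_size)

tokens = "the cat sat on the mat".split()
# Seen bigram: ("the", "cat") occurred once, "the" occurred twice as a context.
p_seen = bigram_prob(tokens, vocab_size=5, context="the", word="cat")   # (1+1)/(2+5)
# Unseen bigram still gets nonzero mass:
p_unseen = bigram_prob(tokens, vocab_size=5, context="the", word="sat")  # (0+1)/(2+5)
```

Higher orders work the same way but need proportionally more observations before the counts dominate the pseudocounts.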

Neural language models are data-hungry. Modern LLMs train on hundreds of billions of tokens because they must learn all structure from data, with essentially no built-in inductive biases for sequence structure. Transfer learning helps but doesn’t eliminate the fundamental requirement.

Scalability

CTW hits walls. The context tree has 2^D possible contexts at depth D (exponential growth), pure binary encoding is the sweet spot, and memory grows with sequence diversity. It works well for binary sequences up to moderate depth, maybe 10-15.

N-grams scale reasonably. O(|V|^N) space for vocabulary V and order N, efficient implementations with tries and hashing, practical up to about 5-gram for large vocabularies. Google built web-scale n-gram models, so the engineering is well understood.
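The trie trick mentioned above is worth a sketch (a nested-dict toy version, not production engineering): shared prefixes are stored once, so space scales with the number of n-grams actually observed rather than the |V|^N worst case.

```python
def build_ngram_trie(tokens, order):
    """Store n-gram counts in a prefix trie of nested dicts.

    Each path from the root spells out an n-gram prefix; "count" at a
    node is how often that prefix occurred. Shared prefixes share
    storage, which is why tries beat flat |V|**N tables in practice.
    """
    root = {"count": 0, "children": {}}
    for i in range(len(tokens) - order + 1):
        node = root
        for tok in tokens[i:i + order]:
            node = node["children"].setdefault(tok, {"count": 0, "children": {}})
            node["count"] += 1
    return root

trie = build_ngram_trie("the cat sat on the mat".split(), order=2)
the_count = trie["children"]["the"]["count"]                    # "the" as a bigram prefix
the_cat = trie["children"]["the"]["children"]["cat"]["count"]   # the bigram ("the", "cat")
```

Real systems add hashing, quantized counts, and pruning on top, but the prefix-sharing idea is the core of it.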

Neural LMs were designed for scale. Billions of parameters, GPU/TPU parallelization, vocabulary of 50,000+ tokens. The entire field has optimized for making neural models bigger.

Inductive biases

CTW assumes the source is well-modeled by a tree. This is both its strength (sample efficiency, provable guarantees) and its limitation (it can miss patterns outside tree structure). There’s an implicit Occam’s razor that prefers simpler trees.

N-grams assume all contexts of the same length are equally relevant. They can't adapt context length to the data. Smoothing adds a bias toward shorter contexts.

Neural LMs learn their biases from data. Attention patterns emerge during training, positional encodings provide some structure, but the model is very flexible and needs data to specialize. The transformer architecture encodes some bias (attention, residual connections) but much less than CTW or n-grams.

Theoretical guarantees

For tree sources, CTW is provably optimal. Redundancy bounded by O(k log n) for a k-leaf tree, convergence to the true distribution at minimax optimal rate, equivalent to Solomonoff induction restricted to trees. These hold with finite data, not just asymptotically.

N-grams have some theory: MLE is consistent, there are cross-entropy bounds with smoothing, but no optimality results for specific source classes.

Neural LMs have weak theory. Universal approximation holds in principle with enough capacity, but there are no sample complexity bounds in realistic settings. We use neural LMs because they work, not because we can prove they should.

Interpretability

The CTW context tree is directly readable. You can see which contexts matter, how confident the model is at each node, and how the mixture weights reveal which depths the model prefers.

N-grams give direct access to probability tables: which n-grams are common, what’s P(word | context). Clear model of what’s being captured.

Neural LMs are opaque. Hidden representations are hard to interpret. Attention patterns provide some insight, and mechanistic interpretability is an active research area, but these models remain largely black boxes.

When to use each

CTW is the right tool when: data is limited (under 100K samples), sequences are binary or small-alphabet, you need provable guarantees (safety-critical applications), interpretability matters, compute is constrained, or the source is likely tree-structured. Think binary file compression, network protocol analysis, genomic sequences, embedded systems.

N-grams are the right tool when: you need a baseline, fast training and inference are important, moderate data is available, deployment must be simple, or you want interpretable results. Think prototyping, spell checking, simple autocomplete, PPM compression.

Neural LMs are the right tool when: massive training data is available, you need state-of-the-art performance, complex patterns are expected (natural language, code), long-range dependencies matter, and you can afford the compute. Think chat assistants, machine translation, code generation.

The bias-data trade-off

This is the fundamental relationship:

Strong bias (CTW) + Little data    -> Good performance
Weak bias (Neural) + Little data   -> Overfitting
Weak bias (Neural) + Massive data  -> Great performance
Strong bias (CTW) + Massive data   -> Limited by model class

CTW trades flexibility for sample efficiency. Neural LMs make the opposite trade. Neither trade is universally correct.

The compression-prediction duality

All of these models can be viewed as compressors. CTW gives near-optimal compression for tree sources. N-grams underlie the PPM compression family. Neural LMs achieve state-of-the-art compression rates. Better prediction equals better compression equals lower cross-entropy. Shannon showed these are mathematically identical, not just analogous.
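The duality is literal arithmetic: the ideal code length for a sequence is the sum of -log2 of the probabilities the model assigned to each observed symbol. A minimal sketch (function name is mine):

```python
import math

def code_length_bits(probs):
    """Total ideal code length, in bits, for a sequence of symbols.

    `probs` holds the probability the model assigned to each symbol as
    it was observed. An arithmetic coder can approach this length, so
    better prediction (higher probabilities) directly means shorter
    codes and lower cross-entropy.
    """
    return -sum(math.log2(p) for p in probs)

# A model that predicts each symbol with probability 0.5 spends
# exactly 1 bit per symbol:
bits = code_length_bits([0.5] * 8)
```

This is why compression benchmarks double as language-model benchmarks: the same number is being measured from two directions.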

What’s next

The next post gets into neural language models specifically, from RNNs to transformers to modern LLMs, and how the architecture evolved to handle sequential prediction at scale.
