Before neural networks dominated NLP, n-gram models were the workhorse of language modeling. They remain foundational—understanding n-grams provides intuition for what modern LLMs are doing, and reveals why their limitations motivated both CTW and deep learning approaches.
The N-gram Idea
N-gram models make a Markov assumption: only the last N-1 symbols matter.
P(x_n | x_{n-1}, x_{n-2}, ..., x_1) ≈ P(x_n | x_{n-1}, ..., x_{n-N+1})
For a bigram (N=2) model over words:
P(word | prev_word) = count(prev_word, word) / count(prev_word)
For a trigram (N=3):
P(word | prev2, prev1) = count(prev2, prev1, word) / count(prev2, prev1)
It’s just counting how often sequences appear, then normalizing.
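As a minimal sketch (on a made-up toy corpus), the whole estimation procedure fits in a few lines of Python:

```python
from collections import Counter

# Toy corpus (made up) to illustrate maximum-likelihood bigram estimation.
corpus = "the cat sat on the mat the cat slept".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))  # count(prev_word, word)
context_counts = Counter(corpus[:-1])             # count(prev_word) as a bigram context

def p_bigram(word, prev_word):
    """Maximum-likelihood estimate of P(word | prev_word)."""
    if context_counts[prev_word] == 0:
        return 0.0  # unseen context: the sparsity problem discussed next
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]

print(p_bigram("cat", "the"))  # 2/3: "the" is followed by "cat" twice and "mat" once
```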
The Sparsity Problem
As N increases, n-gram models face severe data sparsity:
| N | Context size | English vocabulary | Possible contexts |
|---|---|---|---|
| 2 | 1 word | 50,000 | 50,000 |
| 3 | 2 words | 50,000 | 2.5 billion |
| 4 | 3 words | 50,000 | 125 trillion |
Most contexts never appear in any training corpus. Maximum-likelihood estimation assigns zero probability to any n-gram it has never seen, leading to:
- Infinite perplexity on test data
- Zero probability for reasonable sentences
- Catastrophic failures in practice
Smoothing: Handling Unseen Events
Smoothing techniques reallocate probability mass to unseen events.
Laplace (Add-One) Smoothing
The simplest fix:
P(w | context) = (count(context, w) + 1) / (count(context) + |V|)
Problems:
- Adds too much mass to unseen events
- Assumes all unseen events equally likely
- Performance is mediocre
Add-k Smoothing
Generalized Laplace:
P(w | context) = (count(context, w) + k) / (count(context) + k × |V|)
Tune k on held-out data. Still relatively crude.
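A minimal sketch of add-k smoothing (the counts and vocabulary size below are made up for illustration; k = 1 recovers Laplace):

```python
from collections import Counter

def add_k_prob(word, context, ngram_counts, context_counts, vocab_size, k=1.0):
    """Add-k smoothed estimate of P(word | context). k = 1 is Laplace smoothing."""
    return (ngram_counts[(context, word)] + k) / (context_counts[context] + k * vocab_size)

# Hypothetical counts: "the" seen 3 times, followed by "cat" twice and "mat" once.
ngram_counts = Counter({(("the",), "cat"): 2, (("the",), "mat"): 1})
context_counts = Counter({("the",): 3})
vocab_size = 10_000

print(add_k_prob("cat", ("the",), ngram_counts, context_counts, vocab_size))  # ~0.0003
print(add_k_prob("dog", ("the",), ngram_counts, context_counts, vocab_size))  # ~0.0001, unseen but nonzero
```

Note how the estimate for "cat" collapses from the maximum-likelihood value of 2/3 to roughly 0.0003: with a large vocabulary, add-one hands almost all of the probability mass to unseen events.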
Good-Turing Smoothing
A more principled approach based on the frequency of frequencies Nᵣ (the number of distinct n-grams observed exactly r times):
- Estimate the total probability of unseen events from the events seen exactly once: roughly N₁ / N, where N is the total number of observations
- For events seen r times, replace the raw count with the adjusted count r* = (r + 1) × Nᵣ₊₁ / Nᵣ
Good-Turing provides better estimates but is mathematically intricate.
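A bare-bones sketch of the Good-Turing count adjustment (real implementations such as Simple Good-Turing additionally smooth the Nᵣ curve, since Nᵣ₊₁ is often zero for large r):

```python
from collections import Counter

def good_turing(ngram_counts):
    """Return adjusted counts r* = (r + 1) * N_{r+1} / N_r and the unseen-event mass N_1 / N."""
    freq_of_freq = Counter(ngram_counts.values())  # N_r: number of n-grams seen exactly r times
    total = sum(ngram_counts.values())             # N: total observations

    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_r_next = freq_of_freq.get(r + 1, 0)
        # Fall back to the raw count when N_{r+1} = 0 (this is where proper smoothing is needed).
        adjusted[r] = (r + 1) * n_r_next / n_r if n_r_next else float(r)
    unseen_mass = freq_of_freq.get(1, 0) / total   # total probability reserved for unseen events
    return adjusted, unseen_mass
```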
Kneser-Ney Smoothing
The gold standard for n-grams. Key insights:
- Absolute discounting: Subtract a fixed δ from all counts, redistribute to unseen
- Continuation probability: For backoff, use how many different contexts a word appears in, not raw frequency
Kneser-Ney accounts for the fact that common words appear in many different contexts, while some frequent words occur almost exclusively in one context (the classic example is "Francisco", which is common but almost always follows "San"), and so make poor backoff estimates.
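A sketch of interpolated Kneser-Ney for bigrams, assuming a Counter of bigram counts as input (the 0.75 discount is a conventional default, not something fixed by the method):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Interpolated Kneser-Ney for bigrams; bigram_counts maps (prev_word, word) -> count."""
    context_totals = Counter()        # c(v): total bigram count starting with v
    followers = defaultdict(set)      # distinct words seen after v
    histories = defaultdict(set)      # distinct words seen before w
    for (v, w), c in bigram_counts.items():
        context_totals[v] += c
        followers[v].add(w)
        histories[w].add(v)
    num_bigram_types = len(bigram_counts)

    def prob(word, prev_word):
        # Continuation probability: in how many distinct contexts does `word` appear?
        p_cont = len(histories[word]) / num_bigram_types
        c_v = context_totals[prev_word]
        if c_v == 0:
            return p_cont                                  # unseen context: pure continuation estimate
        discounted = max(bigram_counts[(prev_word, word)] - discount, 0) / c_v
        lam = discount * len(followers[prev_word]) / c_v   # mass freed up by discounting
        return discounted + lam * p_cont
    return prob
```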
Backoff and Interpolation
When the context hasn’t been seen, we need fallback strategies.
Stupid Backoff
Simple but effective (used in Google’s web-scale LM):
S(w | ABC) =
    if count(ABCw) > 0: count(ABCw) / count(ABC)
    else: 0.4 × S(w | BC)   # back off with a fixed penalty
Note that the result is a score S, not a normalized probability distribution.
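A recursive sketch, assuming count tables keyed by (context tuple, word) and by context tuple have been built for every order (the names here are illustrative):

```python
def stupid_backoff(word, context, ngram_counts, context_counts, unigram_counts, total, alpha=0.4):
    """Stupid backoff score S(word | context); `context` is a tuple of preceding words."""
    if not context:
        count = unigram_counts.get(word, 0)
        return count / total if count else alpha / total  # crude floor for unseen words
    if ngram_counts.get((context, word), 0) > 0:
        return ngram_counts[(context, word)] / context_counts[context]
    # Shorten the context by one word and apply the fixed 0.4 penalty.
    return alpha * stupid_backoff(word, context[1:], ngram_counts, context_counts,
                                  unigram_counts, total, alpha)
```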
Katz Backoff
Reserve probability mass for unseen events using Good-Turing:
P(w | ABC) =
if count(ABCw) > 0: discounted_count(ABCw) / count(ABC)
else: α(ABC) × P(w | BC)
Where α(ABC) normalizes the leftover probability mass.
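A bigram-level sketch of the same idea; for brevity a fixed absolute discount stands in for the Good-Turing discounted counts, so treat it as an illustration of the α renormalization rather than Katz's exact recipe:

```python
from collections import Counter, defaultdict

def katz_bigram(bigram_counts, unigram_counts, discount=0.5):
    """Katz-style backoff for bigrams, with a fixed discount in place of Good-Turing counts."""
    context_totals = Counter()
    seen_after = defaultdict(set)
    for (v, w), c in bigram_counts.items():
        context_totals[v] += c
        seen_after[v].add(w)
    total_words = sum(unigram_counts.values())

    def p_uni(w):
        return unigram_counts.get(w, 0) / total_words

    def prob(word, prev_word):
        c_v = context_totals[prev_word]
        if c_v == 0:
            return p_uni(word)                       # context never seen at all
        c_vw = bigram_counts.get((prev_word, word), 0)
        if c_vw > 0:
            return (c_vw - discount) / c_v           # discounted bigram estimate
        # alpha(prev_word): mass freed by discounting, renormalized over unseen continuations.
        freed = discount * len(seen_after[prev_word]) / c_v
        seen_mass = sum(p_uni(w) for w in seen_after[prev_word])
        alpha = freed / (1.0 - seen_mass) if seen_mass < 1.0 else 0.0
        return alpha * p_uni(word)
    return prob
```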
Interpolation
Always mix predictions from all context lengths:
P_interp(w | AB) = λ₃ × P_ML(w | AB) + λ₂ × P_ML(w | B) + λ₁ × P_ML(w)
Where λ₁ + λ₂ + λ₃ = 1, typically tuned on held-out data.
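A sketch in code, assuming a maximum-likelihood estimator p_ml(word, context) that accepts a context tuple of any length (a hypothetical helper, not defined here); the λ values shown are placeholders rather than tuned weights:

```python
def interpolated_prob(word, context, p_ml, lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram ML estimates; lambdas must sum to 1."""
    l1, l2, l3 = lambdas
    return (l3 * p_ml(word, context[-2:])    # trigram: condition on last two words
            + l2 * p_ml(word, context[-1:])  # bigram: condition on last word
            + l1 * p_ml(word, ()))           # unigram: no context
```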
N-grams vs. CTW
CTW can be viewed as a principled form of interpolation with learned weights:
| Aspect | N-gram | CTW |
|---|---|---|
| Mixing weights | Fixed/tuned (λᵢ) | Learned from data |
| Context order | Fixed N | Adaptive per context |
| Smoothing | Heuristic (Kneser-Ney) | Principled (KT estimator) |
| Guarantees | Empirical | Theoretical bounds |
| Model class | Fixed-order Markov | All tree sources |
Key insight: CTW automatically learns appropriate interpolation weights based on how well each context depth predicts. It’s what n-gram smoothing is trying to approximate, done optimally.
Strengths of N-grams
Despite limitations, n-grams have real advantages:
- Simple: Just counting and division
- Fast: Single pass over training data
- Interpretable: Can examine specific n-gram probabilities
- Efficient: Hash tables or tries for storage
- Baseline: Hard to beat for some tasks with limited compute
Limitations
- Fixed context: Can’t capture dependencies beyond N-1
- No generalization: “cat sat” and “dog sat” share nothing
- Sparsity: Higher N = exponentially more unseen contexts
- No compositionality: No notion of syntax or semantics
These limitations motivated neural approaches that learn dense representations where similar contexts share information.
Binary vs. Text
For binary sequences (CTW’s domain):
- Vocabulary |V| = 2
- Sparsity less severe: only 2^(N−1) possible contexts for an N-gram model
- CTW is essentially an optimal binary n-gram with learned interpolation
For text:
- |V| ~ 10,000 - 100,000 words
- Sparsity is severe: even with |V| = 10,000 there are 10,000² = 10⁸ possible trigram contexts (and 10¹² possible trigrams)
- Subword tokenization (BPE) helps
- Neural models now dominate
Code Example
A simple bigram model:
```python
from collections import defaultdict

class BigramModel:
    def __init__(self, alpha=1.0):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.context_counts = defaultdict(int)
        self.alpha = alpha       # smoothing
        self.vocab_size = 2      # binary

    def train(self, sequence):
        for i in range(1, len(sequence)):
            context = sequence[i - 1]
            symbol = sequence[i]
            self.counts[context][symbol] += 1
            self.context_counts[context] += 1

    def predict(self, context):
        """Return P(next=1 | context)."""
        c0 = self.counts[context][0]
        c1 = self.counts[context][1]
        total = c0 + c1
        # Add-alpha smoothing
        return (c1 + self.alpha) / (total + self.alpha * self.vocab_size)
```
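Training it on a short made-up binary sequence and querying it:

```python
model = BigramModel(alpha=1.0)
model.train([0, 1, 1, 0, 1, 1, 0, 1])
print(model.predict(1))  # context 1 is followed by 1 twice and 0 twice: (2 + 1) / (4 + 2) = 0.5
print(model.predict(0))  # context 0 is always followed by 1: (3 + 1) / (3 + 2) = 0.8
```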
Compare to CTW: same basic structure (count statistics per context) but with principled Bayesian machinery.
Historical Impact
N-grams were the foundation of:
- Speech recognition (before neural era)
- Machine translation (phrase-based MT)
- Spelling correction
- Text compression (PPM family)
Understanding n-grams provides intuition for what neural language models are learning—but with representations that allow generalization.
Key Takeaways
- N-grams estimate P(next | context) by counting
- Sparsity is the fundamental challenge
- Smoothing (Kneser-Ney) redistributes mass to unseen events
- Interpolation/backoff combines multiple context lengths
- CTW is like optimal n-gram interpolation with learned weights
- Limitations motivate neural approaches for text
What’s Next
The next post compares CTW, n-grams, and neural language models head-to-head, examining when each approach excels and the fundamental trade-offs between sample efficiency and flexibility.
Further Reading
- Jurafsky & Martin. Speech and Language Processing. Chapter 3.
- Chen, S. F., & Goodman, J. (1999). “An empirical study of smoothing techniques for language modeling.”
- Kneser, R., & Ney, H. (1995). “Improved backing-off for m-gram language modeling.”