N-gram Language Models

Before neural networks dominated NLP, n-gram models were the workhorse of language modeling. They remain foundational—understanding n-grams provides intuition for what modern LLMs are doing, and reveals why their limitations motivated both CTW and deep learning approaches.

The N-gram Idea

N-gram models make a Markov assumption: only the last N-1 symbols matter.

P(x_n | x_{n-1}, x_{n-2}, ..., x_1) ≈ P(x_n | x_{n-1}, ..., x_{n-N+1})

For a bigram (N=2) model over words:

P(word | prev_word) = count(prev_word, word) / count(prev_word)

For a trigram (N=3):

P(word | prev2, prev1) = count(prev2, prev1, word) / count(prev2, prev1)

It’s just counting how often sequences appear, then normalizing.
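
For example, if a training corpus contains the bigram "the cat" 10 times and the word "the" 200 times in total (illustrative counts, not from a real corpus), the maximum-likelihood estimate is:

P(cat | the) = count(the, cat) / count(the) = 10 / 200 = 0.05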

The Sparsity Problem

As N increases, n-gram models face severe data sparsity:

N    Context size    English vocabulary    Possible contexts
2    1 word          50,000                50,000
3    2 words         50,000                2.5 billion
4    3 words         50,000                125 trillion

Most contexts never appear in any training corpus. Maximum likelihood assigns P = 0 to any n-gram that never occurred in training, leading to:

  • Infinite perplexity on test data
  • Zero probability for reasonable sentences
  • Catastrophic failures in practice

Smoothing: Handling Unseen Events

Smoothing techniques reallocate probability mass to unseen events.

Laplace (Add-One) Smoothing

The simplest fix:

P(w | context) = (count(context, w) + 1) / (count(context) + |V|)

Problems:

  • Adds too much mass to unseen events
  • Assumes all unseen events equally likely
  • Performance is mediocre

Add-k Smoothing

Generalized Laplace:

P(w | context) = (count(context, w) + k) / (count(context) + k × |V|)

Tune k on held-out data. Still relatively crude.
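
As a sketch, add-k smoothing is a one-line change to the maximum-likelihood estimate; with k = 1 it reduces to the Laplace estimate above. The function and dictionary layout here are illustrative, not from a particular library:

def add_k_prob(counts, context_counts, context, w, k=1.0, vocab_size=50_000):
    """Add-k estimate of P(w | context); k = 1 recovers Laplace smoothing.

    counts: dict mapping (context, w) -> n-gram count
    context_counts: dict mapping context -> total count of that context
    """
    return ((counts.get((context, w), 0) + k)
            / (context_counts.get(context, 0) + k * vocab_size))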

Good-Turing Smoothing

A more principled approach based on frequency of frequencies:

  • Estimate probability of unseen events from events seen exactly once
  • Events seen r times: estimate their true frequency from events seen r+1 times

Good-Turing provides better estimates but is mathematically intricate.
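
A minimal sketch of the core calculation, assuming n-gram counts in a plain dict; practical implementations (e.g., Simple Good-Turing) also smooth the frequency-of-frequency counts, which this sketch skips:

from collections import Counter

def good_turing(ngram_counts):
    """Adjusted counts r* = (r + 1) * N_{r+1} / N_r, plus the total unseen mass."""
    N = Counter(ngram_counts.values())       # N[r] = number of n-grams seen exactly r times
    total = sum(ngram_counts.values())
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if N[r + 1] > 0:
            adjusted[ngram] = (r + 1) * N[r + 1] / N[r]
        else:
            adjusted[ngram] = r               # no data at count r+1: keep the raw count
    p_unseen = N[1] / total if total else 0.0 # mass reserved for all unseen n-grams
    return adjusted, p_unseen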

Kneser-Ney Smoothing

The gold standard for n-grams. Key insights:

  1. Absolute discounting: Subtract a fixed δ from all counts, redistribute to unseen
  2. Continuation probability: For backoff, use how many different contexts a word appears in, not raw frequency

Kneser-Ney accounts for the fact that common words appear in many different contexts, while other words appear frequently but only in very specific contexts (the classic example is "Francisco", which occurs often yet almost only after "San").
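
A sketch of interpolated Kneser-Ney for bigrams, assuming bigram counts in a dict keyed by word pairs; the fixed discount d = 0.75 and the helper names are illustrative choices (production implementations usually estimate the discounts from data):

from collections import defaultdict

def kneser_ney_bigram(bigram_counts, d=0.75):
    """Return a function giving interpolated Kneser-Ney P(w | v) from bigram counts."""
    context_total = defaultdict(int)    # c(v): total count of context v
    followers = defaultdict(set)        # distinct words seen after v
    histories = defaultdict(set)        # distinct contexts w has been seen in
    for (v, w), c in bigram_counts.items():
        context_total[v] += c
        followers[v].add(w)
        histories[w].add(v)
    distinct_bigrams = len(bigram_counts)

    def prob(w, v):
        p_cont = len(histories[w]) / distinct_bigrams    # continuation probability
        if context_total[v] == 0:                        # unseen context: pure continuation
            return p_cont
        discounted = max(bigram_counts.get((v, w), 0) - d, 0) / context_total[v]
        lam = d * len(followers[v]) / context_total[v]   # mass freed by discounting
        return discounted + lam * p_cont

    return prob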

Backoff and Interpolation

When the context hasn’t been seen, we need fallback strategies.

Stupid Backoff

Simple but effective (used in Google’s web-scale LMs). The result is an un-normalized score rather than a true probability, which is fine for ranking candidate continuations:

P(w | ABC) =
    if count(ABCw) > 0: count(ABCw) / count(ABC)
    else: 0.4 × P(w | BC)  # back off with a fixed multiplier
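
A recursive Python sketch of the same idea, assuming counts of every n-gram order (including unigrams) stored in one dict keyed by word tuples; the function name is illustrative:

def stupid_backoff(counts, context, w, discount=0.4):
    """Stupid backoff score S(w | context): a relative score, not a normalized probability.

    counts: dict mapping word tuples of every order (including unigrams) -> count
    context: tuple of preceding words, oldest first
    """
    if not context:                              # base case: unigram relative frequency
        total = sum(c for gram, c in counts.items() if len(gram) == 1)
        return counts.get((w,), 0) / total if total else 0.0
    ngram = context + (w,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[context]
    return discount * stupid_backoff(counts, context[1:], w, discount)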

Katz Backoff

Reserve probability mass for unseen events using Good-Turing:

P(w | ABC) =
    if count(ABCw) > 0: discounted_count(ABCw) / count(ABC)
    else: α(ABC) × P(w | BC)

Where α(ABC) normalizes the leftover probability mass.
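
Concretely, in the standard Katz formulation the leftover mass from discounting is renormalized over the backoff probabilities of the words never seen after ABC, so the distribution still sums to one:

α(ABC) = (1 − Σ_{w: count(ABCw) > 0} discounted_count(ABCw) / count(ABC))
         / (1 − Σ_{w: count(ABCw) > 0} P(w | BC))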

Interpolation

Always mix predictions from all context lengths:

P_interp(w | AB) = λ₃ × P_ML(w | AB) + λ₂ × P_ML(w | B) + λ₁ × P_ML(w)

Where the P_ML terms are the maximum-likelihood estimates at each order and λ₁ + λ₂ + λ₃ = 1, with the weights typically tuned on held-out data.
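
A sketch of fixed-weight linear interpolation for a trigram model, assuming raw counts of each order in dicts keyed by words and word tuples; the function name and default weights are illustrative:

def interpolated_trigram_prob(w, prev2, prev1,
                              uni_counts, bi_counts, tri_counts, total_tokens,
                              lambdas=(0.1, 0.3, 0.6)):
    """P_interp(w | prev2, prev1): a fixed-weight mix of ML estimates at three orders.

    uni_counts: word -> count; bi_counts: (w1, w2) -> count;
    tri_counts: (w1, w2, w3) -> count; lambdas = (λ1, λ2, λ3) sum to 1.
    """
    l1, l2, l3 = lambdas
    p_uni = uni_counts.get(w, 0) / total_tokens
    p_bi = (bi_counts.get((prev1, w), 0) / uni_counts[prev1]
            if uni_counts.get(prev1, 0) else 0.0)
    p_tri = (tri_counts.get((prev2, prev1, w), 0) / bi_counts[(prev2, prev1)]
             if bi_counts.get((prev2, prev1), 0) else 0.0)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni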

N-grams vs. CTW

CTW can be viewed as a principled form of interpolation with learned weights:

Aspect            N-gram                      CTW
Mixing weights    Fixed/tuned (λᵢ)            Learned from data
Context order     Fixed N                     Adaptive per context
Smoothing         Heuristic (Kneser-Ney)      Principled (KT estimator)
Guarantees        Empirical                   Theoretical bounds
Model class       Fixed-order Markov          All tree sources

Key insight: CTW automatically learns appropriate interpolation weights based on how well each context depth predicts. It achieves what n-gram smoothing heuristics are trying to approximate, with formal optimality guarantees over the class of tree sources.

Strengths of N-grams

Despite limitations, n-grams have real advantages:

  1. Simple: Just counting and division
  2. Fast: Single pass over training data
  3. Interpretable: Can examine specific n-gram probabilities
  4. Efficient: Hash tables or tries for storage
  5. Baseline: Hard to beat for some tasks with limited compute

Limitations

  1. Fixed context: Can’t capture dependencies beyond N-1
  2. No generalization: “cat sat” and “dog sat” share nothing
  3. Sparsity: Higher N = exponentially more unseen contexts
  4. No compositionality: No notion of syntax or semantics

These limitations motivated neural approaches that learn dense representations where similar contexts share information.

Binary vs. Text

For binary sequences (CTW’s domain):

  • Vocabulary |V| = 2
  • Sparsity less severe: only 2^(N−1) possible contexts for an N-gram model
  • CTW is essentially an optimal binary n-gram with learned interpolation

For text:

  • |V| ~ 10,000 - 100,000 words
  • Sparsity is severe: 10,000³ = 10¹² possible trigrams
  • Subword tokenization (BPE) helps
  • Neural models now dominate

Code Example

A simple bigram model:

from collections import defaultdict

class BigramModel:
    def __init__(self, alpha=1.0):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.context_counts = defaultdict(int)
        self.alpha = alpha  # smoothing
        self.vocab_size = 2  # binary

    def train(self, sequence):
        for i in range(1, len(sequence)):
            context = sequence[i-1]
            symbol = sequence[i]
            self.counts[context][symbol] += 1
            self.context_counts[context] += 1

    def predict(self, context):
        """Return P(next=1 | context)."""
        c0 = self.counts[context][0]
        c1 = self.counts[context][1]
        total = c0 + c1
        # Add-alpha smoothing
        return (c1 + self.alpha) / (total + self.alpha * self.vocab_size)
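
A quick usage sketch on a toy binary sequence (illustrative data):

model = BigramModel(alpha=1.0)
model.train([0, 1, 1, 0, 1, 1, 0, 1])
print(model.predict(1))   # 0.5: after a 1, the sequence has two 0s and two 1s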

Compare to CTW: same basic structure (count statistics per context) but with principled Bayesian machinery.

Historical Impact

N-grams were the foundation of:

  • Speech recognition (before neural era)
  • Machine translation (phrase-based MT)
  • Spelling correction
  • Text compression (PPM family)

Understanding n-grams provides intuition for what neural language models are learning—but with representations that allow generalization.

Key Takeaways

  • N-grams estimate P(next | context) by counting
  • Sparsity is the fundamental challenge
  • Smoothing (Kneser-Ney) redistributes mass to unseen events
  • Interpolation/backoff combines multiple context lengths
  • CTW is like optimal n-gram interpolation with learned weights
  • Limitations motivate neural approaches for text

What’s Next

The next post compares CTW, n-grams, and neural language models head-to-head, examining when each approach excels and the fundamental trade-offs between sample efficiency and flexibility.

Further Reading

  • Jurafsky & Martin. Speech and Language Processing. Chapter 3.
  • Chen, S. F., & Goodman, J. (1999). “An empirical study of smoothing techniques for language modeling.”
  • Kneser, R., & Ney, H. (1995). “Improved backing-off for m-gram language modeling.”
