N-gram Language Models

Before neural networks dominated NLP, n-gram models were the workhorse of language modeling. They remain foundational—understanding n-grams provides intuition for what modern LLMs are doing, and reveals why their limitations motivated both CTW and deep learning approaches.

The N-gram Idea

N-gram models make a Markov assumption: only the last N-1 symbols matter.

P(x_n | x_{n-1}, x_{n-2}, ..., x_1) ≈ P(x_n | x_{n-1}, ..., x_{n-N+1})

For a bigram (N=2) model over words:

P(word | prev_word) = count(prev_word, word) / count(prev_word)

For a trigram (N=3):

P(word | prev2, prev1) = count(prev2, prev1, word) / count(prev2, prev1)

It’s just counting how often sequences appear, then normalizing.
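
For example, if a training corpus contains the bigram "the cat" 10 times and the word "the" 200 times in total (illustrative counts, not from a real corpus), the maximum-likelihood estimate is:

P(cat | the) = count(the, cat) / count(the) = 10 / 200 = 0.05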

The Sparsity Problem

As N increases, n-gram models face severe data sparsity:

N    Context size    English vocabulary    Possible contexts
2    1 word          50,000                50,000
3    2 words         50,000                2.5 billion
4    3 words         50,000                125 trillion

Most contexts never appear in any training corpus. Maximum likelihood assigns P = 0 to any n-gram that never occurred in training, leading to:

  • Infinite perplexity on test data
  • Zero probability for reasonable sentences
  • Catastrophic failures in practice

Smoothing: Handling Unseen Events

Smoothing techniques reallocate probability mass to unseen events.

Laplace (Add-One) Smoothing

The simplest fix:

P(w | context) = (count(context, w) + 1) / (count(context) + |V|)

Problems:

  • Adds too much mass to unseen events
  • Assumes all unseen events equally likely
  • Performance is mediocre

Add-k Smoothing

Generalized Laplace:

P(w | context) = (count(context, w) + k) / (count(context) + k × |V|)

Tune k on held-out data. Still relatively crude.
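
As a sketch, add-k smoothing is a one-line change to the maximum-likelihood estimate; with k = 1 it reduces to the Laplace estimate above. The function and dictionary layout here are illustrative, not from a particular library:

def add_k_prob(counts, context_counts, context, w, k=1.0, vocab_size=50_000):
    """Add-k estimate of P(w | context); k = 1 recovers Laplace smoothing.

    counts: dict mapping (context, w) -> n-gram count
    context_counts: dict mapping context -> total count of that context
    """
    return ((counts.get((context, w), 0) + k)
            / (context_counts.get(context, 0) + k * vocab_size))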

Good-Turing Smoothing

A more principled approach based on frequency of frequencies:

  • Estimate probability of unseen events from events seen exactly once
  • Events seen r times: estimate their true frequency from events seen r+1 times

Good-Turing provides better estimates but is mathematically intricate.
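
A minimal sketch of the core calculation, assuming n-gram counts in a plain dict; practical implementations (e.g., Simple Good-Turing) also smooth the frequency-of-frequency counts, which this sketch skips:

from collections import Counter

def good_turing(ngram_counts):
    """Adjusted counts r* = (r + 1) * N_{r+1} / N_r, plus the total unseen mass."""
    N = Counter(ngram_counts.values())       # N[r] = number of n-grams seen exactly r times
    total = sum(ngram_counts.values())
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if N[r + 1] > 0:
            adjusted[ngram] = (r + 1) * N[r + 1] / N[r]
        else:
            adjusted[ngram] = r               # no data at count r+1: keep the raw count
    p_unseen = N[1] / total if total else 0.0 # mass reserved for all unseen n-grams
    return adjusted, p_unseen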

Kneser-Ney Smoothing

The gold standard for n-grams. Key insights:

  1. Absolute discounting: Subtract a fixed δ from all counts, redistribute to unseen
  2. Continuation probability: For backoff, use how many different contexts a word appears in, not raw frequency

Kneser-Ney accounts for the fact that common words appear in many different contexts, while other words appear frequently but only in very specific contexts (the classic example is "Francisco", which occurs often yet almost only after "San").
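
A sketch of interpolated Kneser-Ney for bigrams, assuming bigram counts in a dict keyed by word pairs; the fixed discount d = 0.75 and the helper names are illustrative choices (production implementations usually estimate the discounts from data):

from collections import defaultdict

def kneser_ney_bigram(bigram_counts, d=0.75):
    """Return a function giving interpolated Kneser-Ney P(w | v) from bigram counts."""
    context_total = defaultdict(int)    # c(v): total count of context v
    followers = defaultdict(set)        # distinct words seen after v
    histories = defaultdict(set)        # distinct contexts w has been seen in
    for (v, w), c in bigram_counts.items():
        context_total[v] += c
        followers[v].add(w)
        histories[w].add(v)
    distinct_bigrams = len(bigram_counts)

    def prob(w, v):
        p_cont = len(histories[w]) / distinct_bigrams    # continuation probability
        if context_total[v] == 0:                        # unseen context: pure continuation
            return p_cont
        discounted = max(bigram_counts.get((v, w), 0) - d, 0) / context_total[v]
        lam = d * len(followers[v]) / context_total[v]   # mass freed by discounting
        return discounted + lam * p_cont

    return prob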

Backoff and Interpolation

When the context hasn’t been seen, we need fallback strategies.

Stupid Backoff

Simple but effective (used in Google’s web-scale LMs). The result is an un-normalized score rather than a true probability, which is fine for ranking candidate continuations:

P(w | ABC) =
    if count(ABCw) > 0: count(ABCw) / count(ABC)
    else: 0.4 × P(w | BC)  # back off with a fixed multiplier
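
A recursive Python sketch of the same idea, assuming counts of every n-gram order (including unigrams) stored in one dict keyed by word tuples; the function name is illustrative:

def stupid_backoff(counts, context, w, discount=0.4):
    """Stupid backoff score S(w | context): a relative score, not a normalized probability.

    counts: dict mapping word tuples of every order (including unigrams) -> count
    context: tuple of preceding words, oldest first
    """
    if not context:                              # base case: unigram relative frequency
        total = sum(c for gram, c in counts.items() if len(gram) == 1)
        return counts.get((w,), 0) / total if total else 0.0
    ngram = context + (w,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[context]
    return discount * stupid_backoff(counts, context[1:], w, discount)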

Katz Backoff

Reserve probability mass for unseen events using Good-Turing:

P(w | ABC) =
    if count(ABCw) > 0: discounted_count(ABCw) / count(ABC)
    else: α(ABC) × P(w | BC)

Where α(ABC) normalizes the leftover probability mass.
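
Concretely, in the standard Katz formulation the leftover mass from discounting is renormalized over the backoff probabilities of the words never seen after ABC, so the distribution still sums to one:

α(ABC) = (1 − Σ_{w: count(ABCw) > 0} discounted_count(ABCw) / count(ABC))
         / (1 − Σ_{w: count(ABCw) > 0} P(w | BC))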

Interpolation

Always mix predictions from all context lengths:

P_interp(w | AB) = λ₃ × P_ML(w | AB) + λ₂ × P_ML(w | B) + λ₁ × P_ML(w)

Where the P_ML terms are the maximum-likelihood estimates at each order and λ₁ + λ₂ + λ₃ = 1, with the weights typically tuned on held-out data.
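
A sketch of fixed-weight linear interpolation for a trigram model, assuming raw counts of each order in dicts keyed by words and word tuples; the function name and default weights are illustrative:

def interpolated_trigram_prob(w, prev2, prev1,
                              uni_counts, bi_counts, tri_counts, total_tokens,
                              lambdas=(0.1, 0.3, 0.6)):
    """P_interp(w | prev2, prev1): a fixed-weight mix of ML estimates at three orders.

    uni_counts: word -> count; bi_counts: (w1, w2) -> count;
    tri_counts: (w1, w2, w3) -> count; lambdas = (λ1, λ2, λ3) sum to 1.
    """
    l1, l2, l3 = lambdas
    p_uni = uni_counts.get(w, 0) / total_tokens
    p_bi = (bi_counts.get((prev1, w), 0) / uni_counts[prev1]
            if uni_counts.get(prev1, 0) else 0.0)
    p_tri = (tri_counts.get((prev2, prev1, w), 0) / bi_counts[(prev2, prev1)]
             if bi_counts.get((prev2, prev1), 0) else 0.0)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni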

N-grams vs. CTW

CTW can be viewed as a principled form of interpolation with learned weights:

Aspect            N-gram                      CTW
Mixing weights    Fixed/tuned (λᵢ)            Learned from data
Context order     Fixed N                     Adaptive per context
Smoothing         Heuristic (Kneser-Ney)      Principled (KT estimator)
Guarantees        Empirical                   Theoretical bounds
Model class       Fixed-order Markov          All tree sources

Key insight: CTW automatically learns appropriate interpolation weights based on how well each context depth predicts. It achieves what n-gram smoothing heuristics are trying to approximate, with formal optimality guarantees over the class of tree sources.

Strengths of N-grams

Despite limitations, n-grams have real advantages:

  1. Simple: Just counting and division
  2. Fast: Single pass over training data
  3. Interpretable: Can examine specific n-gram probabilities
  4. Efficient: Hash tables or tries for storage
  5. Baseline: Hard to beat for some tasks with limited compute

Limitations

  1. Fixed context: Can’t capture dependencies beyond N-1
  2. No generalization: “cat sat” and “dog sat” share nothing
  3. Sparsity: Higher N = exponentially more unseen contexts
  4. No compositionality: No notion of syntax or semantics

These limitations motivated neural approaches that learn dense representations where similar contexts share information.

Binary vs. Text

For binary sequences (CTW’s domain):

  • Vocabulary |V| = 2
  • Sparsity less severe: only 2^(N−1) possible contexts for an N-gram model
  • CTW is essentially an optimal binary n-gram with learned interpolation

For text:

  • |V| ~ 10,000 - 100,000 words
  • Sparsity is severe: 10,000³ = 10¹² possible trigrams
  • Subword tokenization (BPE) helps
  • Neural models now dominate

Code Example

A simple bigram model:

from collections import defaultdict

class BigramModel:
    def __init__(self, alpha=1.0):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.context_counts = defaultdict(int)
        self.alpha = alpha  # smoothing
        self.vocab_size = 2  # binary

    def train(self, sequence):
        for i in range(1, len(sequence)):
            context = sequence[i-1]
            symbol = sequence[i]
            self.counts[context][symbol] += 1
            self.context_counts[context] += 1

    def predict(self, context):
        """Return P(next=1 | context)."""
        c0 = self.counts[context][0]
        c1 = self.counts[context][1]
        total = c0 + c1
        # Add-alpha smoothing
        return (c1 + self.alpha) / (total + self.alpha * self.vocab_size)
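
A quick usage sketch on a toy binary sequence (illustrative data):

model = BigramModel(alpha=1.0)
model.train([0, 1, 1, 0, 1, 1, 0, 1])
print(model.predict(1))   # 0.5: after a 1, the sequence has two 0s and two 1s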

Compare to CTW: same basic structure (count statistics per context) but with principled Bayesian machinery.

Historical Impact

N-grams were the foundation of:

  • Speech recognition (before neural era)
  • Machine translation (phrase-based MT)
  • Spelling correction
  • Text compression (PPM family)

Understanding n-grams provides intuition for what neural language models are learning—but with representations that allow generalization.

Key Takeaways

  • N-grams estimate P(next | context) by counting
  • Sparsity is the fundamental challenge
  • Smoothing (Kneser-Ney) redistributes mass to unseen events
  • Interpolation/backoff combines multiple context lengths
  • CTW is like optimal n-gram interpolation with learned weights
  • Limitations motivate neural approaches for text

What’s Next

The next post compares CTW, n-grams, and neural language models head-to-head, examining when each approach excels and the fundamental trade-offs between sample efficiency and flexibility.

Further Reading

  • Jurafsky & Martin. Speech and Language Processing. Chapter 3.
  • Chen, S. F., & Goodman, J. (1999). “An empirical study of smoothing techniques for language modeling.”
  • Kneser, R., & Ney, H. (1995). “Improved backing-off for m-gram language modeling.”
