Infinigram: Variable-Length N-grams via Suffix Arrays

Infinigram (pip install py-infinigram) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, Infinigram requires no training step: the corpus is the model.

The Core Idea

Traditional n-gram models use fixed context lengths and suffer from exponential memory growth. A 5-gram model over a vocabulary of 50,000 words has up to \(50000^5 \approx 3 \times 10^{23}\) possible patterns, which is hundreds of zettabytes even at one byte per count.
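The blow-up is easy to verify with a few lines of arithmetic:

```python
# Upper bound on distinct 5-gram patterns over a 50,000-word vocabulary
vocab_size = 50_000
order = 5
patterns = vocab_size ** order
print(patterns)  # 312500000000000000000000 (~3.1e23)
```

No realistic corpus comes close to containing all of these, which is exactly why counting occurrences in the corpus itself is the cheaper representation.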

Infinigram sidesteps this by using suffix arrays:

  • O(n) space: Linear in corpus size, not vocabulary size
  • O(m log n) queries: Fast pattern matching for any context length
  • Variable-length matching: Automatically uses the longest matching context

For a 1B token corpus, this means ~1GB instead of ~34GB for hash-based 5-grams.

How It Works

Given a context, Infinigram finds the longest matching suffix in the training corpus:

from infinigram import Infinigram

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])

# Predict next token
probs = model.predict([2, 3])
# {4: 0.67, 5: 0.33}

Predictions come from observing what tokens follow the matched pattern in the corpus.
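The prediction step can be sketched naively as brute-force scanning rather than the library's suffix-array lookup (`predict_naive` below is illustrative, not the py-infinigram API): back off from the full context to shorter and shorter suffixes until one matches, then count what follows each occurrence.

```python
from collections import Counter

def predict_naive(corpus, context):
    """Longest-suffix matching by brute force: try the longest context suffix first."""
    for start in range(len(context)):
        suffix = context[start:]
        # Count the tokens that follow each occurrence of this suffix
        counts = Counter(
            corpus[i + len(suffix)]
            for i in range(len(corpus) - len(suffix))
            if corpus[i:i + len(suffix)] == suffix
        )
        if counts:  # longest matching suffix found; normalize to probabilities
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return {}

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
print(predict_naive(corpus, [2, 3]))  # {4: 0.666..., 5: 0.333...}
```

The backoff loop is what makes the context length variable: a context like [9, 2, 3] falls through to the [2, 3] match automatically.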

LLM Probability Mixing

The key practical application is grounding LLM outputs without retraining:

# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram

This provides:

  • Domain adaptation without fine-tuning (e.g., legal corpus for legal text)
  • Hallucination reduction by anchoring to actual corpus content
  • Explainability: every prediction traces to specific corpus evidence
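The interpolation itself is a one-liner per token; a minimal sketch, assuming both models expose next-token distributions as dicts (function and variable names here are illustrative, not a library API):

```python
def mix(p_llm, p_infinigram, alpha=0.7):
    """Linear interpolation of two next-token distributions."""
    tokens = set(p_llm) | set(p_infinigram)
    return {t: alpha * p_llm.get(t, 0.0) + (1 - alpha) * p_infinigram.get(t, 0.0)
            for t in tokens}

p_llm = {"cat": 0.5, "dog": 0.3, "mat": 0.2}
p_corpus = {"mat": 0.9, "cat": 0.1}
print(mix(p_llm, p_corpus, alpha=0.7))
```

Since both inputs sum to 1, the mixture does too; alpha controls how strongly the corpus can pull the LLM toward observed continuations.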

Projections as Inductive Biases

The paper introduces a theoretical framework viewing inductive biases as projections—transformations applied to queries or training data that enable generalization:

  • Runtime transforms: Lowercase normalization, stemming, synonym expansion
  • Corpus augmentations: Data augmentation, paraphrasing

This provides a principled approach to out-of-distribution generalization in corpus-based models.
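A runtime projection can be as simple as normalizing the query before matching; a toy sketch (the helper name is hypothetical, not from the paper or library):

```python
def project_lowercase(tokens):
    """Runtime projection: normalize a query before matching against the corpus."""
    return [t.lower() for t in tokens]

corpus = "the cat sat on the mat".split()
query = ["The", "Cat"]  # never appears verbatim in the corpus
projected = project_lowercase(query)
print(projected)  # ['the', 'cat'] -- this projection does appear
```

The raw query is out-of-distribution for exact matching, but its projection lands back inside the corpus, which is the sense in which projections enable generalization.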

Interactive REPL

Infinigram includes an interactive REPL for exploration:

infinigram-repl

infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20

Future: LangCalc Integration

Infinigram is designed to work with LangCalc, an algebraic framework for language model composition:

# Compose models algebraically
model = 0.7 * llm + 0.2 * wiki_infinigram + 0.1 * code_infinigram

This enables mixing neural and corpus-based models with explicit control over domain influence.
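One way such an algebra could be realized is operator overloading; the sketch below is hypothetical and not the actual LangCalc API, with stub models standing in for the LLM and the Infinigram instances:

```python
class Scaled:
    """A weighted model term; `+` concatenates terms into a mixture."""
    def __init__(self, weight, model):
        self.terms = [(weight, model)]

    def __add__(self, other):
        combined = Scaled.__new__(Scaled)
        combined.terms = self.terms + other.terms
        return combined

    def predict(self, context):
        mixture = {}
        for w, m in self.terms:
            for tok, p in m.predict(context).items():
                mixture[tok] = mixture.get(tok, 0.0) + w * p
        return mixture

class ConstModel:
    """Stub model returning a fixed distribution (stand-in for an LLM or Infinigram)."""
    def __init__(self, dist):
        self.dist = dist

    def predict(self, context):
        return self.dist

    def __rmul__(self, weight):  # enables `0.7 * model`
        return Scaled(weight, self)

llm = ConstModel({"mat": 0.6, "cat": 0.4})
wiki = ConstModel({"mat": 1.0})
model = 0.7 * llm + 0.3 * wiki
print(model.predict([]))  # weighted mixture of the two distributions
```

The weights act exactly like the alpha in the mixing formula above, but the expression syntax scales naturally to three or more components.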

Background

This work builds on ideas I explored in a SLUUG talk on LLMs, where I demonstrated arbitrary-size n-grams using recursive dictionaries for a crude expression evaluator. Infinigram takes these ideas further with an efficient suffix-array implementation and a projection framework for generalization.
