Infinigram (`pip install py-infinigram`) is a corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching. Unlike neural language models, Infinigram needs no training step: the corpus is the model.
## The Core Idea
Traditional n-gram models use fixed context lengths and suffer from combinatorial growth in the pattern space. A 5-gram model over a vocabulary of 50,000 words has up to \(50000^5 \approx 3 \times 10^{23}\) possible patterns, far more than any table could store.
Infinigram sidesteps this by using suffix arrays:
- **O(n) space**: Linear in corpus size, not vocabulary size
- **O(m log n) queries**: Fast pattern matching for any context length
- **Variable-length matching**: Automatically uses the longest matching context
For a 1B token corpus, this means ~1GB instead of ~34GB for hash-based 5-grams.
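To make the idea concrete, here is a minimal, self-contained sketch (not the library's actual implementation): build a suffix array over a token list, then binary-search the sorted suffixes to find every occurrence of a pattern in O(m log n).

```python
def build_suffix_array(corpus):
    # Naive O(n^2 log n) build for illustration; production builders
    # (e.g., SA-IS) run in O(n)
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])

def occurrence_range(corpus, sa, pattern):
    # Two binary searches over the sorted suffixes: O(m log n) total
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # lower bound
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(sa)
    while lo < hi:                       # upper bound
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo                     # matches are sa[first:lo]

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
sa = build_suffix_array(corpus)
first, last = occurrence_range(corpus, sa, [2, 3])
print(last - first)  # 3 occurrences of [2, 3]
```

Because the suffix array is just one sorted index per corpus position, the space stays linear in corpus size regardless of vocabulary or context length.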
## How It Works
Given a context, Infinigram finds the longest matching suffix in the training corpus:
```python
from infinigram import Infinigram

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Find longest match for context [2, 3]
position, length = model.longest_suffix([2, 3])

# Predict next token
probs = model.predict([2, 3])
# {4: 0.66, 5: 0.33, ...}
```
Predictions come from observing what tokens follow the matched pattern in the corpus.
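Conceptually, the distribution is just normalized continuation counts, backing off to shorter suffixes of the context until something matches. A plain-Python sketch of that idea (independent of the library's internals, which use the suffix array rather than a linear scan):

```python
from collections import Counter

def predict_next(corpus, context):
    """Back off to ever-shorter suffixes of `context` until some match
    is found, then normalize the counts of the tokens that follow."""
    for start in range(len(context)):
        suffix = context[start:]
        counts = Counter(
            corpus[i + len(suffix)]
            for i in range(len(corpus) - len(suffix))
            if corpus[i:i + len(suffix)] == suffix
        )
        if counts:
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return {}

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
print(predict_next(corpus, [2, 3]))  # {4: 0.666..., 5: 0.333...}
```

The context [2, 3] occurs three times; it is followed by 4 twice and by 5 once, which yields the probabilities shown above.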
## LLM Probability Mixing
The key practical application is grounding LLM outputs without retraining:
```python
# Mix LLM with corpus-based predictions
P_final = alpha * P_llm + (1 - alpha) * P_infinigram
```
This provides:
- Domain adaptation without fine-tuning (e.g., legal corpus for legal text)
- Hallucination reduction by anchoring to actual corpus content
- Explainability: every prediction traces to specific corpus evidence
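As a minimal sketch of the interpolation, assuming both models expose next-token distributions as token-to-probability dictionaries (the names and toy distributions below are illustrative):

```python
def mix(p_llm, p_infinigram, alpha=0.7):
    """Linear interpolation of two next-token distributions."""
    tokens = set(p_llm) | set(p_infinigram)
    return {
        t: alpha * p_llm.get(t, 0.0) + (1 - alpha) * p_infinigram.get(t, 0.0)
        for t in tokens
    }

# Toy example: the corpus model concentrates mass on in-domain tokens
p = mix({"cat": 0.6, "dog": 0.4}, {"cat": 0.9, "mat": 0.1}, alpha=0.5)
# p["cat"] = 0.5 * 0.6 + 0.5 * 0.9 = 0.75
```

Because both inputs are proper distributions, the mixture sums to 1 as well, so it can be sampled or ranked directly.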
## Projections as Inductive Biases
The paper introduces a theoretical framework viewing inductive biases as projections—transformations applied to queries or training data that enable generalization:
- Runtime transforms: Lowercase normalization, stemming, synonym expansion
- Corpus augmentations: Data augmentation, paraphrasing
This provides a principled approach to out-of-distribution generalization in corpus-based models.
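A runtime projection can be as simple as normalizing the query before matching it against a corpus indexed in normalized form. A hedged sketch (the choice of lowercasing as the transform is just one example of the framework's runtime transforms):

```python
def project(tokens):
    # Runtime transform: lowercase each token so queries match a
    # corpus that was indexed in lowercase form
    return [t.lower() for t in tokens]

corpus = "the cat sat on the mat".split()
query = project("The Cat".split())

# The projected query matches even though the surface forms differ
found = any(corpus[i:i + len(query)] == query
            for i in range(len(corpus) - len(query) + 1))
print(found)  # True
```

The same pattern covers the other direction: applying the transform to the corpus at indexing time (stemming, paraphrasing) is a corpus augmentation rather than a query projection.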
## Interactive REPL
Infinigram includes an interactive REPL for exploration:
```
infinigram-repl
infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
infinigram [demo]> /predict the cat
infinigram [demo]> /complete the cat --max 20
```
## Future: LangCalc Integration
Infinigram is designed to work with LangCalc, an algebraic framework for language model composition:
```python
# Compose models algebraically
model = 0.7 * llm + 0.2 * wiki_infinigram + 0.1 * code_infinigram
```
This enables mixing neural and corpus-based models with explicit control over domain influence.
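One way such an algebra could be realized in Python is operator overloading over prediction interfaces. Everything below (the `Mixture` and `FixedModel` names, the `lift` helper) is a hypothetical sketch, not LangCalc's actual API:

```python
class Mixture:
    """Weighted combination of models; each model exposes
    predict(context) -> {token: prob}."""
    def __init__(self, parts):
        self.parts = parts            # list of (weight, model) pairs

    def __rmul__(self, w):            # enables `0.7 * model`
        return Mixture([(w * wt, m) for wt, m in self.parts])

    def __add__(self, other):         # enables `model_a + model_b`
        return Mixture(self.parts + other.parts)

    def predict(self, context):
        out = {}
        for w, m in self.parts:
            for tok, p in m.predict(context).items():
                out[tok] = out.get(tok, 0.0) + w * p
        return out

class FixedModel:
    """Toy stand-in that always returns the same distribution."""
    def __init__(self, dist):
        self.dist = dist
    def predict(self, context):
        return dict(self.dist)

def lift(dist):
    # Wrap a toy model so the algebraic operators apply to it
    return Mixture([(1.0, FixedModel(dist))])

llm = lift({"cat": 0.6, "dog": 0.4})
wiki = lift({"cat": 1.0})
combo = 0.7 * llm + 0.3 * wiki
print(combo.predict([]))  # {'cat': ~0.72, 'dog': ~0.28}
```

The weights act exactly like the interpolation coefficient `alpha` above, generalized from two models to arbitrarily many.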
## Resources
- PyPI: `pip install py-infinigram`
- GitHub: github.com/queelius/infinigram
- Paper: Infinigram Technical Report
- Related: LangCalc Project
## Background
This work builds on ideas I explored in a SLUUG talk on LLMs, where I demonstrated arbitrary-size n-grams using recursive dictionaries for a crude expression evaluator. Infinigram takes these ideas further with an efficient suffix-array implementation and the projection framework for generalization.