# Infinigram: Variable-Length N-gram Language Models

Efficient variable-length n-gram language models using suffix arrays for O(m log n) pattern matching.
## Overview

Infinigram is a language model that uses suffix arrays to efficiently find and score variable-length patterns in a corpus. Unlike traditional fixed-order n-grams, Infinigram automatically finds the longest matching context (up to a configurable maximum) and uses it for prediction.
Key advantages over traditional n-grams:

- **Variable-length patterns**: automatically uses as much context as is available
- **Memory efficient**: O(n) space vs. O(V^n) for hash-based n-grams
- **Accurate**: longer matched context generally yields better predictions
- **Dynamic updates**: new data can be appended to the corpus
- **Fast queries**: O(m log n) suffix array search
## Quick Start

### Installation

```bash
pip install infinigram
```

Or install from source:

```bash
git clone https://github.com/yourusername/infinigram.git
cd infinigram
pip install -e .
```
### Interactive REPL

The easiest way to explore Infinigram is through the interactive REPL:

```bash
# Launch the REPL
infinigram-repl

# Or using Python
python -m infinigram.repl
```
Quick REPL tour:

```text
# Create a dataset (datasets are models!)
infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat

# Make predictions
infinigram [demo]> /predict the cat
# Shows the probability distribution for the next byte

# Generate completions
infinigram [demo]> /complete the cat --max 20

# Apply augmentations for case-insensitive matching
infinigram [demo]> /augment lowercase uppercase

# Create multiple datasets and compare
infinigram [demo]> /dataset copy demo demo_backup
infinigram [demo]> /datasets
infinigram [demo]> /use demo_backup

# Use bash commands for data inspection
infinigram [demo]> !ls -lh data/
infinigram [demo]> !head data/sample.txt

# Type /help for all commands
infinigram [demo]> /help
```
See REPL_GUIDE.md for comprehensive documentation.
### Basic Usage (Python API)

```python
from infinigram import Infinigram

# Create a model from a corpus
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Predict the next token
context = [2, 3]
probs = model.predict(context)
print(probs)
# {4: 0.6569, 5: 0.3301, 1: 0.0033, ...}

# Find the longest matching suffix
position, length = model.longest_suffix(context)
print(f"Matched {length} tokens at position {position}")

# Get a confidence score (0.0 to 1.0)
confidence = model.confidence(context)
print(f"Confidence: {confidence:.4f}")

# Update with new data
model.update([2, 3, 7, 8, 9])
```
### Text Example

```python
from infinigram import Infinigram

# Prepare a text corpus
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ran on the mat",
]

# Build a vocabulary and tokenize
vocab = {}
corpus = []
for sent in sentences:
    for word in sent.split():
        if word not in vocab:
            vocab[word] = len(vocab)
        corpus.append(vocab[word])

# Create the model
model = Infinigram(corpus, max_length=5)

# Predict the word after "the cat"
context = [vocab["the"], vocab["cat"]]
probs = model.predict(context)

# Convert predictions back to words
id_to_word = {v: k for k, v in vocab.items()}
for token_id, prob in sorted(probs.items(), key=lambda x: x[1], reverse=True)[:3]:
    print(f"{id_to_word[token_id]}: {prob:.3f}")
```
## Key Features

### Variable-Length Pattern Matching

Infinigram automatically finds the longest matching suffix in the corpus:

```python
corpus = [1, 2, 3, 4, 5, 1, 2, 3, 6, 1, 2, 3, 4, 7]
model = Infinigram(corpus, max_length=10)

# Short context: finds an exact match
model.longest_suffix([1, 2])        # Returns (0, 2)

# Longer context: finds a longer match
model.longest_suffix([1, 2, 3, 4])  # Returns (0, 4)
```
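The matching semantics can be illustrated with a brute-force reference implementation (a sketch for clarity only; `Infinigram` finds the same answer via the suffix array in O(m log n)):

```python
def longest_suffix_brute(corpus, context, max_length=None):
    """Find the longest suffix of `context` occurring in `corpus`.

    Returns (position, length) of the first occurrence of the longest
    matching suffix, or (-1, 0) if no suffix matches.
    """
    limit = len(context) if max_length is None else min(len(context), max_length)
    # Try the longest suffix first, shrinking until a match is found.
    for length in range(limit, 0, -1):
        pattern = context[-length:]
        for pos in range(len(corpus) - length + 1):
            if corpus[pos:pos + length] == pattern:
                return pos, length
    return -1, 0

corpus = [1, 2, 3, 4, 5, 1, 2, 3, 6, 1, 2, 3, 4, 7]
print(longest_suffix_brute(corpus, [1, 2]))        # (0, 2)
print(longest_suffix_brute(corpus, [1, 2, 3, 4]))  # (0, 4)
```

Scanning every start position costs O(n·m) per candidate length; this is exactly the work the suffix array's binary search avoids.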
### Confidence Scoring

Longer matches yield higher confidence:

```python
model.confidence([1])           # Low confidence (~0.07)
model.confidence([1, 2])        # Medium confidence (~0.15)
model.confidence([1, 2, 3, 4])  # High confidence (~0.29)
```
### Dynamic Corpus Updates

Add new data without rebuilding the model from scratch:

```python
model = Infinigram([1, 2, 3, 4, 5])
print(len(model.corpus))  # 5

model.update([6, 7, 8, 9])
print(len(model.corpus))  # 9

# Predictions automatically reflect the new data
```
### Efficient Suffix Arrays

The underlying suffix array provides:

- O(n log n) construction time
- O(m log n) query time (m = pattern length, n = corpus size)
- O(n) space complexity
- Binary search for pattern matching
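A toy version of that machinery can be sketched as follows (naive O(n² log n) construction for illustration; the package uses a faster algorithm):

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(corpus):
    """Sort all suffix start positions lexicographically (naive construction)."""
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])

def count_occurrences(corpus, sa, pattern):
    """Count occurrences of `pattern` by binary-searching the sorted suffixes."""
    m = len(pattern)
    # Materializing the m-token prefix of every suffix keeps the sketch short;
    # a real implementation compares tokens lazily inside the binary search.
    keys = [corpus[i:i + m] for i in sa]
    return bisect_right(keys, pattern) - bisect_left(keys, pattern)

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, [2, 3]))     # 3
print(count_occurrences(corpus, sa, [2, 3, 4]))  # 2
```

Because the suffixes are sorted, all occurrences of a pattern occupy one contiguous run in the array, which is what makes the two binary searches sufficient.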
## Performance

Benchmarks across corpus sizes:

| Corpus Size | Construction | Avg Prediction | Avg Suffix Search |
|---|---|---|---|
| 100 tokens | 0.07 ms | 0.043 ms | 0.014 ms |
| 1K tokens | 6.09 ms | 0.390 ms | 0.184 ms |
| 10K tokens | 718 ms | 4.370 ms | 2.373 ms |

Memory efficiency for a 1B-token corpus:

- Hash-based 5-gram: ~34 GB
- Infinigram suffix array: ~1 GB
- ~34x reduction in memory usage
## API Reference

### `Infinigram` Class

```python
class Infinigram:
    def __init__(
        self,
        corpus: List[int],
        max_length: Optional[int] = None,
        min_count: int = 1,
        smoothing: float = 0.01,
    )
```

Parameters:

- `corpus`: List of integer token IDs
- `max_length`: Maximum pattern length to consider (`None` = unlimited)
- `min_count`: Minimum occurrences for a pattern to be included
- `smoothing`: Laplace smoothing factor for unseen tokens
### Methods

#### `predict(context, top_k=50)`

Predict the probability distribution over next tokens.

Returns: `Dict[int, float]` mapping token IDs to probabilities.

#### `longest_suffix(context)`

Find the longest matching suffix in the corpus.

Returns: `Tuple[int, int]` giving the (position, length) of the match.

#### `confidence(context)`

Get a confidence score based on match length.

Returns: `float` between 0.0 and 1.0.

#### `continuations(context, position, length)`

Get all tokens that follow a matched pattern.

Returns: `Counter[int]` of token counts.

#### `update(new_tokens)`

Add new tokens to the corpus and rebuild the suffix array.

Parameters:

- `new_tokens`: List of new token IDs to append
## Testing

The package includes 36 tests:

```bash
# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=infinigram --cov-report=html

# Run specific test categories
pytest tests/test_suffix_array.py  # Suffix array tests
pytest tests/test_infinigram.py    # Infinigram tests
pytest tests/test_integration.py   # Integration tests
```

Test coverage:

- Suffix array construction and queries
- Longest suffix matching
- Continuation probability computation
- Prediction with smoothing
- Confidence scoring
- Dynamic corpus updates
- Edge cases (empty corpus, long contexts)
- Integration scenarios (Wikipedia, code completion)
## Documentation

- API Reference - Complete API documentation
- Design Document - Architecture and implementation details
- Benchmarks - Performance analysis
## Use Cases

### Code Completion

```python
# Train on source code (tokenize_source_files stands in for your own tokenizer)
code_corpus = tokenize_source_files("src/**/*.py")
model = Infinigram(code_corpus, max_length=50)

# Complete code
context = tokenize("def factorial(n):")
suggestions = model.predict(context, top_k=5)
```

### Text Generation

```python
# Train on a text corpus
text_corpus = tokenize_books(["book1.txt", "book2.txt"])
model = Infinigram(text_corpus, max_length=20)

# Generate text
context = tokenize("Once upon a time")
next_word = sample_from_distribution(model.predict(context))
```
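`sample_from_distribution` is not provided by the package; a minimal temperature-sampling helper over the `{token_id: probability}` dict returned by `predict` might look like this (a sketch; the name and signature are illustrative):

```python
import random

def sample_from_distribution(probs, temperature=1.0, rng=None):
    """Sample a token ID from a {token_id: probability} mapping.

    temperature < 1.0 sharpens the distribution; > 1.0 flattens it.
    """
    rng = rng or random.Random()
    tokens = list(probs)
    # random.choices normalizes the weights, so no explicit division is needed.
    weights = [probs[t] ** (1.0 / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {4: 0.6569, 5: 0.3301, 1: 0.0033}
next_token = sample_from_distribution(probs, temperature=0.8)
assert next_token in probs
```

Passing an explicit `rng=random.Random(seed)` makes generation reproducible.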
### Autocomplete

```python
# Train on user queries
query_corpus = tokenize_queries(user_queries)
model = Infinigram(query_corpus, max_length=10)

# Suggest completions
partial_query = tokenize("how to make")
completions = model.predict(partial_query, top_k=10)
```
## Integration

Infinigram can be used standalone or integrated with other systems.

### With LangCalc

```python
# Use as a language model in the LangCalc framework
from langcalc.models import LanguageModel
from infinigram import Infinigram

# Infinigram implements the LanguageModel interface
wiki = Infinigram(wikipedia_corpus, max_length=20)
news = Infinigram(news_corpus, max_length=15)

# Compose with other models
model = 0.7 * llm + 0.2 * wiki + 0.1 * news
```
### Custom Tokenization

```python
from infinigram import Infinigram

class TextInfinigram:
    def __init__(self, texts, max_length=20):
        # Build a vocabulary
        self.vocab = {}
        corpus = []
        for text in texts:
            for token in self.tokenize(text):
                if token not in self.vocab:
                    self.vocab[token] = len(self.vocab)
                corpus.append(self.vocab[token])
        self.id_to_token = {v: k for k, v in self.vocab.items()}
        self.model = Infinigram(corpus, max_length=max_length)

    def tokenize(self, text):
        return text.lower().split()

    def predict_text(self, context_text):
        context = [self.vocab[t] for t in self.tokenize(context_text) if t in self.vocab]
        probs = self.model.predict(context)
        return {self.id_to_token[tid]: p for tid, p in probs.items()}
```
## Comparison with Traditional N-grams
| Feature | Traditional N-gram | Infinigram |
|---|---|---|
| Pattern Length | Fixed (n) | Variable (1 to max_length) |
| Memory | O(V^n) exponential | O(corpus_size) linear |
| Query Time | O(1) hash lookup | O(m log n) suffix search |
| Context Usage | Last n-1 tokens | Longest matching suffix |
| Updates | Fast (hash insert) | Slow (rebuild suffix array) |
| Best For | Frequent updates, small n | Large patterns, static/batch updates |
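The memory row can be made concrete with a back-of-envelope calculation (the 50k vocabulary size is an assumed, typical figure, not from the benchmarks above):

```python
# Why O(V^n) is prohibitive: with an assumed 50k-token vocabulary,
# the space of possible 5-grams dwarfs any real corpus.
V, n = 50_000, 5
print(f"{V**n:.3e} possible 5-grams")  # 3.125e+23
```

Hash-based implementations only store observed n-grams, of course, but even that observed-count table grows much faster with corpus size than the linear suffix-array index.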
## Roadmap
- Incremental suffix array updates (avoid full rebuild)
- Compressed suffix arrays for large corpora
- Parallel suffix array construction
- GPU acceleration for batch predictions
- Integration with popular NLP libraries
- Pre-trained models for common domains
- Support for character-level and subword tokenization
- Streaming corpus updates
## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
## License

MIT License - see LICENSE for details.
## Contact
- GitHub: yourusername/infinigram
- Issues: GitHub Issues
- Email: your.email@example.com
## Acknowledgments
Infinigram was originally developed as part of the LangCalc project, which explores algebraic frameworks for language model composition.
## References

- Suffix arrays: Manber, U., & Myers, G. (1993). "Suffix arrays: a new method for on-line string searches"
- Variable-length n-grams in language modeling
- Efficient text indexing and retrieval

Built with ❤️ for efficient language modeling