Infinigram: Variable-Length N-gram Language Models


Efficient variable-length n-gram language models using suffix arrays for O(m log n) pattern matching.

🎯 Overview

Infinigram is a language model that uses suffix arrays to efficiently find and score variable-length patterns in a corpus. Unlike traditional fixed-order n-grams, Infinigram automatically finds the longest matching context (up to a configurable maximum) and uses it for prediction.

Key advantages over traditional n-grams:

  • 🚀 Variable-length patterns: Automatically uses as much context as available
  • 💾 Memory efficient: O(n) space vs O(V^n) for hash-based n-grams
  • 🎯 Accurate: Longer context = better predictions
  • 🔄 Dynamic updates: Can add new data to the corpus
  • ⚡ Fast queries: O(m log n) suffix array search

🚀 Quick Start

Installation

pip install infinigram

Or install from source:

git clone https://github.com/yourusername/infinigram.git
cd infinigram
pip install -e .

Interactive REPL

The easiest way to explore Infinigram is through the interactive REPL:

# Launch the REPL
infinigram-repl

# Or using Python
python -m infinigram.repl

Quick REPL Tour:

# Create a dataset (datasets are models!)
infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat

# Make predictions
infinigram [demo]> /predict the cat
# Shows probability distribution for next byte

# Generate completions
infinigram [demo]> /complete the cat --max 20

# Apply augmentations for case-insensitive matching
infinigram [demo]> /augment lowercase uppercase

# Create multiple datasets and compare
infinigram [demo]> /dataset copy demo demo_backup
infinigram [demo]> /datasets
infinigram [demo]> /use demo_backup

# Use bash commands for data inspection
infinigram [demo]> !ls -lh data/
infinigram [demo]> !head data/sample.txt

# Type /help for all commands
infinigram [demo]> /help

See REPL_GUIDE.md for comprehensive documentation.

Basic Usage (Python API)

from infinigram import Infinigram

# Create a model from a corpus
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Predict next token
context = [2, 3]
probs = model.predict(context)
print(probs)
# {4: 0.6569, 5: 0.3301, 1: 0.0033, ...}

# Find longest matching suffix
position, length = model.longest_suffix(context)
print(f"Matched {length} tokens at position {position}")

# Get confidence score (0.0 to 1.0)
confidence = model.confidence(context)
print(f"Confidence: {confidence:.4f}")

# Update with new data
model.update([2, 3, 7, 8, 9])

Text Example

from infinigram import Infinigram

# Prepare text corpus
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ran on the mat"
]

# Build vocabulary and tokenize
vocab = {}
corpus = []
for sent in sentences:
    for word in sent.split():
        if word not in vocab:
            vocab[word] = len(vocab)
        corpus.append(vocab[word])

# Create model
model = Infinigram(corpus, max_length=5)

# Predict after "the cat"
context = [vocab["the"], vocab["cat"]]
probs = model.predict(context)

# Convert predictions back to words
id_to_word = {v: k for k, v in vocab.items()}
for token_id, prob in sorted(probs.items(), key=lambda x: x[1], reverse=True)[:3]:
    print(f"{id_to_word[token_id]}: {prob:.3f}")

🔑 Key Features

Variable-Length Pattern Matching

Infinigram automatically finds the longest matching suffix in the corpus:

corpus = [1, 2, 3, 4, 5, 1, 2, 3, 6, 1, 2, 3, 4, 7]
model = Infinigram(corpus, max_length=10)

# Short context - finds exact match
model.longest_suffix([1, 2])      # Returns (0, 2)

# Longer context - finds longer match
model.longest_suffix([1, 2, 3, 4]) # Returns (0, 4)

Confidence Scoring

Longer matches provide higher confidence:

model.confidence([1])           # Low confidence (~0.07)
model.confidence([1, 2])        # Medium confidence (~0.15)
model.confidence([1, 2, 3, 4])  # High confidence (~0.29)

Dynamic Corpus Updates

Add new data without rebuilding from scratch:

model = Infinigram([1, 2, 3, 4, 5])
print(len(model.corpus))  # 5

model.update([6, 7, 8, 9])
print(len(model.corpus))  # 9

# Predictions automatically reflect new data
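# For instance, the pattern [7, 8] exists only in the appended tokens,
# yet it matches after the update (assuming update() appends to the
# corpus as described in the API reference):
position, length = model.longest_suffix([7, 8])
print(length)  # 2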

Efficient Suffix Arrays

The underlying suffix array provides:

  • O(n log n) construction time
  • O(m log n) query time (m = pattern length, n = corpus size)
  • O(n) space complexity
  • Binary search for pattern matching (sketched below)
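
To make the query path concrete, here is a minimal sketch of suffix-array lookup. This is illustrative only, not the package's internal implementation: each binary-search step compares the pattern against an m-token slice of the corpus, which is where the O(m log n) bound comes from.

from typing import List, Tuple

def build_suffix_array(corpus: List[int]) -> List[int]:
    # Naive O(n^2 log n) construction, for illustration only; a real
    # implementation would use an O(n log n) algorithm such as SA-IS.
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])

def search_range(corpus: List[int], sa: List[int],
                 pattern: List[int]) -> Tuple[int, int]:
    # Binary-search the sorted suffixes for those starting with `pattern`.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound: first suffix >= pattern
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # upper bound: first suffix whose m-token prefix > pattern
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    end = lo
    return start, end  # sa[start:end] are the positions where `pattern` occurs

The occurrence count is then end - start, and the tokens at corpus[p + m] for each p in sa[start:end] are the continuations a model like this one scores.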

📊 Performance

From benchmarks on various corpus sizes:

Corpus Size | Construction | Avg Prediction | Avg Suffix Search
------------|--------------|----------------|------------------
100 tokens  | 0.07 ms      | 0.043 ms       | 0.014 ms
1K tokens   | 6.09 ms      | 0.390 ms       | 0.184 ms
10K tokens  | 718 ms       | 4.370 ms       | 2.373 ms

Memory efficiency: For a 1B token corpus:

  • Hash-based 5-gram: ~34 GB
  • Infinigram suffix array: ~1 GB
  • 34x reduction in memory usage

🛠️ API Reference

Infinigram Class

class Infinigram:
    def __init__(
        self,
        corpus: List[int],
        max_length: Optional[int] = None,
        min_count: int = 1,
        smoothing: float = 0.01
    )

Parameters:

  • corpus: List of integer token IDs
  • max_length: Maximum pattern length to consider (None = unlimited)
  • min_count: Minimum occurrences for a pattern to be included
  • smoothing: Laplace smoothing factor for unseen tokens
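
For instance, using the signature above (the values here are arbitrary):

from infinigram import Infinigram

model = Infinigram(
    corpus=[1, 2, 3, 4, 2, 3, 5],
    max_length=8,    # consider at most 8 tokens of context
    min_count=2,     # ignore patterns seen fewer than 2 times
    smoothing=0.1,   # heavier Laplace smoothing than the 0.01 default
)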

Methods

predict(context, top_k=50)

Predict probability distribution over next tokens.

Returns: Dict[int, float] - Token ID to probability mapping

longest_suffix(context)

Find longest matching suffix in corpus.

Returns: Tuple[int, int] - (position, length) of match

confidence(context)

Get confidence score based on match length.

Returns: float - Confidence score (0.0 to 1.0)

continuations(context, position, length)

Get all tokens that follow a matched pattern.

Returns: Counter[int] - Token counts
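
A sketch of how longest_suffix and continuations compose, assuming the signatures above (the counts shown are for this toy corpus):

model = Infinigram([1, 2, 3, 4, 2, 3, 5])
context = [2, 3]
position, length = model.longest_suffix(context)
counts = model.continuations(context, position, length)
# e.g. Counter({4: 1, 5: 1}): [2, 3] is followed once by 4 and once by 5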

update(new_tokens)

Add new tokens to corpus and rebuild suffix array.

Parameters:

  • new_tokens: List of new token IDs to add

🧪 Testing

The package includes 36 comprehensive tests:

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=infinigram --cov-report=html

# Run specific test categories
pytest tests/test_suffix_array.py    # Suffix array tests
pytest tests/test_infinigram.py      # Infinigram tests
pytest tests/test_integration.py     # Integration tests

Test coverage:

  • ✅ Suffix array construction and queries
  • ✅ Longest suffix matching
  • ✅ Continuation probability computation
  • ✅ Prediction with smoothing
  • ✅ Confidence scoring
  • ✅ Dynamic corpus updates
  • ✅ Edge cases (empty corpus, long contexts)
  • ✅ Integration scenarios (Wikipedia, code completion)


🔬 Use Cases

Code Completion

# Train on source code (tokenize_source_files and tokenize below are
# placeholder helpers, not part of the package)
code_corpus = tokenize_source_files("src/**/*.py")
model = Infinigram(code_corpus, max_length=50)

# Complete code
context = tokenize("def factorial(n):")
suggestions = model.predict(context, top_k=5)

Text Generation

# Train on text corpus
text_corpus = tokenize_books(["book1.txt", "book2.txt"])
model = Infinigram(text_corpus, max_length=20)

# Generate text
context = tokenize("Once upon a time")
next_word = sample_from_distribution(model.predict(context))
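
sample_from_distribution is not part of the package; a minimal version (assuming predict returns a token-to-probability dict, as shown earlier) could be:

import random

def sample_from_distribution(probs):
    # probs: {token_id: probability}, as returned by model.predict()
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]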

Autocomplete

# Train on user queries
query_corpus = tokenize_queries(user_queries)
model = Infinigram(query_corpus, max_length=10)

# Suggest completions
partial_query = tokenize("how to make")
completions = model.predict(partial_query, top_k=10)

๐Ÿค Integration

Infinigram can be used standalone or integrated with other systems:

With LangCalc

# Use as a language model in the LangCalc framework
from langcalc.models import LanguageModel
from infinigram import Infinigram

# Infinigram implements the LanguageModel interface
wiki = Infinigram(wikipedia_corpus, max_length=20)
news = Infinigram(news_corpus, max_length=15)

# Compose with other models (llm here is a placeholder for any other
# LanguageModel instance in the framework)
model = 0.7 * llm + 0.2 * wiki + 0.1 * news

Custom Tokenization

from infinigram import Infinigram

class TextInfinigram:
    def __init__(self, texts, max_length=20):
        # Build vocabulary
        self.vocab = {}
        corpus = []
        for text in texts:
            for token in self.tokenize(text):
                if token not in self.vocab:
                    self.vocab[token] = len(self.vocab)
                corpus.append(self.vocab[token])

        self.id_to_token = {v: k for k, v in self.vocab.items()}
        self.model = Infinigram(corpus, max_length=max_length)

    def tokenize(self, text):
        return text.lower().split()

    def predict_text(self, context_text):
        context = [self.vocab[t] for t in self.tokenize(context_text) if t in self.vocab]
        probs = self.model.predict(context)
        return {self.id_to_token[tid]: p for tid, p in probs.items()}
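
Example usage of the wrapper above (exact probabilities depend on smoothing):

tm = TextInfinigram([
    "the cat sat on the mat",
    "the dog sat on the rug",
])
print(tm.predict_text("the dog"))
# "sat" should carry most of the probability mass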

🔄 Comparison with Traditional N-grams

Feature        | Traditional N-gram        | Infinigram
---------------|---------------------------|-------------------------------------
Pattern Length | Fixed (n)                 | Variable (1 to max_length)
Memory         | O(V^n) exponential        | O(corpus_size) linear
Query Time     | O(1) hash lookup          | O(m log n) suffix search
Context Usage  | Last n-1 tokens           | Longest matching suffix
Updates        | Fast (hash insert)        | Slow (rebuild suffix array)
Best For       | Frequent updates, small n | Large patterns, static/batch updates

📈 Roadmap

  • Incremental suffix array updates (avoid full rebuild)
  • Compressed suffix arrays for large corpora
  • Parallel suffix array construction
  • GPU acceleration for batch predictions
  • Integration with popular NLP libraries
  • Pre-trained models for common domains
  • Support for character-level and subword tokenization
  • Streaming corpus updates

๐Ÿค Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE for details.


🙏 Acknowledgments

Infinigram was originally developed as part of the LangCalc project, which explores algebraic frameworks for language model composition.

📚 References

  • Manber, U., & Myers, G. (1993). "Suffix arrays: a new method for on-line string searches". SIAM Journal on Computing, 22(5).
  • Variable-length n-grams in language modeling
  • Efficient text indexing and retrieval

Built with ❤️ for efficient language modeling
