Infinigram Documentation

Welcome to the Infinigram documentation! Infinigram is a high-speed, corpus-based language model that uses suffix arrays for variable-length n-gram pattern matching.

Version: 0.4.0

What is Infinigram?

Unlike traditional neural language models or fixed-order n-gram models, Infinigram:

  • Trains instantly: Models are corpora (no gradient descent needed)
  • Finds variable-length patterns: Automatically uses longest matching context
  • Provides exact matching: Every prediction traces back to actual corpus occurrences
  • Runs extremely fast: Orders of magnitude faster than neural inference
  • Enables LLM grounding: Can be mixed with neural LM probabilities for domain adaptation
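The "longest matching context" idea can be sketched with a naive scan (illustration only; Infinigram uses suffix arrays to make this lookup fast):

```python
def longest_suffix_match(corpus: bytes, context: bytes):
    """Return (suffix_length, positions) for the longest suffix of the
    context that occurs anywhere in the corpus.

    Naive O(n*m) scan for illustration; the library's suffix-array index
    answers the same question in logarithmic time.
    """
    for length in range(len(context), 0, -1):
        suffix = context[-length:]
        positions = [
            i for i in range(len(corpus) - len(suffix) + 1)
            if corpus[i:i + len(suffix)] == suffix
        ]
        if positions:
            return length, positions
    return 0, []

corpus = b"the cat sat on the mat"
# Longest suffix of b"on the cat" found in the corpus is b"the cat" (length 7)
length, positions = longest_suffix_match(corpus, b"on the cat")
```

Predictions are then read off from the bytes that follow each matched position, which is why every prediction is attributable to concrete corpus occurrences.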

Key Features

Runtime Query Transforms

Handle out-of-distribution (OOD) data through runtime query transformations:

  • Case normalization: lowercase, uppercase, casefold
  • Whitespace normalization: strip, normalize_whitespace
  • Sequential composition: Apply multiple transforms in order
  • Beam search: Explore transform combinations with predict_search()
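Sequential composition simply applies each named transform to the query bytes, left to right. A minimal sketch (the lambdas below are illustrative stand-ins; the real transform registry lives inside the library):

```python
# Illustrative stand-ins for the built-in transform names.
TRANSFORMS = {
    "lowercase": lambda b: b.lower(),
    "uppercase": lambda b: b.upper(),
    "casefold": lambda b: b.decode("utf-8").casefold().encode("utf-8"),
    "strip": lambda b: b.strip(),
}

def apply_transforms(query: bytes, names):
    """Apply the named transforms to the query, in order."""
    for name in names:
        query = TRANSFORMS[name](query)
    return query

apply_transforms(b"  The Cat ", ["lowercase", "strip"])  # b"the cat"
```

Because transforms are pure byte-to-byte functions, any composition of them can be tried per-call without touching the index.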

Multi-Length Suffix Matching

  • find_all_suffix_matches(): Find all matching suffixes at different lengths with corpus positions
  • predict_weighted(): Combine predictions from multiple suffix lengths with configurable weighting
  • predict_backoff(): Stupid Backoff smoothing algorithm (Brants et al., 2007)
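The Stupid Backoff scheme (Brants et al., 2007) can be sketched as: score with the longest matching suffix, and multiply by the backoff factor each time you fall back to a shorter one. The naive counting below is for illustration; the library answers the same counts via its suffix-array index:

```python
from collections import Counter

def stupid_backoff(corpus: bytes, context: bytes, backoff_factor: float = 0.4):
    """Return (unnormalized) next-byte scores via Stupid Backoff.

    Illustrative naive scan, not the library's implementation.
    """
    penalty = 1.0
    for length in range(len(context), 0, -1):
        suffix = context[-length:]
        # Count the byte that follows each occurrence of this suffix
        following = Counter(
            corpus[i + len(suffix)]
            for i in range(len(corpus) - len(suffix))
            if corpus[i:i + len(suffix)] == suffix
        )
        if following:
            total = sum(following.values())
            return {b: penalty * c / total for b, c in following.items()}
        penalty *= backoff_factor  # shorter match -> penalized score
    return {}

scores = stupid_backoff(b"the cat sat on the mat", b"the cat")
```

Because scores are discounted rather than renormalized, Stupid Backoff is cheap and works well at corpus scale, at the cost of not producing a true probability distribution.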

Interactive REPL

Unix-style navigation and model management:

  • pwd, cd, ls: Navigate between models
  • predict, complete: Make predictions
  • Projection-based augmentation

REST API (OpenAI-compatible)

Full-featured REST API with introspection endpoints:

  • /v1/completions: Generate text completions
  • /v1/predict: Get next-byte predictions
  • /v1/suffix_matches: Find all suffix matches
  • /v1/confidence: Get confidence scores
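An OpenAI-style completions request might look like the following sketch (the endpoint path comes from the list above; the host, port, and model name are assumptions for illustration, so adjust them to your deployment):

```python
import json
from urllib import request

# Hypothetical local server address -- adjust to your deployment.
BASE_URL = "http://localhost:8000"

payload = {
    "model": "infinigram",  # model name here is an assumption
    "prompt": "the cat",
    "max_tokens": 16,
}

req = request.Request(
    f"{BASE_URL}/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment with a running server
```

Because the API follows the OpenAI completions shape, existing OpenAI-compatible clients should be able to point at the server with only a base-URL change.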

Quick Start

from infinigram import Infinigram

# Create model from corpus
corpus = b"the cat sat on the mat"
model = Infinigram(corpus, max_length=10)

# Predict next token
context = b"the cat"
probs = model.predict(context)
print(probs)  # {115: 0.657, 97: 0.330, ...}  # 's' (sat), 'a' (at)

With Runtime Transforms

from infinigram import Infinigram

# Set default transforms at model creation
model = Infinigram(corpus, default_transforms=['lowercase'])

# Handles case variations automatically
context = b"The Cat"  # mixed case
probs = model.predict(context)

# Or specify transforms per-call
probs = model.predict(b"THE CAT", transforms=['lowercase', 'strip'])

# Beam search over transform combinations
probs = model.predict_search(context, search=['lowercase', 'casefold'])

With Backoff Smoothing

# Stupid Backoff: uses longest match if confident, backs off with penalty
probs = model.predict_backoff(b"the cat", backoff_factor=0.4)

# Find all suffix matches at different lengths
matches = model.find_all_suffix_matches(b"the cat")
# Returns: [(7, [pos1, pos2]), (3, [pos3, ...]), ...]  # (length, positions)

Documentation Sections

User Guides

Features

Development

Installation

pip install -e .

For development:

pip install -e .[dev]

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=infinigram --cov-report=html

# Run specific test file
pytest tests/test_infinigram.py

Use Cases

1. Domain-Specific Grounding

Mix Infinigram probabilities with LLM probabilities to ground outputs in specific corpora:

llm_probs = llm.predict(context)
corpus_probs = infinigram.predict(context)
final_probs = 0.7 * llm_probs + 0.3 * corpus_probs
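Since predictions are distributions over next bytes (dicts keyed by byte value), the mixing above needs an element-wise interpolation over the union of keys. A minimal sketch (mix_probs is a hypothetical helper, not part of the library):

```python
def mix_probs(llm_probs: dict, corpus_probs: dict, alpha: float = 0.7):
    """Linearly interpolate two next-byte distributions:
    alpha * llm + (1 - alpha) * corpus, over the union of keys.
    Hypothetical helper for illustration.
    """
    keys = set(llm_probs) | set(corpus_probs)
    return {
        k: alpha * llm_probs.get(k, 0.0) + (1 - alpha) * corpus_probs.get(k, 0.0)
        for k in keys
    }

mixed = mix_probs({97: 0.6, 98: 0.4}, {97: 0.2, 99: 0.8})
# approximately {97: 0.48, 98: 0.28, 99: 0.24}
```

If both inputs sum to 1, the interpolated distribution does too, so it can be sampled from directly.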

2. Fast Pattern Matching

Orders of magnitude faster than neural inference for pattern-based predictions.

3. Exact Source Attribution

Every prediction traces back to actual corpus occurrences.

4. Zero-Shot Domain Adaptation

No training required: just point the model at a corpus.

Architecture

See Architecture for detailed system design.

Contributing

Contributions welcome! See our comprehensive Test Strategy for testing guidelines.

License

[License information here]

Citation

@software{infinigram2024,
  title={Infinigram: Variable-Length N-gram Language Model},
  author={Towell, Alex},
  year={2024},
  url={https://github.com/queelius/infinigram}
}