Infinigram Architecture & Vision

Version: 0.4.0 | Date: December 2025 | Status: Production

Vision

Infinigram is a high-speed, corpus-based language model that leverages suffix arrays for variable-length n-gram matching. Unlike traditional neural LMs, Infinigram provides:

  • Instant training: Models are corpora (no gradient descent)
  • Exact matching: Finds actual patterns from training data
  • Explainability: Every prediction traces back to corpus evidence
  • Speed: Orders of magnitude faster than neural inference
  • LLM grounding: Weighted mixing of corpus predictions with neural LM next-token probabilities

Core Use Cases

1. LLM Fine-tuning via Probability Mixing

# Weighted mixture of neural LM and corpus-based predictions
final_probs = 0.7 * llm.predict(context) + 0.3 * infinigram.predict(context)

Benefits:

  • Ground LLM outputs in specific corpora (technical docs, legal text, etc.)
  • Boost domain-specific vocabulary without expensive fine-tuning
  • Reduce hallucinations by anchoring to real text
  • Real-time adaptation without retraining
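
As a minimal sketch of this mixing step (assuming both models expose a predict(context) that returns a token-to-probability dict; the mix_probs helper and the 0.7/0.3 weights are illustrative, not part of the API):

# Illustrative only: linear interpolation of two next-token distributions,
# followed by renormalization over the union of their vocabularies.
from typing import Dict

def mix_probs(llm_probs: Dict[int, float],
              corpus_probs: Dict[int, float],
              llm_weight: float = 0.7) -> Dict[int, float]:
    """Interpolate LLM and corpus-based distributions, then renormalize."""
    corpus_weight = 1.0 - llm_weight
    mixed = {}
    for token in set(llm_probs) | set(corpus_probs):
        mixed[token] = (llm_weight * llm_probs.get(token, 0.0)
                        + corpus_weight * corpus_probs.get(token, 0.0))
    total = sum(mixed.values()) or 1.0
    return {token: p / total for token, p in mixed.items()}

final_probs = mix_probs(llm.predict(context), infinigram.predict(context))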

2. Multi-Corpus Models

# Load multiple specialized corpora
infinigram serve \
  --corpus wikipedia:/data/wiki.bin \
  --corpus shakespeare:/data/shakespeare.bin \
  --corpus python-docs:/data/python-stdlib.bin \
  --port 8000

3. Projection-Based Matching

Beyond simple longest suffix matching, support:

  • Input projections: Transform query context to find better matches (e.g., lemmatization, semantic clustering)
  • Hierarchical matching: Weight contributions from multiple suffix lengths
  • Output projections: Map predicted tokens to target vocabulary

Architectural Principles

1. Clean API Layers

┌─────────────────────────────────────┐
│         CLI & Shell                 │  User-facing commands + REPL
├─────────────────────────────────────┤
│         REST API                    │  HTTP endpoints (OpenAI-compatible)
├─────────────────────────────────────┤
│      Python API (Core)              │  Core Infinigram class
├─────────────────────────────────────┤
│    Suffix Array Engine              │  Pattern matching primitives
└─────────────────────────────────────┘

2. REST API Design (OpenAI-Compatible)

Completions Endpoint

POST /v1/completions
Content-Type: application/json

{
  "model": "wikipedia",
  "prompt": "The capital of France is",
  "max_tokens": 10,
  "temperature": 1.0,
  "top_k": 50
}

Response:
{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1697558400,
  "model": "wikipedia",
  "choices": [{
    "text": " Paris",
    "index": 0,
    "logprobs": {...},
    "finish_reason": "stop",
    "metadata": {
      "match_length": 4,
      "confidence": 0.89,
      "corpus_position": 1234567
    }
  }]
}
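
A minimal Python client for the completions endpoint might look like the following sketch, assuming a local server on port 8000 and the requests library; the field accesses mirror the example response above:

# Illustrative client call; server address and model name are assumptions.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "wikipedia",
        "prompt": "The capital of France is",
        "max_tokens": 10,
        "temperature": 1.0,
        "top_k": 50,
    },
)
response.raise_for_status()
choice = response.json()["choices"][0]
print(choice["text"], choice["metadata"]["match_length"])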

Chat Endpoint

POST /v1/chat/completions
Content-Type: application/json

{
  "model": "python-docs",
  "messages": [
    {"role": "user", "content": "How do I read a file in Python?"}
  ],
  "max_tokens": 100
}

Models Management

GET /v1/models
POST /v1/models/load
DELETE /v1/models/{model_id}
GET /v1/models/{model_id}/stats

3. Python API (Enhanced Core)

from infinigram import Infinigram, InfinigramServer

# Basic usage (unchanged for backward compatibility)
model = Infinigram(corpus, max_length=20)
probs = model.predict(context, top_k=10)

# Enhanced: Multi-length matching with weights
probs = model.predict_weighted(
    context,
    min_length=1,
    max_length=10,
    weight_fn=lambda length: length ** 2  # Quadratic weighting
)

# Projection-based matching
probs = model.predict_projected(
    context,
    input_projection="lemmatize",
    output_projection="top_frequent_10k"
)

# Corpus management
model.add_corpus(new_texts, corpus_id="technical_docs")
model.remove_corpus("old_corpus")

# Model serving
server = InfinigramServer(port=8000)
server.add_model("wiki", corpus_path="wiki.bin")
server.add_model("code", corpus_path="github.bin", max_length=50)
server.start()

4. CLI Design

# Training (corpus building)
infinigram build wikipedia.txt -o wikipedia.igram --max-length 20
infinigram build *.txt -o combined.igram --merge

# Serving
infinigram serve wikipedia.igram --port 8000
infinigram serve wikipedia.igram code.igram --port 8000

# Interactive shell
infinigram shell wikipedia.igram
> load shakespeare.igram as shakespeare
> predict "to be or not to"
> set max_length 15
> set weight_fn quadratic
> stats wikipedia
> exit

# One-shot predictions
infinigram predict wikipedia.igram "The capital of"
infinigram complete --model wiki --text "Once upon a" --max-tokens 50

# Model inspection
infinigram info wikipedia.igram
infinigram stats wikipedia.igram
infinigram search wikipedia.igram "machine learning"

5. Shell (Stateful REPL)

# Interactive shell with state management
$ infinigram shell

infinigram> load wikipedia.igram as wiki
Loaded: wiki (125M tokens, max_length=20)

infinigram> load shakespeare.igram as shakespeare
Loaded: shakespeare (884K tokens, max_length=15)

infinigram> models
- wiki: 125M tokens, max_length=20
- shakespeare: 884K tokens, max_length=15

infinigram> use wiki
Active model: wiki

infinigram> predict "The capital of France is"
Top predictions:
  Paris (0.856) ████████████████████████████
  located (0.089) ███
  situated (0.034) 

infinigram> set temperature 0.5
infinigram> set top_k 20

infinigram> match-info "The capital of France is"
Longest match: length=4 ("capital of France is")
Position: 1234567
Context: "...The capital of France is Paris, and it is..."
Confidence: 0.89

infinigram> history
1. predict "The capital of France is"
2. match-info "The capital of France is"

infinigram> export history results.json
infinigram> exit

Advanced Features

1. Hierarchical Suffix Weighting

Instead of using only the longest match, combine predictions from multiple suffix lengths:

# P(next | context) = Σ w(k) * P(next | suffix_k)
# where suffix_k is the k-length suffix match

def weight_function(match_length, max_length):
    """Weight longer matches more heavily."""
    return (match_length / max_length) ** 2

probs = model.predict_hierarchical(
    context,
    min_length=1,
    max_length=10,
    weight_fn=weight_function
)

2. Input Projections

Transform input context to find better matches:

from typing import List

class InputProjection:
    """Transform context before suffix matching."""

    def lemmatize(self, tokens: List[int]) -> List[int]:
        """Reduce tokens to lemmas."""
        pass

    def semantic_cluster(self, tokens: List[int]) -> List[int]:
        """Map to semantic cluster IDs."""
        pass

    def drop_stopwords(self, tokens: List[int]) -> List[int]:
        """Remove common stopwords."""
        pass

# Usage
model.predict(context, input_projection="lemmatize")
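
As a purely illustrative sketch of one concrete input projection, assuming token IDs can be mapped back to surface strings via a hypothetical id_to_token lookup (the stopword list is also illustrative):

# Sketch only: drop stopword tokens before suffix matching.
from typing import Dict, List

STOPWORDS = {"the", "a", "an", "of", "is", "and", "to"}

def drop_stopwords(tokens: List[int], id_to_token: Dict[int, str]) -> List[int]:
    """Remove common stopwords so shorter, more distinctive suffixes can match."""
    return [t for t in tokens if id_to_token.get(t, "").lower() not in STOPWORDS]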

3. Output Projections

Filter or transform predicted tokens:

from typing import Dict

class OutputProjection:
    """Filter/transform output predictions."""

    def top_k_frequent(self, probs: Dict[int, float], k: int) -> Dict[int, float]:
        """Restrict to k most frequent vocabulary tokens."""
        pass

    def domain_filter(self, probs: Dict[int, float], domain: str) -> Dict[int, float]:
        """Only allow domain-specific vocabulary."""
        pass

# Usage
model.predict(context, output_projection="top_frequent_10k")
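
A minimal sketch of the frequency filter, assuming a precomputed set of the most frequent vocabulary IDs is supplied by the caller (frequent_ids is an assumed input, not an existing Infinigram structure):

# Sketch only: keep predictions within a frequent-vocabulary set, renormalize.
from typing import Dict, Set

def top_k_frequent(probs: Dict[int, float],
                   frequent_ids: Set[int]) -> Dict[int, float]:
    """Restrict predictions to frequent vocabulary tokens and renormalize."""
    kept = {token: p for token, p in probs.items() if token in frequent_ids}
    total = sum(kept.values()) or 1.0
    return {token: p / total for token, p in kept.items()}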

4. Multi-Scale Matching

# Combine evidence from different granularities
model = MultiScaleInfinigram([
    ("char", char_corpus, 100),      # max_length=100
    ("subword", bpe_corpus, 50),     # max_length=50
    ("word", word_corpus, 20)        # max_length=20
])

# Automatically blends predictions across scales
probs = model.predict(context, scales=["word", "subword"])

5. Corpus Versioning & Hot-Swapping

server = InfinigramServer()

# Load initial corpus
server.add_model("v1", corpus_v1)

# Later: hot-swap without downtime
server.update_model("v1", corpus_v2)  # Atomic replacement

# A/B testing
server.add_model("experimental", corpus_exp)
probs_control = server.predict("v1", context)
probs_exp = server.predict("experimental", context)

Implementation Roadmap

Phase 1: Core API ✅ Complete

  • Remove LangCalc dependencies
  • Fix test imports
  • Unified Infinigram class with runtime transforms
  • Add predict_weighted() for multi-length matching
  • Add predict_backoff() for Stupid Backoff smoothing
  • Add find_all_suffix_matches() for introspection
  • Comprehensive unit tests (429 tests, 93% coverage)

Phase 2: REST API Server ✅ Complete

  • FastAPI-based REST server
  • OpenAI-compatible /v1/completions endpoint
  • Model loading/unloading endpoints
  • Introspection endpoints (/v1/predict, /v1/suffix_matches, /v1/confidence)
  • Backoff smoothing endpoint (/v1/predict_backoff)
  • Authentication & rate limiting
  • Streaming responses
  • Docker container

Phase 3: CLI & Shell ✅ Complete

  • Entry points for REPL and server
  • infinigram-serve for starting server
  • infinigram-repl for interactive REPL
  • Unix-style navigation (pwd, cd, ls)
  • Tab completion, history
  • Click-based CLI with subcommands
  • infinigram build for corpus creation

Phase 4: Advanced Matching ✅ Partial

  • Hierarchical suffix weighting (predict_weighted)
  • Configurable weight functions (linear, quadratic, exponential, sigmoid)
  • Runtime query transforms (lowercase, uppercase, casefold, strip, normalize_whitespace)
  • Input projections (lemmatization, semantic)
  • Output projections (filtering, mapping)
  • Multi-scale matching (char/subword/word)

Phase 5: Performance & Scale ✅ Partial

  • Binary search for suffix array queries (O(m log n))
  • Memory-mapped corpus files via pydivsufsort
  • Compressed suffix arrays
  • Parallel construction
  • GPU acceleration for batch inference

Phase 6: Ecosystem & Integration

  • Pre-built corpus packages (Wikipedia, Common Crawl, etc.)
  • Tokenizer compatibility layer for popular models (GPT, Llama)
  • LangChain/LlamaIndex integration
  • Hugging Face integration
  • Evaluation benchmarks

File Structure (Target)

infinigram/
├── infinigram/
│   ├── __init__.py
│   ├── core/
│   │   ├── infinigram.py           # Core model class
│   │   ├── suffix_array.py         # Suffix array engine
│   │   ├── projections.py          # Input/output projections
│   │   └── weighting.py            # Weighting functions
│   ├── server/
│   │   ├── api.py                  # FastAPI app
│   │   ├── models.py               # Model management
│   │   ├── auth.py                 # Authentication
│   │   └── streaming.py            # Streaming responses
│   ├── cli/
│   │   ├── main.py                 # Click CLI entry point
│   │   ├── build.py                # Corpus building
│   │   ├── serve.py                # Server management
│   │   ├── predict.py              # One-shot inference
│   │   └── shell.py                # Interactive REPL
│   └── utils/
│       ├── tokenizer.py            # Tokenization utilities
│       ├── corpus.py               # Corpus I/O
│       └── serialization.py        # Model serialization
├── tests/
│   ├── test_core/
│   ├── test_server/
│   ├── test_cli/
│   └── test_integration/
├── docs/
└── benchmarks/

Design Principles

1. Speed First

Infinigram's killer feature is speed. Every design decision should preserve this:

  • Pre-computed suffix arrays (no online construction)
  • Memory-mapped corpora for large datasets
  • Avoid Python loops in hot paths (use NumPy/Cython)
  • Batch operations where possible

2. Simplicity & Composability

  • Unix philosophy: do one thing well (pattern matching + prediction)
  • Easy to compose with other models (mixture weights)
  • Clean separation: core logic, server, CLI

3. Explainability

Every prediction should be traceable:

  • Return corpus positions of matches
  • Show the actual text context
  • Report confidence scores based on match quality

4. Backward Compatibility

  • Maintain existing Infinigram API for current users
  • Deprecation warnings before breaking changes
  • Versioned REST API (/v1/, /v2/)

Ideas for Sample Efficiency

1. Fuzzy Matching

  • Allow 1-2 token substitutions in suffix matching
  • Use edit distance to find "close enough" matches
  • Weight by similarity score
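
A purely illustrative sketch of the scoring idea (not an existing Infinigram feature): token-level edit distance between the query suffix and a candidate corpus context, mapped to a similarity weight in (0, 1]:

# Sketch only: candidate contexts would come from the suffix array engine.
from typing import List

def edit_distance(a: List[int], b: List[int]) -> int:
    """Token-level Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete
                            curr[j - 1] + 1,          # insert
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]

def similarity_weight(query_suffix: List[int], candidate: List[int]) -> float:
    """Identical contexts weigh 1.0; each edit reduces the weight."""
    return 1.0 / (1 + edit_distance(query_suffix, candidate))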

2. Semantic Clustering

  • Cluster tokens by embeddings
  • Match on cluster IDs instead of exact tokens
  • Find longer "semantic suffixes"

3. Frequency-Based Fallbacks

  • When no long match found, use shorter matches from high-frequency contexts
  • Weight by corpus frequency (common phrases matter more)

4. Context Expansion

  • Look for matches in expanded window (e.g., bag-of-words nearby)
  • Find non-contiguous matches

5. Hybrid Neural-Symbolic

  • Use neural encoder for context → embedding
  • Nearest neighbor search in embedding space for similar corpus contexts
  • Use those contexts' continuations
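
An illustrative sketch of the retrieval step only, assuming a query embedding from some external encoder and a precomputed matrix of corpus-context embeddings (neither exists in Infinigram today):

# Sketch only: cosine-similarity nearest neighbors over context embeddings.
import numpy as np

def nearest_context_indices(query_emb: np.ndarray,
                            context_embs: np.ndarray,
                            k: int = 5) -> np.ndarray:
    """Return indices of the k corpus contexts most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = context_embs / np.linalg.norm(context_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]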

Performance Targets

  • Construction: 1M tokens/second
  • Query latency: <10ms for 100-token context
  • Throughput: 1000+ queries/second on single CPU
  • Memory: <10 bytes per corpus token
  • Scaling: 1B+ token corpora

Success Metrics

  1. API adoption: Used in 10+ downstream projects
  2. Performance: 100x faster than neural LM inference
  3. Accuracy: Competitive perplexity on domain-specific corpora
  4. LLM improvement: Measurable reduction in hallucinations when mixed with LLMs
  5. Ease of use: New model trained and deployed in <5 minutes