
# LangCalc: A Calculus for Language Models

**An elegant mathematical framework for compositional language modeling** [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![Tests](https://img.shields.io/badge/tests-299%20passing-brightgreen.svg)](https://github.com/queelius/langcalc/tree/master/tests) [![Coverage](https://img.shields.io/badge/coverage-95%25-green.svg)](https://github.com/queelius/langcalc) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/queelius/langcalc/blob/master/LICENSE)

## What is LangCalc?

LangCalc introduces an algebraic framework for language model composition that treats models as first-class mathematical objects. Its key innovation is **lightweight grounding**: mixing a Large Language Model (LLM) with suffix array-based pattern matching (an *infinigram*) at just 5% weight to achieve a 70% perplexity reduction.

```python
# Express sophisticated models as elegant algebra
from langcalc import Infinigram
from langcalc.models import HuggingFaceModel
from langcalc.algebra import LongestSuffixTransform

# Create an infinigram from a Wikipedia corpus (a token sequence)
wiki = Infinigram(wikipedia_corpus, max_length=20)

# Compose with an LLM (llm: any LangCalc model, e.g. a HuggingFaceModel)
model = 0.95 * llm + 0.05 * wiki

# Or ground via the algebra module (sa: a suffix array over the corpus)
grounded = llm + (wiki << LongestSuffixTransform(sa))
```

## Key Features

### Mathematical Elegance

- **10+ algebraic operators**: `+`, `*`, `|`, `&`, `^`, `**`, `>>`, `<<`, `~`
- **Context transformations**: LongestSuffix, MaxKWords, RecencyWeight
- **Proven properties**: associativity, distributivity, composability (illustrated below)
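
For instance, weighted mixtures compose the same way regardless of grouping. A minimal sketch using the toy corpus from the Quick Start below (assuming `predict` returns a next-token probability table; the two groupings agree up to floating-point rounding):

```python
from langcalc import Infinigram, NGramModel

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
a = Infinigram(corpus, max_length=10)
b = NGramModel(corpus, n=2)
c = NGramModel(corpus, n=3)

# Associativity of mixtures: both groupings define the same distribution,
# P(x | ctx) = 0.5*P_a(x | ctx) + 0.3*P_b(x | ctx) + 0.2*P_c(x | ctx).
left = (0.5 * a + 0.3 * b) + 0.2 * c
right = 0.5 * a + (0.3 * b + 0.2 * c)

ctx = [2, 3]
print(left.predict(ctx))
print(right.predict(ctx))
```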

### Production Ready

- **299 comprehensive tests** with a 100% pass rate
- **95% code coverage** on the core algebraic framework
- **Tested with real LLMs** (Ollama integration)
- **Minimal overhead**: only 6.5% (2.66 ms) with real models

### Efficient Pattern Matching

- **34x more memory-efficient** than n-gram hash tables
- **O(m log n) query time** via binary search (sketched below)
- **Variable-length patterns** without pre-computing n
- **Incremental updates** for streaming data
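
The query bound comes from binary-searching the pattern against the lexicographically sorted suffixes: each of the O(log n) probes costs one O(m) comparison. A generic sketch of the idea (a toy standalone implementation, not LangCalc's internals):

```python
def build_suffix_array(s):
    # Toy O(n^2 log n) construction; real builds use faster algorithms.
    return sorted(range(len(s)), key=lambda i: s[i:])

def suffix_range(s, sa, pat):
    # Binary search over sorted suffixes. Each comparison slices at most
    # len(pat) characters, so a query costs O(m log n).
    def bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = s[sa[mid]:sa[mid] + len(pat)]
            if prefix < pat or (strict and prefix == pat):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(False), bound(True)

text = "banana"
sa = build_suffix_array(text)
lo, hi = suffix_range(text, sa, "ana")
print(hi - lo)  # -> 2 occurrences of "ana"
```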

### Projection System (NEW)

A rigorous mathematical framework for context transformation and corpus augmentation:

- **Projections**: query-time context transformations (\(\pi: \Sigma^* \times 2^{\Sigma^*} \to \Sigma^*\))
- **Augmentations**: training-time corpus expansion (\(\alpha: 2^{\Sigma^*} \to 2^{\Sigma^*}\))
- **Duality theorem**: trade space for time via augmentation (sketched below)
- **Composition algebra**: build complex pipelines from simple parts
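
One informal reading of the duality (a sketch only; the academic paper gives the precise statement): a model trained on the augmented corpus answers raw queries the way the base model answers projected queries,

\[
M_{\alpha(C)}(x \mid c) \;=\; M_{C}\bigl(x \mid \pi(c, C)\bigr),
\]

so the transformation can be paid for once at training time (space, via \(\alpha\)) or on every query (time, via \(\pi\)).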

## Quick Start

### Installation

```bash
# Install from source (development mode)
git clone https://github.com/queelius/langcalc.git
cd langcalc
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"

# Or with experiment dependencies
pip install -e ".[experiments]"
```

### Your First Model

```python
from langcalc import Infinigram, NGramModel

# Create an infinigram model
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Predict next token
context = [2, 3]
probs = model.predict(context)  # Variable-length suffix matching

# Compose models
ngram = NGramModel(corpus, n=3)
mixture = 0.7 * model + 0.3 * ngram
```
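
To see what variable-length matching buys you (an illustrative aside): with context `[6, 2, 3]`, the infinigram can condition on the full three-token suffix, which occurs exactly once in the corpus, while the trigram model only ever sees the last two tokens:

```python
ctx = [6, 2, 3]
# [6, 2, 3] occurs once in the corpus (followed by 4), so the infinigram
# conditions on all three tokens; the trigram model conditions only on
# [2, 3], which occurs three times (followed by 4, 5, and 4).
print(model.predict(ctx))
print(ngram.predict(ctx))
```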

See the Quick Start Guide for more examples.


## Core Concepts

### Algebraic Framework

LangCalc treats language models as algebraic objects supporting mathematical operations:

```python
# Arithmetic combinations
ensemble = 0.7 * llm + 0.2 * wiki + 0.1 * ngram

# Set operations
best_of = llm | wiki       # max probability
conservative = llm & wiki  # min probability

# Temperature scaling
creative = model ** 1.5  # Higher temperature
focused = model ** 0.5   # Lower temperature
```
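
A common reading of the temperature operator (an assumption here; see the API reference for LangCalc's exact definition) is exponentiate-and-renormalize:

\[
P_T(x \mid c) \;=\; \frac{P(x \mid c)^{1/T}}{\sum_{x'} P(x' \mid c)^{1/T}},
\]

so \(T > 1\) flattens the distribution (more creative) and \(T < 1\) sharpens it (more focused).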

### Context Transformations

Transform context before model prediction:

```python
from langcalc.algebra import LongestSuffixTransform, RecencyWeightTransform

# Find longest matching suffix in corpus
grounded = model << LongestSuffixTransform(suffix_array)

# Apply recency weighting
recent = model << RecencyWeightTransform(decay=0.9)

# Chain transformations
pipeline = model << (LongestSuffixTransform(sa) | RecencyWeightTransform(0.9))
```

### Projection System

The projection system enables flexible context transformation:

```python
from langcalc.projections import (
    LowercaseProjection,
    RecencyProjection,
    WhitespaceProjection,
)
from langcalc.models.projected import ProjectedModel

# Create projection pipeline
projection = (
    WhitespaceProjection() >>  # Normalize whitespace
    LowercaseProjection() >>   # Case-insensitive
    RecencyProjection(100)     # Keep recent tokens
)

# Apply to model (base_model: any LangCalc model)
projected_model = ProjectedModel(base_model, projection, corpus)
```

See Core Concepts for detailed explanations.


## Results

| Metric | Value |
|--------|-------|
| Perplexity reduction | 70% |
| Optimal weight | 95% LLM + 5% suffix |
| Memory efficiency | 34x better (1 GB vs 34 GB) |
| Query latency | 0.03 ms (suffix arrays) |
| LLM overhead | 6.5% (2.66 ms) |
| Test coverage | 95% on core modules |

## Documentation Structure

- **Getting Started**: installation, quick start, and core concepts
- **Projection System**: mathematical formalism for context transformation
- **User Guide**: comprehensive guides and examples
- **API Reference**: detailed API documentation
- **Advanced Topics**: suffix arrays, grounding, performance
- **Development**: contributing, testing, code style


## Research Contributions

1. **Unified algebraic framework**: treating language models as algebraic objects with mathematical operations
2. **Lightweight grounding**: minimal-weight (5%) suffix array integration for maximum benefit (70% perplexity reduction)
3. **Suffix array integration**: a scalable alternative to n-grams with 34x memory efficiency
4. **Context transformations**: sophisticated operators for model composition
5. **Projection-augmentation duality**: a novel theorem enabling space-time tradeoffs

See our Academic Paper for the formal treatment.


## Example: Complete Pipeline

Here's a complete example showing LangCalc's power:

```python
from langcalc import Infinigram
from langcalc.models import OllamaModel
from langcalc.projections import (
    EditDistanceProjection,
    LowercaseProjection,
    WhitespaceProjection,
    RecencyProjection
)
from langcalc.models.projected import ProjectedModel

# Load corpus
with open('wikipedia.txt', 'rb') as f:
    corpus = list(f.read())

# Create suffix array model
wiki = Infinigram(corpus, max_length=20)

# Create LLM
llm = OllamaModel(model_name='llama2', base_url='http://localhost:11434')

# Create projection pipeline
projection = (
    EditDistanceProjection(max_distance=1) >>  # Fix typos
    WhitespaceProjection() >>                  # Normalize whitespace
    LowercaseProjection() >>                   # Case-insensitive
    RecencyProjection(max_length=100)          # Keep recent context
)

# Compose models
projected_wiki = ProjectedModel(wiki, projection, corpus)
final_model = 0.95 * llm + 0.05 * projected_wiki

# Use the model
context = list("The capital of France is".encode('utf-8'))
probs = final_model.predict(context)
print(f"Next token probabilities: {probs}")
```

This example demonstrates:

- **Infinigram** for efficient pattern matching
- **Projection pipeline** for robust context transformation
- **Model composition** using algebraic operators
- **Lightweight grounding** with optimal mixing weights

## Citation

If you use LangCalc in your research, please cite:

```bibtex
@article{langcalc-2025,
  title={LangCalc: A Calculus for Compositional Language Modeling with Infinigram Grounding},
  year={2025}
}
```

## License

LangCalc is released under the MIT License.


Ready to get started? Head to the Quick Start Guide or explore the Projection System!