Publication: *An Algebraic Framework for Language Model Composition: Unifying Projections, Mixtures, and Constraints* (2025)
# LangCalc: A Calculus for Language Models

An elegant mathematical framework for composing language models through algebraic operations, featuring efficient suffix array-based grounding (infinigrams) and lightweight model mixing.
## 🎯 Overview

LangCalc introduces an algebraic framework for language model composition that treats models as first-class mathematical objects. Its key innovation is lightweight grounding: mixing a Large Language Model (LLM) with suffix array-based pattern matching (infinigrams) at just 5% weight, which achieves a 70% perplexity reduction in our experiments.
```python
# Express sophisticated models as elegant algebra
from langcalc import Infinigram

# Create an infinigram grounded in a Wikipedia corpus (a token sequence)
wiki = Infinigram(wikipedia_corpus, max_length=20)

# Compose with an LLM (llm: any LanguageModel, e.g. an LLM wrapper
# from langcalc.models)
model = 0.95 * llm + 0.05 * wiki

# Or use the algebra module's context transforms
# (sa: a suffix array built over the corpus)
from langcalc.algebra import LongestSuffixTransform
grounded = llm + (wiki << LongestSuffixTransform(sa))
```
📁 Project Structure
langcalc/ # Main package (NEW in v0.4.0)
├── __init__.py # Public API
├── infinigram.py # Variable-length n-grams
├── algebra.py # Algebraic framework with 10+ operators
├── grounding.py # Lightweight grounding system
├── models/ # Language model implementations
│ ├── base.py # LanguageModel interface
│ ├── ngram.py # N-gram models
│ ├── llm.py # LLM wrappers
│ └── mixture.py # Model composition
├── projections/ # Context transformations
│ ├── recency.py # Recency-based projection
│ ├── semantic.py # Semantic similarity
│ └── edit_distance.py # Edit distance projection
└── data/ # Data structures
├── suffix_array.py # Efficient suffix arrays
└── incremental.py # Incremental updates
examples/ # Usage examples
├── algebra_examples.py
├── comprehensive_experiments.py
└── lightweight_experiments.py
tests/ # Test suite (299 tests)
├── test_unit/ # 262 unit tests
└── test_integration/ # 37 integration tests
notebooks/ # Jupyter demos
├── explore_algebra.ipynb
├── lightweight_grounding_demo.ipynb
└── unified_algebra.ipynb
papers/ # Academic paper (LaTeX)
└── paper.pdf
## 🚀 Quick Start

### Installation

```bash
# Install from source (development mode)
git clone https://github.com/queelius/langcalc.git
cd langcalc
pip install -e .

# Or with development dependencies (quoted so zsh doesn't glob the brackets)
pip install -e ".[dev]"

# Or with experiment dependencies
pip install -e ".[experiments]"
```
### Basic Usage

```python
# Import the package
from langcalc import Infinigram, NGramModel

# Create an infinigram model over a token sequence
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Predict the next token via variable-length suffix matching
context = [2, 3]
probs = model.predict(context)
print(f"Predictions: {probs}")

# Compose models
ngram = NGramModel(corpus, n=3)
mixture = 0.7 * model + 0.3 * ngram
```
### Run Examples

```bash
# Run algebraic examples
python examples/algebra_examples.py

# Run comprehensive experiments
python examples/comprehensive_experiments.py

# Run lightweight grounding experiments
python examples/lightweight_experiments.py

# Interactive notebooks
jupyter notebook notebooks/explore_algebra.ipynb
```
### Run Tests

```bash
# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=langcalc --cov-report=html

# Run specific test categories
pytest tests/test_unit/         # Unit tests only
pytest tests/test_integration/  # Integration tests only
```
## 🔑 Key Features

### Algebraic Operators

- **10+ operators**: `+`, `*`, `|`, `&`, `^`, `**`, `>>`, `<<`, `~`
- **Context transforms**: `LongestSuffix`, `MaxKWords`, `RecencyWeight`
- **Mathematical consistency**: associativity, distributivity, composability
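As a rough sketch of how this style of operator overloading can work (illustrative only, not langcalc's actual internals; all class names here are hypothetical), scalar multiplication can build a weighted term and `+` a mixture over predictions:

```python
# Minimal sketch: models expose predict(context) -> {token: prob},
# and operator overloads compose them into new models.
class Model:
    def predict(self, context):
        raise NotImplementedError

    def __rmul__(self, w):           # 0.7 * model -> weighted term
        return Scaled(w, self)

    def __add__(self, other):        # model + model -> mixture
        return Mixture([self, other])

class Scaled(Model):
    def __init__(self, w, inner):
        self.w, self.inner = w, inner
    def predict(self, context):
        # scale every probability by the mixture weight
        return {t: self.w * p for t, p in self.inner.predict(context).items()}

class Mixture(Model):
    def __init__(self, parts):
        self.parts = parts
    def predict(self, context):
        # sum the (already weighted) component distributions
        out = {}
        for m in self.parts:
            for t, p in m.predict(context).items():
                out[t] = out.get(t, 0.0) + p
        return out

class Constant(Model):
    """A toy model that always returns a fixed distribution."""
    def __init__(self, dist):
        self.dist = dist
    def predict(self, context):
        return dict(self.dist)

m = 0.7 * Constant({"a": 1.0}) + 0.3 * Constant({"a": 0.5, "b": 0.5})
print(m.predict([]))  # {'a': 0.85, 'b': 0.15}
```

Because every composite is itself a `Model`, expressions like `0.95 * llm + 0.05 * wiki` nest arbitrarily, which is what makes the algebra compositional.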
### Suffix Arrays

- 34x more memory-efficient than n-gram hash tables
- O(m log n) query time with binary search
- Variable-length patterns without pre-computing n
### Infinigrams (NEW in v0.4.0)

- Variable-length n-grams with dynamic pattern matching
- Automatic suffix array construction for efficient queries
- O(m log n) query time for longest-suffix matching
- Incremental updates for streaming data
- 36 comprehensive tests covering all functionality
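Conceptually, infinigram prediction backs off from the longest context suffix that occurs in the corpus and predicts from the tokens that follow it. A toy sketch (a linear scan for clarity; the actual `Infinigram` class uses a suffix array for its queries):

```python
from collections import Counter

def infinigram_predict(corpus, context, max_length=10):
    # try the longest context suffix first, backing off one token at a time
    for k in range(min(max_length, len(context)), 0, -1):
        suffix = context[-k:]
        followers = Counter(
            corpus[i + k]
            for i in range(len(corpus) - k)
            if corpus[i:i + k] == suffix
        )
        if followers:
            total = sum(followers.values())
            return {t: c / total for t, c in followers.items()}
    return {}  # no suffix of the context occurs in the corpus

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
# [9, 2, 3] never occurs, so the model backs off to the suffix [2, 3]
print(infinigram_predict(corpus, [9, 2, 3]))  # {4: 0.666..., 5: 0.333...}
```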
### Production Ready

- 299 comprehensive tests with 100% pass rate ✅
- 95% coverage on the core algebraic framework ⭐
- Tested with Ollama integration
- Only 6.5% overhead (2.66 ms) with real LLMs
- 70% perplexity reduction with 5% grounding weight
## 📊 Results
| Metric | Value |
|---|---|
| Package Version | 0.4.0 (Beta) 🎉 |
| Test Suite | 299 tests (all passing) ✅ |
| Test Coverage | 95% on langcalc.algebra ⭐ |
| Infinigram Tests | 36 tests (NEW) ✨ |
| Memory Efficiency | 34x better (1GB vs 34GB) |
| Query Latency | 0.03ms (suffix arrays) |
| Perplexity Reduction | 70% |
| LLM Overhead | 6.5% |
| Optimal Weight | 95% LLM + 5% suffix |
## 📚 Documentation

### Complete Documentation

📖 Read the Full Documentation →
The complete LangCalc documentation includes:
- Getting Started - Installation, quick start, core concepts
- Projection System - Mathematical formalism and implementation
- User Guide - Comprehensive guides and examples
- API Reference - Detailed API documentation
- Advanced Topics - Suffix arrays, grounding, performance
- Development Guide - Contributing, testing, code style
### Quick Links
- Algebraic Design - Complete API reference
- Test Suite - Comprehensive test documentation and coverage
- Academic Paper - Formal treatment
- Examples - Practical usage
- Results Analysis - Experimental findings
- Test Coverage Reports - Detailed coverage analysis
## 🔬 Research Contributions
- Unified Algebraic Framework: Treating language models as algebraic objects
- Lightweight Grounding: Minimal weight (5%) for maximum benefit
- Suffix Array Integration: Scalable alternative to n-grams
- Context Transformations: Sophisticated operators for model composition
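To see why such a small grounding weight can move perplexity so much, consider a toy mixture (the probabilities below are hypothetical, chosen only to illustrate the mechanism):

```python
# When the LLM assigns near-zero probability to the true token, even a
# 5% grounding term lifts it by orders of magnitude in log-probability.
llm_probs    = {"paris": 1e-6, "london": 0.999999}  # LLM confidently wrong
ground_probs = {"paris": 1.0}                       # corpus match is certain

mixed = {
    t: 0.95 * llm_probs.get(t, 0.0) + 0.05 * ground_probs.get(t, 0.0)
    for t in set(llm_probs) | set(ground_probs)
}
print(mixed["paris"])  # ~0.05 instead of 1e-6
```

Perplexity is dominated by exactly these near-zero cases, which is why a 5% weight can produce a large aggregate reduction while leaving the LLM's behavior mostly intact.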
## 📖 Citation

```bibtex
@article{langcalc-2025,
  title={LangCalc: A Calculus for Compositional Language Modeling with Infinigram Grounding},
  year={2025}
}
```
## 🚦 Future Work
- Learnable operator weights
- Automatic composition search
- GPU acceleration
- Distributed suffix arrays
LangCalc - A calculus for language models. Built with mathematical elegance and engineering pragmatism.