active language

langcalc

Started 2024 HTML


Publications

An Algebraic Framework for Language Model Composition: Unifying Projections, Mixtures, and Constraints

(2025)


LangCalc: A Calculus for Language Models

An elegant mathematical framework for composing language models through algebraic operations, featuring efficient suffix array-based grounding (infinigrams) and lightweight model mixing.

🎯 Overview

LangCalc introduces an algebraic framework for language model composition that treats models as first-class mathematical objects. The key innovation is lightweight grounding: a Large Language Model (LLM) is mixed with a suffix array-based pattern matcher (an infinigram), and giving the infinigram just 5% of the mixture weight is reported to reduce perplexity by 70%.

# Express sophisticated models as elegant algebra
from langcalc import Infinigram
from langcalc.algebra import LongestSuffixTransform

# Placeholders that this snippet does not define:
#   wikipedia_corpus - the token sequence of a Wikipedia corpus
#   llm              - an LLM wrapper, e.g. a HuggingFaceModel from langcalc.models
#   sa               - a suffix array (construction not shown)

# Create infinigram from Wikipedia
wiki = Infinigram(wikipedia_corpus, max_length=20)

# Compose with LLM: 95% LLM weight, 5% infinigram weight
model = 0.95 * llm + 0.05 * wiki

# Or use the algebra module
grounded = llm + (wiki << LongestSuffixTransform(sa))
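To make the `0.95 * llm + 0.05 * wiki` expression concrete, here is a minimal, self-contained sketch of how a weighted mixture can be expressed through Python operator overloading. `ToyModel` and its methods are hypothetical stand-ins written for this illustration; they are not LangCalc's classes, and the real implementation may differ.

```python
from typing import Callable, Dict, Sequence

Dist = Dict[int, float]  # next-token id -> probability


class ToyModel:
    """Hypothetical stand-in for a language model (not the LangCalc API)."""

    def __init__(self, predict_fn: Callable[[Sequence[int]], Dist]):
        self._predict_fn = predict_fn

    def predict(self, context: Sequence[int]) -> Dist:
        return self._predict_fn(context)

    def __rmul__(self, weight: float) -> "ToyModel":
        # `weight * model` scales every probability by `weight`.
        return ToyModel(
            lambda ctx: {tok: weight * p for tok, p in self.predict(ctx).items()}
        )

    def __add__(self, other: "ToyModel") -> "ToyModel":
        # `a + b` sums the two (already weighted) distributions token by token.
        def combined(ctx: Sequence[int]) -> Dist:
            out = dict(self.predict(ctx))
            for tok, p in other.predict(ctx).items():
                out[tok] = out.get(tok, 0.0) + p
            return out

        return ToyModel(combined)


# Two toy models with fixed outputs, standing in for an LLM and an infinigram.
llm = ToyModel(lambda ctx: {0: 0.5, 1: 0.5})
wiki = ToyModel(lambda ctx: {1: 1.0})

model = 0.95 * llm + 0.05 * wiki
probs = model.predict([7, 8])
print(probs)
```

Because the weights sum to 1 and each component is a proper distribution, the mixture is again a proper distribution; here token 1 gains probability mass from the grounding model.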

📁 Project Structure

langcalc/                # Main package (NEW in v0.4.0)
├── __init__.py         # Public API
├── infinigram.py       # Variable-length n-grams
├── algebra.py          # Algebraic framework with 10+ operators
├── grounding.py        # Lightweight grounding system
├── models/             # Language model implementations
│   ├── base.py        # LanguageModel interface
│   ├── ngram.py       # N-gram models
│   ├── llm.py         # LLM wrappers
│   └── mixture.py     # Model composition
├── projections/        # Context transformations
│   ├── recency.py     # Recency-based projection
│   ├── semantic.py    # Semantic similarity
│   └── edit_distance.py # Edit distance projection
└── data/               # Data structures
    ├── suffix_array.py # Efficient suffix arrays
    └── incremental.py  # Incremental updates

examples/               # Usage examples
├── algebra_examples.py
├── comprehensive_experiments.py
└── lightweight_experiments.py

tests/                  # Test suite (299 tests)
├── test_unit/         # 262 unit tests
└── test_integration/  # 37 integration tests

notebooks/              # Jupyter demos
├── explore_algebra.ipynb
├── lightweight_grounding_demo.ipynb
└── unified_algebra.ipynb

papers/                 # Academic paper (LaTeX)
└── paper.pdf

🚀 Quick Start

Installation

# Install from source (development mode)
git clone https://github.com/queelius/langcalc.git
cd langcalc
pip install -e .

# Or with development dependencies
# (the quotes stop shells such as zsh from expanding the brackets)
pip install -e ".[dev]"

# Or with experiment dependencies
pip install -e ".[experiments]"

Basic Usage

# Import the package
from langcalc import Infinigram, NGramModel, create_infinigram

# Create an infinigram model
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Predict next token
context = [2, 3]
probs = model.predict(context)  # Variable-length suffix matching
print(f"Predictions: {probs}")

# Compose models
ngram = NGramModel(corpus, n=3)
mixture = 0.7 * model + 0.3 * ngram
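The comment "variable-length suffix matching" can be illustrated with a brute-force sketch: try the longest suffix of the context first, back off to shorter suffixes until one occurs in the corpus, then normalize the counts of the tokens that follow it. The function `longest_suffix_predict` and its unigram fallback are assumptions made for this illustration; `Infinigram.predict` uses suffix arrays rather than a linear scan and may smooth or back off differently.

```python
from collections import Counter
from typing import Dict, Sequence


def longest_suffix_predict(
    corpus: Sequence[int], context: Sequence[int], max_length: int = 10
) -> Dict[int, float]:
    """Hypothetical helper: next-token distribution from the longest matching suffix."""
    n = len(corpus)
    # Try the longest usable suffix first, then progressively shorter ones.
    for k in range(min(len(context), max_length), 0, -1):
        suffix = list(context[-k:])
        followers = Counter(
            corpus[i + k]
            for i in range(n - k)  # ensures a token exists after the match
            if list(corpus[i:i + k]) == suffix
        )
        if followers:
            total = sum(followers.values())
            return {tok: count / total for tok, count in followers.items()}
    # No suffix matched at all: fall back to unigram frequencies (an assumption).
    return {tok: count / n for tok, count in Counter(corpus).items()}


corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
probs = longest_suffix_predict(corpus, [2, 3])
print(probs)
```

In this corpus the context `[2, 3]` occurs three times, followed by 4 twice and by 5 once, so the sketch assigns 2/3 to token 4 and 1/3 to token 5.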

Run Examples

# Run algebraic examples
python examples/algebra_examples.py

# Run comprehensive experiments
python examples/comprehensive_experiments.py

# Run lightweight grounding experiments
python examples/lightweight_experiments.py

# Interactive notebooks
jupyter notebook notebooks/explore_algebra.ipynb

Run Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=langcalc --cov-report=html

# Run specific test categories
pytest tests/test_unit/          # Unit tests only
pytest tests/test_integration/   # Integration tests only

🔑 Key Features

Algebraic Operators

10+ operators, including +, *, |, &, ^, **, >>, <<, ~
Context transforms: LongestSuffix, MaxKWords, RecencyWeight
Mathematical consistency: Associativity, distributivity, composability
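The consistency claim can be checked numerically for the simplest case, weighted sums of next-token distributions. The sketch below represents each model's output as a plain dictionary and verifies associativity of addition and distributivity of scaling; the helper names `scale`, `add`, and `close` are hypothetical and only cover `+` and `*`, not the other operators.

```python
import math
from typing import Dict

Dist = Dict[int, float]


def scale(weight: float, dist: Dist) -> Dist:
    """Multiply every probability by a scalar weight."""
    return {tok: weight * p for tok, p in dist.items()}


def add(a: Dist, b: Dist) -> Dist:
    """Token-wise sum of two (possibly weighted) distributions."""
    out = dict(a)
    for tok, p in b.items():
        out[tok] = out.get(tok, 0.0) + p
    return out


def close(a: Dist, b: Dist) -> bool:
    """Equality up to floating-point rounding."""
    return a.keys() == b.keys() and all(
        math.isclose(a[t], b[t], rel_tol=1e-9, abs_tol=1e-12) for t in a
    )


a = {0: 0.7, 1: 0.3}
b = {1: 0.4, 2: 0.6}
c = {0: 0.1, 2: 0.9}

# Associativity: (a + b) + c == a + (b + c)
associative = close(add(add(a, b), c), add(a, add(b, c)))

# Distributivity: w * (a + b) == w * a + w * b
distributive = close(scale(0.5, add(a, b)), add(scale(0.5, a), scale(0.5, b)))

print(associative, distributive)
```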

Suffix Arrays

34x more memory efficient than n-gram hash tables
O(m log n) query time with binary search, where m is the pattern length and n is the corpus length
Variable-length patterns without fixing n in advance
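A minimal sketch of where the O(m log n) bound comes from: the suffix array lists all suffix start positions in sorted order, so every suffix that begins with a given pattern forms one contiguous range, and two binary searches find its ends. Each of the O(log n) comparisons inspects at most m tokens. The function names are hypothetical, and the construction here is deliberately naive; it is not the implementation in `langcalc/data/suffix_array.py`.

```python
from typing import List


def build_suffix_array(tokens: List[int]) -> List[int]:
    """Naive construction: sort start positions by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens: List[int], sa: List[int], pattern: List[int]) -> List[int]:
    """Return the sorted start positions of `pattern` using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose first m tokens are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m tokens are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
sa = build_suffix_array(corpus)
positions = find_occurrences(corpus, sa, [2, 3])
print(positions)
```

The same index answers queries for patterns of any length, which is why no n has to be chosen up front.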

Infinigrams (NEW in v0.4.0)

Variable-length n-grams with dynamic pattern matching
Automatic suffix array construction for efficient queries
O(m log n) query time for longest suffix matching
Incremental updates for streaming data
36 comprehensive tests covering all functionality

Production Ready

299 comprehensive tests with 100% pass rate ✅
95% coverage on core algebraic framework ⭐
Tested with Ollama integration
Only 6.5% overhead (2.66ms) with real LLMs
70% perplexity reduction with 5% grounding weight

📊 Results

Metric	Value
Package Version	0.4.0 (Beta) 🎉
Test Suite	299 tests (all passing) ✅
Test Coverage	95% on langcalc.algebra ⭐
Infinigram Tests	36 tests (NEW) ✨
Memory Efficiency	34x better (1GB vs 34GB)
Query Latency	0.03ms (suffix arrays)
Perplexity Reduction	70%
LLM Overhead	6.5%
Optimal Weight	95% LLM + 5% suffix
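For readers who want to reproduce a figure like the perplexity reduction above, the following sketch shows how such a number is typically computed: perplexity is the exponential of the mean negative log-probability assigned to the observed tokens, and the reduction is measured relative to the ungrounded model. The probabilities below are made-up illustrative values, not the project's experimental data, and the helper names are hypothetical.

```python
import math
from typing import List


def perplexity(token_probs: List[float]) -> float:
    """exp of the mean negative log-probability of the observed tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def mix(p_llm: List[float], p_ground: List[float], weight: float = 0.05) -> List[float]:
    """Per-token probability under (1 - weight) * LLM + weight * grounding model."""
    return [(1 - weight) * a + weight * b for a, b in zip(p_llm, p_ground)]


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional perplexity reduction relative to the baseline."""
    return 1.0 - improved / baseline


# Illustrative values only: probability each model gave to the true next token.
p_llm = [0.5, 0.01, 0.4]      # the LLM is badly wrong on the second token
p_ground = [0.0, 0.9, 0.0]    # the grounding model has seen that continuation

base = perplexity(p_llm)
grounded = perplexity(mix(p_llm, p_ground, weight=0.05))
print(f"reduction: {relative_reduction(base, grounded):.1%}")
```

The toy numbers show the mechanism: a small grounding weight costs little where the LLM is already right but rescues tokens to which the LLM assigned near-zero probability.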

📚 Documentation

Complete Documentation

📖 Read the Full Documentation →

The complete LangCalc documentation includes:

Getting Started - Installation, quick start, core concepts
Projection System - Mathematical formalism and implementation
User Guide - Comprehensive guides and examples
API Reference - Detailed API documentation
Advanced Topics - Suffix arrays, grounding, performance
Development Guide - Contributing, testing, code style

Quick Links

Algebraic Design - Complete API reference
Test Suite - Comprehensive test documentation and coverage
Academic Paper - Formal treatment
Examples - Practical usage
Results Analysis - Experimental findings
Test Coverage Reports - Detailed coverage analysis

🔬 Research Contributions

Unified Algebraic Framework: Treating language models as algebraic objects
Lightweight Grounding: Minimal weight (5%) for maximum benefit
Suffix Array Integration: Scalable alternative to n-grams
Context Transformations: Sophisticated operators for model composition

📖 Citation

@article{langcalc-2025,
  title={LangCalc: A Calculus for Compositional Language Modeling with Infinigram Grounding},
  year={2025}
}

🚦 Future Work

Learnable operator weights
Automatic composition search
GPU acceleration
Distributed suffix arrays

LangCalc - A calculus for language models. Built with mathematical elegance and engineering pragmatism.

Related Resources

Explore related blog posts, projects, and publications

Blog Posts

/post/2024-08-01-langcalc/
