# Quick Start Guide

Get started with LangCalc in 5 minutes! This guide walks you through creating your first language model using LangCalc's algebraic framework.
## Your First Infinigram Model

Let's create a simple infinigram model and make predictions:

```python
from langcalc import Infinigram

# Create a simple corpus (byte-level tokens)
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]

# Create an infinigram model (variable-length n-grams)
model = Infinigram(corpus, max_length=10)

# Make a prediction
context = [2, 3]
probs = model.predict(context, top_k=10)
print(f"Top predictions after [2, 3]: {probs}")
```
**What's happening?**

- `Infinigram` creates a model using suffix arrays for efficient pattern matching
- `max_length=10` means it considers patterns up to 10 tokens long
- `predict()` returns a probability distribution over next tokens
- The model finds that after `[2, 3]`, tokens `4` and `5` are likely, since they appear after `[2, 3]` in the corpus (see the sanity check below)
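For instance, you can confirm that the top prediction is a token that actually follows `[2, 3]` in the corpus. This sketch assumes `predict()` returns a dict-like mapping of token to probability, as the text example later in this guide does:

```python
# Sanity check: the most probable next token should be one that
# follows [2, 3] in the corpus. (Assumes predict() returns a
# dict of token -> probability.)
top_token = max(probs, key=probs.get)
assert top_token in (4, 5)  # 4 follows [2, 3] twice, 5 once

# Show the distribution sorted by probability, highest first
print(sorted(probs.items(), key=lambda kv: -kv[1]))
```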
## Composing Models with Algebra

LangCalc's power comes from algebraic composition:

```python
from langcalc import Infinigram, NGramModel

# Create two models
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
infini = Infinigram(corpus, max_length=10)
ngram = NGramModel(corpus, n=3)

# Compose them with weights
model = 0.7 * infini + 0.3 * ngram

# Make predictions
context = [2, 3]
probs = model.predict(context)
```
**What's happening?**

- We create two different models (an infinigram and a 3-gram)
- Use `*` for weighted mixture: `0.7 * infini` means 70% weight
- Use `+` for ensemble: `0.7 * infini + 0.3 * ngram` combines them
- The result is a new model that leverages both approaches (checked in the sketch below)
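Under standard mixture semantics, the composed model assigns each token the weighted sum of the component probabilities. A minimal check, assuming both models expose the same dict-style `predict()` interface:

```python
# The mixture should assign each token the weighted sum of the two
# component probabilities. (Assumes dict-style predict() as above.)
p_infini = infini.predict(context)
p_ngram = ngram.predict(context)
for token, p in probs.items():
    expected = 0.7 * p_infini.get(token, 0.0) + 0.3 * p_ngram.get(token, 0.0)
    assert abs(p - expected) < 1e-9
```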
## Working with Text

Convert text to byte-level tokens:

```python
from langcalc import Infinigram

# Convert text to bytes
text = "the cat sat on the mat"
corpus = list(text.encode('utf-8'))

# Create model
model = Infinigram(corpus, max_length=20)

# Query with context
context_text = "the cat"
context = list(context_text.encode('utf-8'))

# Predict next tokens
probs = model.predict(context, top_k=256)  # All possible bytes

# Convert predictions back to characters
for token_id, prob in probs.items():
    if prob > 0.1:  # Only show high-probability predictions
        char = chr(token_id) if 32 <= token_id < 127 else f"\\x{token_id:02x}"
        print(f"  '{char}': {prob:.3f}")
```
## Using Projections

Transform context before matching:

```python
from langcalc import Infinigram
from langcalc.projections import LowercaseProjection, WhitespaceProjection
from langcalc.models.projected import ProjectedModel

# Create corpus
corpus = list("Hello World! HELLO WORLD!".encode('utf-8'))
base_model = Infinigram(corpus, max_length=20)

# Create projection pipeline (normalize before matching)
projection = WhitespaceProjection() >> LowercaseProjection()

# Apply projection to model
model = ProjectedModel(base_model, projection, corpus)

# Now queries are case-insensitive and whitespace-normalized
context = list("hello world".encode('utf-8'))
probs = model.logprobs(list(range(256)), context)
```
**What's happening?**

- `WhitespaceProjection()` normalizes whitespace
- `LowercaseProjection()` converts to lowercase
- `>>` chains them left-to-right
- `ProjectedModel` applies the pipeline before each query
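To see the pipeline in isolation, you can apply it to a raw byte sequence before any matching happens. This sketch assumes projections expose a `project()` method; the actual method name in the library may differ:

```python
# Apply the pipeline directly to a query to inspect the normalization.
# (Assumes a project() method; the real API may name this differently.)
raw = list("  HELLO   World ".encode('utf-8'))
normalized = projection.project(raw)
print(bytes(normalized).decode('utf-8'))  # expected: something like "hello world"
```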
## Using Augmentations

Alternatively, augment the corpus once at training time:

```python
from langcalc.augmentations import LowercaseAugmentation
from langcalc import Infinigram

# Create corpus
corpus = list("Hello World".encode('utf-8'))

# Augment corpus (add lowercase variant)
augmentation = LowercaseAugmentation()
augmented_corpus = augmentation.augment(corpus)

# Create model with augmented corpus
model = Infinigram(augmented_corpus, max_length=20)

# Now case-insensitive matching works automatically
context = list("HELLO".encode('utf-8'))
probs = model.predict(context, top_k=256)
```
**Projection vs Augmentation:**

- **Projection**: transform the query at prediction time (flexible, uses less memory)
- **Augmentation**: transform the corpus at training time (faster queries, uses more memory)
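The memory side of the trade-off is easy to see on the toy corpus above. Assuming `LowercaseAugmentation` appends a lowercase copy, as the comment in the example suggests, the augmented corpus is roughly twice the size of the original, while a projection leaves the corpus untouched:

```python
# Augmentation grows the indexed data; projection does not.
print(len(corpus))            # original corpus size
print(len(augmented_corpus))  # roughly 2x: original plus lowercase variant
```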
See Core Concepts for a detailed comparison.
## Advanced Example: Lightweight Grounding

Combine an LLM with an infinigram for factual grounding:

```python
from langcalc import Infinigram
from langcalc.models import OllamaModel

# Create knowledge base (e.g., Wikipedia)
with open('wikipedia.txt', 'rb') as f:
    wiki_corpus = list(f.read())
wiki = Infinigram(wiki_corpus, max_length=20)

# Create LLM
llm = OllamaModel(model_name='llama2')

# Optimal mixture: 95% LLM + 5% infinigram
# (Based on research showing 70% perplexity reduction)
grounded_model = 0.95 * llm + 0.05 * wiki

# Make predictions
context = list("The capital of France is".encode('utf-8'))
probs = grounded_model.predict(context, top_k=50)
```
**Why this works:**

- The LLM provides fluent generation
- The infinigram provides factual grounding
- Only 5% weight is needed for the 70% perplexity reduction
- The infinigram adds only 0.03ms of latency
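If you want to verify the perplexity effect on your own data, here is a minimal sketch using the `logprobs()` method shown in the projections example above. It assumes all models share that interface and that it returns one log-probability per queried token; the held-out text is a placeholder:

```python
import math

# Compare held-out perplexity of the raw LLM vs. the grounded mixture.
# (Assumes logprobs(tokens, context) returns one log-probability per token.)
def perplexity(model, tokens, context_len=20):
    total, count = 0.0, 0
    for i in range(context_len, len(tokens)):
        ctx = tokens[i - context_len:i]
        total += model.logprobs([tokens[i]], ctx)[0]
        count += 1
    return math.exp(-total / count)

held_out = list("Paris is the capital and largest city of France.".encode('utf-8'))
print(f"LLM alone: {perplexity(llm, held_out):.2f}")
print(f"Grounded:  {perplexity(grounded_model, held_out):.2f}")
```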
## More Algebraic Operations

LangCalc supports many operators:

```python
# Set operations
best_model = llm | wiki    # max(p_llm, p_wiki)
conservative = llm & wiki  # min(p_llm, p_wiki)
diff = llm ^ wiki          # symmetric difference

# Temperature scaling
creative = model ** 1.5    # Higher temperature
focused = model ** 0.5     # Lower temperature

# Negation (complement)
anti_model = ~model        # 1 - p(x)

# Subtraction (experimental)
residual = llm - ngram     # What the LLM learned beyond n-grams
```
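One way to read the temperature operator, consistent with the comments above (`** 1.5` flattens the distribution, `** 0.5` sharpens it): `model ** T` reweights each probability as p^(1/T) and renormalizes. This is an assumption about the semantics, not the library's documented formula:

```python
# A plausible model of temperature scaling: p -> p^(1/T), renormalized.
# (An assumption consistent with the comments above, not a documented formula.)
def rescale(probs, T):
    weights = {token: p ** (1.0 / T) for token, p in probs.items()}
    z = sum(weights.values())
    return {token: w / z for token, w in weights.items()}

probs = {65: 0.7, 66: 0.2, 67: 0.1}
print(rescale(probs, 1.5))  # flatter: more diversity
print(rescale(probs, 0.5))  # sharper: more focused
```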
## Complete Example

Putting it all together:

```python
from langcalc import Infinigram, NGramModel
from langcalc.models import OllamaModel
from langcalc.projections import (
    EditDistanceProjection,
    LowercaseProjection,
    WhitespaceProjection,
    RecencyProjection,
)
from langcalc.models.projected import ProjectedModel

# 1. Load corpus
with open('corpus.txt', 'rb') as f:
    corpus = list(f.read())

# 2. Create models
wiki = Infinigram(corpus, max_length=20)
ngram = NGramModel(corpus, n=5)
llm = OllamaModel(model_name='llama2')

# 3. Create projection pipeline
projection = (
    EditDistanceProjection(max_distance=1) >>  # Fix typos
    WhitespaceProjection() >>                  # Normalize whitespace
    LowercaseProjection() >>                   # Case-insensitive
    RecencyProjection(max_length=100)          # Recent tokens
)

# 4. Apply projection to wiki
projected_wiki = ProjectedModel(wiki, projection, corpus)

# 5. Compose final model
model = (
    0.85 * llm +             # 85% LLM
    0.10 * projected_wiki +  # 10% projected wiki
    0.05 * ngram             # 5% n-gram smoothing
) ** 0.9                     # Lower temperature slightly

# 6. Make predictions
context = list("The quick brown fox".encode('utf-8'))
predictions = model.predict(context, top_k=20)

# 7. Sample text
samples = model.sample(context, temperature=1.0, max_tokens=50)
generated_text = bytes(samples).decode('utf-8', errors='ignore')
print(f"Generated: {generated_text}")
```
## Interactive Exploration

Try the Jupyter notebooks for interactive experimentation:

```bash
# Start Jupyter
jupyter notebook
```

Then open, in order:

1. `notebooks/explore_algebra.ipynb` (45 min, foundations)
2. `notebooks/lightweight_grounding_demo.ipynb` (60 min, practical)
3. `notebooks/unified_algebra.ipynb` (60 min, advanced)
## Next Steps
Now that you've created your first models, learn more about:
- Core Concepts - Understand projections, augmentations, and algebra
- User Guide - Explore practical patterns and best practices
- Projection System - Deep dive into mathematical formalism
- Examples - More complete examples and use cases
## Common Patterns

### Pattern 1: Case-Insensitive Matching

```python
from langcalc import Infinigram
from langcalc.augmentations import LowercaseAugmentation

augmented_corpus = LowercaseAugmentation().augment(corpus)
model = Infinigram(augmented_corpus)
```
### Pattern 2: Robust Text Matching

```python
from langcalc import Infinigram
from langcalc.augmentations import StandardAugmentation

# Case + whitespace + Unicode normalization
augmented = StandardAugmentation().augment(corpus)
model = Infinigram(augmented)
```
### Pattern 3: Error Correction

```python
from langcalc.projections import EditDistanceProjection
from langcalc.models.projected import ProjectedModel

projection = EditDistanceProjection(max_distance=2)
model = ProjectedModel(base_model, projection, corpus)
```
### Pattern 4: Recent Context Focus

```python
from langcalc.projections import RecencyProjection
from langcalc.models.projected import ProjectedModel

projection = RecencyProjection(max_length=50)
model = ProjectedModel(base_model, projection, corpus)
```
## Troubleshooting

### Model predictions are all zero

**Solution**: Ensure the context actually occurs in the corpus, or normalize with an augmentation (or projection) so near-matches count:

```python
# Add lowercase augmentation for case-insensitive matching
from langcalc.augmentations import LowercaseAugmentation

augmented = LowercaseAugmentation().augment(corpus)
model = Infinigram(augmented)
```
### Out of memory with a large corpus

**Solution**: Use a smaller `max_length`, or consider a streaming approach that indexes only part of the corpus at a time:
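A minimal sketch of both options (the chunk size below is an arbitrary placeholder):

```python
# Option 1: a smaller max_length, as suggested above
model = Infinigram(corpus, max_length=8)

# Option 2: index only a slice of the corpus at a time
# (a simple stand-in for a true streaming approach)
chunk = corpus[:10_000_000]  # placeholder size
model = Infinigram(chunk, max_length=8)
```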
### Predictions are too conservative

**Solution**: Increase temperature or use different mixing weights:

```python
# Higher temperature = more diversity
creative_model = model ** 1.5

# Or reduce the n-gram weight
model = 0.8 * llm + 0.2 * ngram  # instead of 0.5/0.5
```
## Getting Help

- **Examples**: check the `examples/` directory for more examples
- **Tests**: see `tests/` for reference implementations
- **Discussions**: GitHub Discussions
- **Issues**: GitHub Issues