Projection System Overview
The LangCalc projection system provides a rigorous mathematical framework for context transformation and corpus augmentation in language modeling.
What are Projections?
A projection is a transformation that maps a query context onto a corpus, enabling flexible pattern matching and generalization.
Mathematical Definition:
A projection \(\pi\) takes:
- Input: Query context \(x \in \Sigma^*\) and corpus \(C \subseteq \Sigma^*\)
- Output: Transformed context \(\pi(x, C) \in \Sigma^*\)
Example:
from langcalc.projections import LowercaseProjection
projection = LowercaseProjection()
# Illustrative corpus (an assumption for this example), stored as UTF-8 byte values
corpus = list("hello world".encode('utf-8'))
# Transform "HELLO" to "hello" before matching
context = list("HELLO".encode('utf-8'))
transformed = projection.project(context, corpus)
# Result: list("hello".encode('utf-8'))
What are Augmentations?
An augmentation expands the corpus by adding transformed variants.
Mathematical Definition:
An augmentation \(\alpha\) takes:
- Input: Corpus \(C\)
- Output: Augmented corpus \(\alpha(C)\) containing \(C\) plus variants
Example:
from langcalc.augmentations import LowercaseAugmentation
augmentation = LowercaseAugmentation()
# Add lowercase variant to corpus
corpus = list("Hello World".encode('utf-8'))
augmented = augmentation.augment(corpus)
# Result: corpus + list("hello world".encode('utf-8'))
Key Innovation: Projection-Augmentation Duality
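Theorem (Duality): For certain transformations, a projection \(\pi\) and an augmentation \(\alpha\) are equivalent. Writing \(\mathrm{match}(q, C)\) for the result of matching a query \(q\) against a corpus \(C\) (notation introduced here for illustration):
\[\mathrm{match}(\pi(x, C),\, C) = \mathrm{match}(x,\, \alpha(C))\]
This means that:
- Projecting the query onto the original corpus
- Augmenting the corpus and using the original query
...produce the same matching results.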
Practical Implication: Choose the more efficient approach:
- Simple transformations (case, whitespace) → Use augmentation (pay space, save time)
- Complex transformations (edit distance, semantic) → Use projection (save space, pay time)
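The equivalence can be checked on a toy case. The sketch below is not LangCalc code: it assumes plain substring matching over Python strings, a corpus that is already lowercase, and a hypothetical `add_uppercase` augmentation. Under those assumptions, lowercasing the query and matching against the original corpus agrees with matching the raw query against the augmented corpus for all-lowercase and all-uppercase queries.

```python
def lowercase_projection(query: str) -> str:
    """Query-time transformation: normalize the query to lowercase."""
    return query.lower()


def add_uppercase(corpus: list[str]) -> list[str]:
    """Hypothetical augmentation: keep the corpus and add uppercase variants."""
    return corpus + [doc.upper() for doc in corpus]


def matches(query: str, corpus: list[str]) -> bool:
    """Toy matcher: does the query occur as a substring of any document?"""
    return any(query in doc for doc in corpus)


corpus = ["hello world"]  # already lowercase (assumption of this sketch)
queries = ["hello", "HELLO", "world", "WORLD", "bye", "BYE"]

# Approach 1: transform the query, keep the corpus (pays time per query).
via_projection = [matches(lowercase_projection(q), corpus) for q in queries]

# Approach 2: keep the query, expand the corpus once (pays 2x space).
augmented = add_uppercase(corpus)
via_augmentation = [matches(q, augmented) for q in queries]

print(via_projection == via_augmentation)  # True for these queries
```

A mixed-case query such as "hELLO" matches under the projection but not under this two-variant augmentation, so the agreement depends on the augmentation covering every query form that the projection collapses.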
System Components
1. Mathematical Formalism
- Formal definitions of projections and augmentations
- Projection algebra (composition operations)
- Complexity analysis
- Projected language models
2. Canonical Augmentations
Standard corpus augmentations:
- Case normalization: lowercase, uppercase, titlecase
- Whitespace normalization: collapsing, stripping
- Unicode normalization: NFC, NFD, NFKC, NFKD
- Punctuation handling: removal, normalization
- Composite augmentations: standard, aggressive
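As a rough illustration of what these transformations produce, the sketch below builds variants of a single string using only the Python standard library. The function `canonical_variants` is hypothetical and works on `str` rather than on byte lists; it is not the library's implementation.

```python
import re
import string
import unicodedata


def canonical_variants(text: str) -> dict[str, str]:
    """Return one transformed variant of `text` per augmentation family."""
    return {
        # Case normalization
        "lowercase": text.lower(),
        "uppercase": text.upper(),
        "titlecase": text.title(),
        # Whitespace normalization: collapse runs, then strip the ends
        "whitespace": re.sub(r"\s+", " ", text).strip(),
        # Unicode normalization (canonical and compatibility composition)
        "nfc": unicodedata.normalize("NFC", text),
        "nfkc": unicodedata.normalize("NFKC", text),
        # Punctuation handling: remove ASCII punctuation
        "no_punctuation": text.translate(
            str.maketrans("", "", string.punctuation)
        ),
    }


# "e" + combining acute accent, and the "fi" ligature (U+FB01)
variants = canonical_variants("Cafe\u0301  \ufb01le,  ok!")
for name, value in variants.items():
    print(f"{name:>15}: {value!r}")
```

An augmentation in the sense used here would add such variants to the corpus alongside the original text, which is where the space multipliers come from.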
3. Ordering Principles
Projections are non-commutative - order matters!
Canonical pipeline:
EditDistance >> Normalize >> Synonym >> LongestSuffix >> Recency
Purpose of each stage:
- EditDistance: fix typos
- Normalize: standardize
- Synonym: expand
- LongestSuffix: find patterns
- Recency: focus context
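Non-commutativity shows up even with two very simple transformations. The sketch below uses plain Python functions as stand-ins (they are not LangCalc classes): `collapse_whitespace` and a `recency` truncation that keeps the last `n` characters. Applying them in the two possible orders yields different contexts.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)


def recency(text: str, n: int) -> str:
    """Keep only the last n characters of the context."""
    return text[-n:]


context = "the   quick   fox"

# Normalize first, then truncate: the 7-character window holds more content.
normalize_then_truncate = recency(collapse_whitespace(context), 7)

# Truncate first, then normalize: part of the window was spent on extra spaces.
truncate_then_normalize = collapse_whitespace(recency(context, 7))

print(repr(normalize_then_truncate))  # 'ick fox'
print(repr(truncate_then_normalize))  # 'k fox'
```

In this toy case, normalizing before truncating preserves more of the useful context, which is consistent with Recency being the last stage of the pipeline.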
4. Reference Implementation
Complete Python implementation:
- Abstract base classes (Projection, Augmentation)
- Composition operators (>>, |, +)
- Basic and advanced projections
- Model integration (ProjectedModel, MultiProjectionModel)
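One plausible way to support `>>` chaining in Python is the `__rshift__` hook. The sketch below is an assumption about how such an operator could work, not the LangCalc source: the class names `SketchProjection`, `Composed`, `Lower`, and `Recency` are hypothetical, and only the `project(context, corpus)` call shape is taken from the examples on this page. The `|` and `+` operators are left out because their semantics are not described here.

```python
from abc import ABC, abstractmethod


class SketchProjection(ABC):
    """Hypothetical base class: maps a byte-list context to a new context."""

    @abstractmethod
    def project(self, context: list[int], corpus: list[int]) -> list[int]:
        ...

    def __rshift__(self, other: "SketchProjection") -> "SketchProjection":
        # a >> b means: apply a first, then b.
        return Composed(self, other)


class Composed(SketchProjection):
    def __init__(self, first: SketchProjection, second: SketchProjection):
        self.first = first
        self.second = second

    def project(self, context, corpus):
        intermediate = self.first.project(context, corpus)
        return self.second.project(intermediate, corpus)


class Lower(SketchProjection):
    def project(self, context, corpus):
        # bytes.lower() lowercases ASCII letters only.
        return list(bytes(context).lower())


class Recency(SketchProjection):
    def __init__(self, max_length: int):
        self.max_length = max_length

    def project(self, context, corpus):
        # Keep the most recent max_length byte values.
        return context[-self.max_length:]


pipeline = Lower() >> Recency(5)
result = pipeline.project(list(b"Hello World"), corpus=[])
print(bytes(result))  # b'world'
```

Because `>>` returns another projection, chains of any length compose left to right without special handling.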
Quick Examples
Example 1: Case-Insensitive Matching
Using Projection:
from langcalc.projections import LowercaseProjection
from langcalc.models.projected import ProjectedModel
# base_model and corpus are assumed to be defined already
projection = LowercaseProjection()
model = ProjectedModel(base_model, projection, corpus)
Using Augmentation:
from langcalc.augmentations import LowercaseAugmentation
# Infinigram is assumed to be imported already; its import path is not shown here
augmented_corpus = LowercaseAugmentation().augment(corpus)
model = Infinigram(augmented_corpus)
Both achieve case-insensitive matching!
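To make the projection route more concrete, here is a minimal sketch of what a query-time wrapper could look like. Everything in it is an assumption for illustration: `ToyCountModel`, `SketchProjectedModel`, and the `next_byte_counts` method are hypothetical and are not the LangCalc or Infinigram API. Only the constructor argument order `(base_model, projection, corpus)` mirrors the example above, and the sketch uses `bytes` instead of byte lists for brevity.

```python
class ToyCountModel:
    """Hypothetical base model: counts the bytes that follow a context."""

    def __init__(self, corpus: bytes):
        self.corpus = corpus

    def next_byte_counts(self, context: bytes) -> dict[int, int]:
        counts: dict[int, int] = {}
        start = self.corpus.find(context)
        while start != -1:
            end = start + len(context)
            if end < len(self.corpus):
                nxt = self.corpus[end]
                counts[nxt] = counts.get(nxt, 0) + 1
            start = self.corpus.find(context, start + 1)
        return counts


class SketchProjectedModel:
    """Hypothetical wrapper: project the context, then ask the base model."""

    def __init__(self, base_model, projection, corpus):
        self.base_model = base_model
        self.projection = projection  # callable: (context, corpus) -> context
        self.corpus = corpus

    def next_byte_counts(self, context: bytes) -> dict[int, int]:
        projected = self.projection(context, self.corpus)
        return self.base_model.next_byte_counts(projected)


corpus = b"hello world, hello there"
base = ToyCountModel(corpus)
model = SketchProjectedModel(base, lambda ctx, c: ctx.lower(), corpus)

print(base.next_byte_counts(b"HELLO "))   # {} (no exact match)
print(model.next_byte_counts(b"HELLO "))  # {119: 1, 116: 1}, i.e. 'w' and 't'
```

The corpus is untouched in this route; the cost is one extra transformation per query.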
Example 2: Robust Text Matching
from langcalc.projections import (
WhitespaceProjection,
LowercaseProjection,
RecencyProjection
)
# Chain projections
projection = (
WhitespaceProjection() >> # Normalize whitespace
LowercaseProjection() >> # Case-insensitive
RecencyProjection(100) # Keep recent 100 tokens
)
model = ProjectedModel(base_model, projection, corpus)
Example 3: Standard Augmentation
from langcalc.augmentations import StandardAugmentation
# Case + whitespace + Unicode NFC (≈8× corpus)
augmented = StandardAugmentation().augment(corpus)
model = Infinigram(augmented)
When to Use What
| Goal | Approach | Example |
|---|---|---|
| Case-insensitive | Augmentation | LowercaseAugmentation() |
| Format robustness | Augmentation | WhitespaceAugmentation() |
| Typo correction | Projection | EditDistanceProjection(max_distance=2) |
| Synonym expansion | Projection | SynonymProjection() |
| Context truncation | Projection | RecencyProjection(max_length=100) |
| Unicode compatibility | Augmentation | NFCAugmentation() |
Decision Tree:
Can the transformation be precomputed?
├─ YES → How much memory available?
│ ├─ Plenty (2-10× corpus) → Use AUGMENTATION
│ └─ Limited → Use PROJECTION
└─ NO (context-dependent) → Use PROJECTION
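The decision tree translates directly into a small helper. This is only a sketch: the function name `choose_approach` and its boolean parameters are hypothetical, and deciding whether memory counts as plentiful (roughly 2-10× the corpus size, per the tree) is left to the caller.

```python
def choose_approach(precomputable: bool, memory_is_plentiful: bool) -> str:
    """Return 'augmentation' or 'projection' following the decision tree."""
    if not precomputable:
        # Context-dependent transformations cannot be baked into the corpus.
        return "projection"
    # Precomputable: augmentation only pays off if the extra space is available.
    return "augmentation" if memory_is_plentiful else "projection"


print(choose_approach(precomputable=True, memory_is_plentiful=True))    # augmentation
print(choose_approach(precomputable=True, memory_is_plentiful=False))   # projection
print(choose_approach(precomputable=False, memory_is_plentiful=True))   # projection
```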
Space-Time Tradeoffs
| Augmentation | Space Multiplier | Query Time Saved | When to Use |
|---|---|---|---|
| Lowercase | 2× | Significant | Almost always |
| Full Case | 4× | Significant | Case-insensitive search |
| Whitespace | 2× | Moderate | Mixed formatting |
| Unicode NFC | 2× | Significant | International text |
| Full Unicode | 5× | Significant | Maximum compatibility |
| Standard | ≈8× | High | General purpose |
| Aggressive | ≈20× | Very High | Large corpora only |
Rule of thumb: If an augmentation has space multiplier \(k\) and you have memory for a \(k\times\) corpus expansion, use augmentation. Otherwise, use projection.
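The rule of thumb can be turned into a quick budget check. The sketch below copies the space multipliers from the table above (the Standard and Aggressive values are approximate there) and lists which augmentations fit a given memory budget; the names `SPACE_MULTIPLIER` and `affordable_augmentations` are hypothetical helpers, not part of LangCalc.

```python
# Space multipliers from the table; 8 and 20 are approximate.
SPACE_MULTIPLIER = {
    "Lowercase": 2,
    "Full Case": 4,
    "Whitespace": 2,
    "Unicode NFC": 2,
    "Full Unicode": 5,
    "Standard": 8,
    "Aggressive": 20,
}


def affordable_augmentations(corpus_bytes: int, memory_bytes: int) -> list[str]:
    """Names of augmentations whose expanded corpus fits in memory_bytes."""
    return [
        name
        for name, k in SPACE_MULTIPLIER.items()
        if corpus_bytes * k <= memory_bytes
    ]


GIB = 1024 ** 3
# A 1 GiB corpus with 6 GiB of memory to spend on the index.
fits = affordable_augmentations(corpus_bytes=1 * GIB, memory_bytes=6 * GIB)
print(fits)
# ['Lowercase', 'Full Case', 'Whitespace', 'Unicode NFC', 'Full Unicode']
```

Transformations that do not fit the budget would be handled at query time with a projection instead.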
Implementation Roadmap
Phase 1: Core Infrastructure ✓
- Projection abstract base class
- Augmentation abstract base class
- Composition operators (>>, |, +)
Phase 2: Basic Projections ✓
- IdentityProjection
- RecencyProjection
- LowercaseProjection
- WhitespaceProjection
- UnicodeNormalizationProjection
Phase 3: Basic Augmentations ✓
- LowercaseAugmentation
- CaseAugmentation
- WhitespaceAugmentation
- NFCAugmentation
- StandardAugmentation
Phase 4: Model Integration 🚧
- ProjectedModel(base_model, projection, corpus)
- MultiProjectionModel(base_model, weighted_projections, corpus)
- Update InfinigramModel to accept projections/augmentations
Phase 5: Advanced Projections (Future)
- EditDistanceProjection
- LongestSuffixProjection
- SynonymProjection
Phase 6: Presets and Utilities (Future)
- StandardTextProjection pipeline
- CodeCompletionProjection pipeline
- Validation utilities
Documentation Structure
- Mathematical Formalism - Rigorous definitions and theorems
- Canonical Augmentations - Catalog of standard transformations
- Ordering Principles - Non-commutativity and canonical pipelines
- Reference Implementation - Complete code reference
- Index - Complete roadmap and API summary
Research Contributions
The projection system makes several novel contributions:
- Unified Framework: Treating projections and augmentations within a single mathematical formalism
- Duality Theorem: Proving equivalence between query-time projection and training-time augmentation
- Non-Commutativity Analysis: Establishing ordering principles for projection composition
- Canonical Augmentations: Comprehensive catalog of standard transformations
- Space-Time Tradeoffs: Quantitative analysis of augmentation costs
Next Steps
Explore the complete documentation:
- New to projections? Start with Mathematical Formalism
- Want to implement? See Reference Implementation
- Building pipelines? Read Ordering Principles
- Looking for augmentations? Browse Canonical Augmentations
- Complete reference? Check the Index