Projection System - Complete Documentation Index¶
Overview¶
This index provides a roadmap to the complete projection system documentation for LangCalc. The projection system enables flexible context transformation and corpus augmentation for improved pattern matching in language models.
Documentation Structure¶
1. Core Theory¶
formalism.md¶
Mathematical foundations of the projection system
Contents:
- Basic definitions (corpus, context, language model)
- Projection theory (projections as corpus-aware transformations)
- Corpus augmentation (normal forms)
- Projection-augmentation duality theorem
- Projection algebra (composition operations)
- Canonical augmentations (case, whitespace, Unicode)
- Complexity analysis (space-time tradeoffs)
- Projected language models
- Future directions (learnable, semantic, adaptive projections)
Key Concepts:
- Projection: \(\pi: \Sigma^* \times 2^{\Sigma^*} \to \Sigma^*\)
- Augmentation: \(\alpha: 2^{\Sigma^*} \to 2^{\Sigma^*}\)
- Duality: \(\text{LMS}(\pi(x, C), C) = \text{LMS}(x, \alpha(C))\)
- Composition: \(\pi_1 \circ \pi_2\), \(\pi_1 \sqcup \pi_2\)
Read this: For mathematical understanding and theoretical foundations.
2. Canonical Augmentations¶
augmentations.md¶
Catalog of standard corpus augmentations (normal forms)
Contents:
- Case normalization (lowercase, uppercase, titlecase)
- Whitespace normalization (collapsing, stripping)
- Unicode normalization (NFC, NFD, NFKC, NFKD)
- Punctuation handling (removal, normalization)
- Composite augmentations (standard, aggressive)
- Language-specific augmentations (ASCII folding)
- Augmentation composition (sequential, parallel)
- Recommended augmentation sets
- Space-time tradeoffs
- Implementation checklist
- Testing strategy
Key Augmentations:
- LowercaseAugmentation - Case-insensitive matching (2× corpus)
- CaseAugmentation - Full case coverage (4× corpus)
- WhitespaceAugmentation - Format robustness (2× corpus)
- NFCAugmentation - Unicode handling (2× corpus)
- StandardAugmentation - Recommended default (≈8× corpus)
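The size factors compose roughly multiplicatively when augmentations are chained; the ≈8× figure for StandardAugmentation is consistent with, for example, full case coverage (4×) combined with whitespace variants (2×). A minimal sketch of that arithmetic, assuming the `+` composition operator described in the implementation reference:
from langcalc.augmentations import CaseAugmentation, WhitespaceAugmentation

# Chained augmentations multiply their corpus-size factors:
# case coverage (~4x) combined with whitespace variants (~2x) is roughly 8x.
corpus = list('Hello World'.encode('utf-8'))
combined = CaseAugmentation() + WhitespaceAugmentation()
augmented = combined.augment(corpus)
print(len(augmented) / len(corpus))   # expect a factor on the order of 8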
Read this: For practical augmentation implementation guide.
3. Implementation Reference¶
implementation.md¶
Complete reference implementation of the projection system
Contents:
- Core abstractions (Projection, Augmentation base classes)
- Composition implementations (sequential, parallel, weighted)
- Basic projections (identity, recency, truncation)
- Normalization projections (case, whitespace, Unicode)
- Advanced projections (edit distance, longest suffix)
- Basic augmentations (case, whitespace)
- Model integration (ProjectedModel, MultiProjectionModel)
- Usage examples
- Testing strategy
Key Classes:
class Projection(ABC):
    def project(self, context: List[int], corpus: List[int]) -> List[int]
    def project_multi(self, context: List[int], corpus: List[int]) -> Set[Tuple[int, ...]]
    def __rshift__(self, other: 'Projection') -> 'Projection'  # >>
    def __or__(self, other: 'Projection') -> 'Projection'      # |

class Augmentation(ABC):
    def augment(self, corpus: List[int]) -> List[int]
    def __add__(self, other: 'Augmentation') -> 'Augmentation'
Read this: For implementation details and code examples.
4. Ordering and Composition¶
ordering.md¶
Non-commutativity and ordering principles for projection composition
Contents:
- Non-commutativity of projections (proof and examples)
- When order matters (dependency analysis)
- Classes of dependencies (lossy→lossless, correction→transformation, etc.)
- Canonical ordering principles
- Commutativity classes (identifying commutative pairs)
- Practical guidelines (decision tree, testing)
- Multi-path projections (exploring multiple orders)
- Mathematical properties (associativity, partial ordering)
- Case studies (search, code completion, chat)
Key Principles:
1. Error correction first - Fix typos before semantic transformations
2. Normalization before expansion - Normalize, then expand synonyms
3. Semantic before structural - Expand semantic space before matching
4. Lossy operations last - Preserve information as long as possible
Canonical Pipeline:
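A sketch of a pipeline that applies the four principles above in order, built from projections named elsewhere in this index; the exact classes and parameters are illustrative, not the definitive pipeline from ordering.md:
from langcalc.projections import (
    WhitespaceProjection, LowercaseProjection, RecencyProjection
)
from langcalc.projections.advanced import EditDistanceProjection, SynonymProjection

# 1. correction  2. normalization  3. semantic expansion  4. lossy truncation last
canonical = (
    EditDistanceProjection(max_distance=2) >>
    WhitespaceProjection() >>
    LowercaseProjection() >>
    SynonymProjection() >>
    RecencyProjection(max_length=100)
)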
Read this: To understand projection ordering and avoid common mistakes.
Quick Reference¶
When to Use What¶
| Goal | Use | Example |
|---|---|---|
| Case-insensitive matching | Augmentation | LowercaseAugmentation() |
| Format robustness | Augmentation | WhitespaceAugmentation() |
| Typo correction | Projection | EditDistanceProjection(max_distance=2) |
| Synonym expansion | Projection | SynonymProjection() |
| Context truncation | Projection | RecencyProjection(max_length=100) |
| Unicode compatibility | Augmentation | NFCAugmentation() |
| General text matching | Preset pipeline | StandardTextProjection |
Decision Tree: Projection vs Augmentation¶
Can the transformation be precomputed?
├─ YES → How expensive is it?
│   ├─ Cheap (case, whitespace) → Use AUGMENTATION
│   └─ Expensive (all variants) → Use PROJECTION
└─ NO (context-dependent) → Use PROJECTION
Examples:
- Augmentation: Lowercase (precompute all case variants)
- Projection: Edit distance (too many variants to precompute)
- Projection: Recency (depends on query context length)
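A sketch contrasting the two strategies at model-construction time, using classes from this index; the ProjectedModel wiring follows the usage examples later in this document:
from langcalc.models import InfinigramModel
from langcalc.models.projected import ProjectedModel
from langcalc.augmentations import LowercaseAugmentation
from langcalc.projections.advanced import EditDistanceProjection

corpus = list('Hello World'.encode('utf-8'))

# Cheap and precomputable: pay the space cost once, at corpus-build time
case_insensitive = InfinigramModel(LowercaseAugmentation().augment(corpus))

# Too many variants to precompute (and context-dependent): pay the time cost per query
typo_tolerant = ProjectedModel(
    InfinigramModel(corpus),
    EditDistanceProjection(max_distance=2),
    corpus,
)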
Implementation Roadmap¶
Phase 1: Core Infrastructure (Priority 1)¶
Files to create:
- langcalc/projections/base.py - Projection abstract base class
- langcalc/augmentations/base.py - Augmentation abstract base class
- langcalc/projections/composition.py - SequentialProjection, ParallelProjection
Implement:
- [ ] Projection base class with composition operators
- [ ] Augmentation base class with composition
- [ ] IdentityProjection
- [ ] Unit tests for composition
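A minimal sketch of what Phase 1 could look like, shown in one block for brevity; the method names follow the Key Classes summary above, and all internals are illustrative assumptions:
from abc import ABC, abstractmethod
from typing import List, Set, Tuple


class Projection(ABC):
    """Corpus-aware context transformation (see formalism.md)."""

    @abstractmethod
    def project(self, context: List[int], corpus: List[int]) -> List[int]:
        ...

    def project_multi(self, context: List[int], corpus: List[int]) -> Set[Tuple[int, ...]]:
        # Default: a single projected variant
        return {tuple(self.project(context, corpus))}

    def __rshift__(self, other: 'Projection') -> 'Projection':   # p1 >> p2
        return SequentialProjection(self, other)

    def __or__(self, other: 'Projection') -> 'Projection':       # p1 | p2
        return ParallelProjection(self, other)


class Augmentation(ABC):
    @abstractmethod
    def augment(self, corpus: List[int]) -> List[int]:
        ...

    def __add__(self, other: 'Augmentation') -> 'Augmentation':  # aug1 + aug2
        return ComposedAugmentation(self, other)


class IdentityProjection(Projection):
    def project(self, context, corpus):
        return list(context)


class SequentialProjection(Projection):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def project(self, context, corpus):
        return self.second.project(self.first.project(context, corpus), corpus)


class ParallelProjection(Projection):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def project(self, context, corpus):
        # Single-result view; project_multi exposes both branches
        return self.left.project(context, corpus)

    def project_multi(self, context, corpus):
        return self.left.project_multi(context, corpus) | self.right.project_multi(context, corpus)


class ComposedAugmentation(Augmentation):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def augment(self, corpus):
        # Apply the second augmentation to the output of the first
        return self.second.augment(self.first.augment(corpus))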
Phase 2: Basic Projections (Priority 1)¶
File: langcalc/projections/basic.py
Implement:
- [ ] RecencyProjection(max_length)
- [ ] TruncationProjection(max_length)
- [ ] LowercaseProjection()
- [ ] UppercaseProjection()
- [ ] WhitespaceProjection()
- [ ] UnicodeNormalizationProjection(form='NFC')
- [ ] Unit tests for each
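Hedged sketches of two of these, assuming the Projection base class from Phase 1 and the byte-level (UTF-8) contexts used in the test snippets below:
from langcalc.projections.base import Projection   # Phase 1 base class


class RecencyProjection(Projection):
    """Keep only the most recent max_length elements of the context."""

    def __init__(self, max_length: int):
        self.max_length = max_length

    def project(self, context, corpus):
        return list(context[-self.max_length:])


class LowercaseProjection(Projection):
    """Lowercase a UTF-8 byte context; the corpus argument is ignored."""

    def project(self, context, corpus):
        text = bytes(context).decode('utf-8', errors='ignore')  # error handling is an illustrative choice
        return list(text.lower().encode('utf-8'))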
Phase 3: Basic Augmentations (Priority 1)¶
File: langcalc/augmentations/basic.py
Implement:
- [ ] LowercaseAugmentation()
- [ ] CaseAugmentation()
- [ ] WhitespaceAugmentation()
- [ ] NFCAugmentation()
- [ ] StandardAugmentation() (preset)
- [ ] Unit tests for each
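A hedged sketch of LowercaseAugmentation that is consistent with the property test below (the original corpus remains a prefix of the augmented one); whether a separator token is inserted between variants is left open:
from langcalc.augmentations.base import Augmentation   # Phase 1 base class


class LowercaseAugmentation(Augmentation):
    """Append a lowercased copy of the corpus after the original (~2x corpus)."""

    def augment(self, corpus):
        text = bytes(corpus).decode('utf-8', errors='ignore')
        lowered = list(text.lower().encode('utf-8'))
        return list(corpus) + lowered   # original corpus stays a prefix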
Phase 4: Model Integration (Priority 1)¶
File: langcalc/models/projected.py
Implement:
- [ ] ProjectedModel(base_model, projection, corpus)
- [ ] MultiProjectionModel(base_model, weighted_projections, corpus)
- [ ] Update InfinigramModel to accept projection/augmentation
- [ ] Integration tests
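A minimal sketch of ProjectedModel; the base-model interface (logprobs(tokens, context)) is taken from the usage examples below, everything else is an illustrative assumption:
class ProjectedModel:
    """Wrap a base model; project every query context before delegating."""

    def __init__(self, base_model, projection, corpus):
        self.base_model = base_model
        self.projection = projection
        self.corpus = corpus

    def logprobs(self, tokens, context):
        # Project the query onto the corpus, then score with the base model
        projected = self.projection.project(context, self.corpus)
        return self.base_model.logprobs(tokens, projected)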
Phase 5: Advanced Projections (Priority 2)¶
File: langcalc/projections/advanced.py
Implement:
- [ ] EditDistanceProjection(max_distance)
- [ ] LongestSuffixProjection(min_length)
- [ ] SynonymProjection() (requires WordNet/embedding)
- [ ] Unit tests
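A hedged sketch of LongestSuffixProjection using a naive scan; a real implementation would use the suffix-array index referenced under External Resources, and the min_length fallback behaviour shown here is an assumption:
from langcalc.projections.base import Projection   # Phase 1 base class


class LongestSuffixProjection(Projection):
    """Keep the longest context suffix that actually occurs in the corpus."""

    def __init__(self, min_length: int = 1):
        self.min_length = min_length

    def project(self, context, corpus):
        corpus_bytes = bytes(corpus)
        for start in range(len(context)):
            suffix = bytes(context[start:])
            if len(suffix) < self.min_length:
                break
            if suffix in corpus_bytes:   # naive containment check; a suffix array makes this fast
                return list(context[start:])
        # Fallback when no suffix matches: keep the most recent min_length elements
        return list(context[-self.min_length:]) if context else []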
Phase 6: Advanced Augmentations (Priority 2)¶
File: langcalc/augmentations/advanced.py
Implement:
- [ ] UnicodeAugmentation() (all forms)
- [ ] PunctuationAugmentation()
- [ ] NoPunctuationAugmentation()
- [ ] ASCIIFoldingAugmentation()
- [ ] Unit tests
Phase 7: Presets and Utilities (Priority 3)¶
Files:
- langcalc/projections/presets.py
- langcalc/augmentations/presets.py
Implement:
- [ ] StandardTextProjection pipeline
- [ ] CodeCompletionProjection pipeline
- [ ] ChatMessageProjection pipeline
- [ ] MinimalAugmentation preset
- [ ] AggressiveAugmentation preset
- [ ] Validation utilities
- [ ] Commutativity testing utilities
Testing Strategy¶
Unit Tests¶
Test each projection individually:
from langcalc.projections import LowercaseProjection

def test_lowercase_projection():
    proj = LowercaseProjection()
    context = list("Hello".encode('utf-8'))
    result = proj.project(context, corpus=[])
    assert bytes(result).decode('utf-8') == "hello"
Test each augmentation individually:
from langcalc.augmentations import LowercaseAugmentation

def test_lowercase_augmentation():
    aug = LowercaseAugmentation()
    corpus = list("Hello".encode('utf-8'))
    result = aug.augment(corpus)
    text = bytes(result).decode('utf-8')
    assert "Hello" in text and "hello" in text
Integration Tests¶
Test projection-augmentation duality:
import numpy as np

from langcalc.models import InfinigramModel
from langcalc.models.projected import ProjectedModel
from langcalc.projections import LowercaseProjection
from langcalc.augmentations import LowercaseAugmentation

def test_duality():
    corpus = list("Hello World".encode('utf-8'))
    context = list("HELLO".encode('utf-8'))

    # Approach 1: Project query
    proj_model = ProjectedModel(
        InfinigramModel(corpus),
        LowercaseProjection(),
        corpus
    )

    # Approach 2: Augment corpus
    aug_corpus = LowercaseAugmentation().augment(corpus)
    aug_model = InfinigramModel(aug_corpus)

    # Should produce similar results
    tokens = list(range(256))
    probs1 = proj_model.logprobs(tokens, context)
    probs2 = aug_model.logprobs(tokens, context)
    assert np.allclose(probs1, probs2, atol=0.1)
Test composition:
from langcalc.projections import (
    WhitespaceProjection, LowercaseProjection, RecencyProjection
)

def test_composition():
    pipeline = (
        WhitespaceProjection() >>
        LowercaseProjection() >>
        RecencyProjection(10)
    )
    context = list(" HELLO WORLD ".encode('utf-8'))
    result = pipeline.project(context, corpus=[])
    # Should normalize whitespace, lowercase, then keep only the most recent window
    text = bytes(result).decode('utf-8')
    assert "hello world".endswith(text)  # normalized and lowercased (possibly truncated)
    assert len(result) <= 10 * 4         # truncated (UTF-8 max 4 bytes/char)
Property-Based Tests¶
from hypothesis import given, strategies as st

from langcalc.projections import IdentityProjection
from langcalc.augmentations import LowercaseAugmentation

@given(st.lists(st.integers(0, 255)))
def test_identity_projection_is_identity(context):
    proj = IdentityProjection()
    assert proj.project(context, corpus=[]) == context

@given(st.lists(st.integers(0, 255)))
def test_augmentation_includes_original(corpus):
    aug = LowercaseAugmentation()
    result = aug.augment(corpus)
    assert corpus == result[:len(corpus)]  # Original corpus is a prefix of the augmented one
Usage Examples¶
Example 1: Simple Case-Insensitive Model¶
from langcalc.models import InfinigramModel
from langcalc.augmentations import LowercaseAugmentation
# Create corpus with lowercase augmentation
corpus = list("Hello World".encode('utf-8'))
augmented = LowercaseAugmentation().augment(corpus)
# Create model
model = InfinigramModel(augmented)
# Query with uppercase - will match
context = list("HELLO".encode('utf-8'))
probs = model.logprobs(list(range(256)), context)
Example 2: Projection Pipeline¶
from langcalc.models import InfinigramModel
from langcalc.projections import (
    WhitespaceProjection, LowercaseProjection, RecencyProjection
)
from langcalc.models.projected import ProjectedModel

# Define pipeline
projection = (
    WhitespaceProjection() >>
    LowercaseProjection() >>
    RecencyProjection(max_length=100)
)

# Create model
corpus = list("the cat sat on the mat".encode('utf-8'))
base_model = InfinigramModel(corpus)
model = ProjectedModel(base_model, projection, corpus)

# Query with messy input
context = list(" THE CAT ".encode('utf-8'))
samples = model.sample(context, max_tokens=10)
Example 3: Multi-Projection Model¶
from langcalc.models import InfinigramModel
from langcalc.projections import IdentityProjection, LowercaseProjection
from langcalc.models.projected import MultiProjectionModel

# Try multiple projections with weights (corpus, tokens, context as in the examples above)
projections = [
    (IdentityProjection(), 0.7),   # Original: 70%
    (LowercaseProjection(), 0.3),  # Lowercase: 30%
]

model = MultiProjectionModel(
    InfinigramModel(corpus),
    projections,
    corpus
)

# Model tries both projections, weighted mixture
probs = model.logprobs(tokens, context)
Example 4: Standard Text Pipeline¶
from langcalc.models import InfinigramModel
from langcalc.projections.presets import StandardTextProjection
from langcalc.models.projected import ProjectedModel

# Use preset pipeline (corpus as in the examples above)
projection = StandardTextProjection()
model = ProjectedModel(
    InfinigramModel(corpus),
    projection,
    corpus
)
Migration Guide¶
From NGramModel with Projections¶
Old code:
from langcalc.models.ngram import NGramModel
from langcalc.projections import RecencyProjection
model = NGramModel(corpus, n=3, projection=RecencyProjection(decay=0.9))
New code:
from langcalc.models import InfinigramModel
from langcalc.projections import RecencyProjection
from langcalc.models.projected import ProjectedModel
base_model = InfinigramModel(corpus, max_length=3)
projection = RecencyProjection(max_length=10)
model = ProjectedModel(base_model, projection, corpus)
From Infinigram Augmentations¶
Infinigram augmentations (corpus-level):
# Infinigram's augmentation (training-time)
from infinigram import augment
augmented_corpus = augment(corpus, ['lowercase', 'uppercase'])
LangCalc augmentations (equivalent):
# LangCalc's augmentation (training-time)
from langcalc.augmentations import CaseAugmentation
augmented_corpus = CaseAugmentation().augment(corpus)
Infinigram recursive transformers (query-time):
# Infinigram's recursive transformer (query-time)
from infinigram.recursive import SynonymTransformer
model = RecursiveInfinigram(corpus, transformers=[SynonymTransformer()])
LangCalc projections (equivalent):
# LangCalc's projection (query-time)
from langcalc.projections import SynonymProjection
from langcalc.models.projected import ProjectedModel
projection = SynonymProjection()
model = ProjectedModel(InfinigramModel(corpus), projection, corpus)
API Summary¶
Core Classes¶
# Abstract base classes
from langcalc.projections import Projection
from langcalc.augmentations import Augmentation

# Basic projections
from langcalc.projections import (
    IdentityProjection,
    RecencyProjection,
    TruncationProjection,
    LowercaseProjection,
    WhitespaceProjection,
    UnicodeNormalizationProjection,
)

# Advanced projections
from langcalc.projections.advanced import (
    EditDistanceProjection,
    LongestSuffixProjection,
    SynonymProjection,
)

# Basic augmentations
from langcalc.augmentations import (
    LowercaseAugmentation,
    CaseAugmentation,
    WhitespaceAugmentation,
    NFCAugmentation,
    StandardAugmentation,
)

# Advanced augmentations
from langcalc.augmentations.advanced import (
    UnicodeAugmentation,
    PunctuationAugmentation,
    ASCIIFoldingAugmentation,
)

# Model integration
from langcalc.models.projected import (
    ProjectedModel,
    MultiProjectionModel,
)

# Presets
from langcalc.projections.presets import StandardTextProjection
from langcalc.augmentations.presets import (
    MinimalAugmentation,
    AggressiveAugmentation,
)
Composition Operators¶
# Sequential composition (left-to-right)
pipeline = proj1 >> proj2 >> proj3
# Parallel composition (union)
multi = proj1 | proj2 | proj3
# Augmentation composition
augmentation = aug1 + aug2 + aug3
References¶
Related Documentation¶
- PROJECTIONS_COMPARISON.md - Comparison with Infinigram's concepts
- OLLAMA_NGRAM_SUMMARY.md - NGramModel removal plan
- CURRENT_STATUS.md - Current implementation status
External Resources¶
- Infinigram paper: Infinigram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
- Suffix arrays: Suffix Array - Wikipedia
- Unicode normalization: UAX #15: Unicode Normalization Forms
- WordNet: WordNet - Princeton
Contributing¶
When adding new projections or augmentations:
- Document ordering constraints in the docstring
- Add unit tests for correctness
- Test commutativity with existing projections
- Update this index with your addition
- Add examples to the reference implementation
Projection Template¶
class MyProjection(Projection):
    """
    Brief description.

    Ordering constraints:
    - AFTER: [projections that should come before]
    - BEFORE: [projections that should come after]
    - COMMUTES WITH: [projections that commute with this]

    Args:
        param: Parameter description

    Example:
        >>> proj = MyProjection(param=value)
        >>> result = proj.project(context, corpus)
    """

    def __init__(self, param):
        self.param = param

    def project(self, context: List[int], corpus: List[int]) -> List[int]:
        # Implementation
        pass

    def __repr__(self) -> str:
        return f"MyProjection(param={self.param})"
Summary¶
The projection system provides:
- Flexible context transformation - Project queries onto corpus
- Efficient corpus augmentation - Precompute common transformations
- Composable algebra - Build complex pipelines from simple parts
- Principled ordering - Guidelines for non-commutative composition
- Duality theorem - Trade space for time via augmentation
Start here:
- Theory: Read formalism.md
- Practice: Read augmentations.md
- Code: Read implementation.md
- Composition: Read ordering.md
Next steps:
- Implement Phase 1 (core infrastructure)
- Implement Phase 2 (basic projections)
- Implement Phase 3 (basic augmentations)
- Update InfinigramModel to support projections/augmentations
- Remove NGramModel (once projection system is complete)