Canonical Corpus Augmentations¶
Overview¶
This document catalogs the standard corpus augmentations (normal forms) that should be supported in LangCalc. Based on the projection-augmentation duality theorem, these augmentations implement common projections efficiently by transforming the corpus once at training time rather than transforming every query.
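The implementations below assume a minimal base interface along the following lines, with corpora represented as lists of byte values. This is a sketch; the exact base class in LangCalc may differ.
from typing import List

class Augmentation:
    """Assumed base interface: map a byte-level corpus to an augmented corpus."""
    def augment(self, corpus: List[int]) -> List[int]:
        raise NotImplementedError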
1. Case Normalization¶
1.1 Lowercase Normalization¶
Mathematical Definition: \(\alpha_{\text{lower}}(C) = C \cup \{\text{lowercase}(s) : s \in C\}\)
Purpose: Enable case-insensitive matching.
Effect: Doubles corpus size (original + lowercase variant).
Implementation:
class LowercaseAugmentation(Augmentation):
    """Augment corpus with lowercase variant."""
    def augment(self, corpus: List[int]) -> List[int]:
        try:
            text = bytes(corpus).decode('utf-8')
            lower_text = text.lower()
            lower_bytes = list(lower_text.encode('utf-8'))
            return corpus + lower_bytes
        except UnicodeDecodeError:
            # If corpus is not valid UTF-8, return unchanged
            return corpus
Example:
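An illustrative snippet, assuming the interface above (expected output shown as a comment):
corpus = list("Hello World".encode('utf-8'))
augmented = LowercaseAugmentation().augment(corpus)
print(bytes(augmented).decode('utf-8'))  # 'Hello Worldhello world'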
1.2 Full Case Augmentation¶
Mathematical Definition: \(\alpha_{\text{case}}(C) = C \cup \{\text{lower}(C), \text{upper}(C), \text{title}(C)\}\)
Purpose: Maximize case-insensitive matching coverage.
Effect: 4× corpus size (original + 3 variants).
Implementation:
class CaseAugmentation(Augmentation):
    """Augment corpus with all case variants."""
    def augment(self, corpus: List[int]) -> List[int]:
        try:
            text = bytes(corpus).decode('utf-8')
            variants = [
                text,          # original
                text.lower(),  # lowercase
                text.upper(),  # uppercase
                text.title(),  # titlecase
            ]
            return [byte for variant in variants
                    for byte in variant.encode('utf-8')]
        except UnicodeDecodeError:
            return corpus
Tradeoff: Uses 4× space but completely eliminates case sensitivity.
2. Whitespace Normalization¶
2.1 Whitespace Collapsing¶
Mathematical Definition: \(\alpha_{\text{ws}}(C) = C \cup \{\text{collapse\_ws}(C)\}\)
where collapse_ws replaces each run of whitespace characters with a single space.
Purpose: Handle formatting variations (tabs, multiple spaces, etc.).
Implementation:
import re

class WhitespaceAugmentation(Augmentation):
    """Augment corpus with normalized whitespace."""
    def augment(self, corpus: List[int]) -> List[int]:
        try:
            text = bytes(corpus).decode('utf-8')
            # Collapse consecutive whitespace to a single space
            normalized = re.sub(r'\s+', ' ', text)
            normalized_bytes = list(normalized.encode('utf-8'))
            return corpus + normalized_bytes
        except UnicodeDecodeError:
            return corpus
Example:
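An illustrative snippet; the collapsed variant is appended after the original bytes:
corpus = list("Hello\t\t  World".encode('utf-8'))
augmented = WhitespaceAugmentation().augment(corpus)
# The appended variant is the collapsed form 'Hello World'
assert bytes(augmented).decode('utf-8').endswith("Hello World")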
2.2 Whitespace Stripping¶
Mathematical Definition: \(\alpha_{\text{strip}}(C) = C \cup \{\text{strip}(C)\}\)
Purpose: Remove leading/trailing whitespace.
Implementation:
class StripAugmentation(Augmentation):
    """Augment corpus with stripped variant."""
    def augment(self, corpus: List[int]) -> List[int]:
        try:
            text = bytes(corpus).decode('utf-8')
            stripped = text.strip()
            return corpus + list(stripped.encode('utf-8'))
        except UnicodeDecodeError:
            return corpus
3. Unicode Normalization¶
3.1 NFC Normalization¶
Mathematical Definition: \(\alpha_{\text{NFC}}(C) = C \cup \{\text{NFC}(C)\}\)
where NFC is Unicode Normalization Form C (Canonical Composition).
Purpose: Handle different Unicode representations of same character (e.g., é as single character vs e + combining accent).
Implementation:
import unicodedata

class NFCAugmentation(Augmentation):
    """Augment corpus with NFC normalized variant."""
    def augment(self, corpus: List[int]) -> List[int]:
        try:
            text = bytes(corpus).decode('utf-8')
            nfc_text = unicodedata.normalize('NFC', text)
            return corpus + list(nfc_text.encode('utf-8'))
        except UnicodeDecodeError:
            return corpus
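To see why this matters, compare the composed and decomposed encodings of é (an illustrative snippet):
import unicodedata

composed = '\u00e9'      # 'é' as a single code point
decomposed = 'e\u0301'   # 'e' followed by a combining acute accent
assert composed != decomposed                                # different byte sequences
assert unicodedata.normalize('NFC', decomposed) == composed  # NFC unifies them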
3.2 Full Unicode Normalization¶
Mathematical Definition: \(\alpha_{\text{unicode}}(C) = C \cup \{\text{NFC}(C), \text{NFD}(C), \text{NFKC}(C), \text{NFKD}(C)\}\)
Purpose: Maximum Unicode compatibility.
Effect: 5× corpus size.
Implementation:
class UnicodeAugmentation(Augmentation):
    """Augment corpus with all Unicode normal forms."""
    FORMS = ['NFC', 'NFD', 'NFKC', 'NFKD']

    def augment(self, corpus: List[int]) -> List[int]:
        try:
            text = bytes(corpus).decode('utf-8')
            variants = [text]  # original
            for form in self.FORMS:
                normalized = unicodedata.normalize(form, text)
                variants.append(normalized)
            return [byte for variant in variants
                    for byte in variant.encode('utf-8')]
        except UnicodeDecodeError:
            return corpus
4. Punctuation Handling¶
4.1 Punctuation Removal¶
Mathematical Definition: \(\alpha_{\text{nopunct}}(C) = C \cup \{\text{remove\_punct}(C)\}\)
Purpose: Match content regardless of punctuation.
Implementation:
import string

class NoPunctuationAugmentation(Augmentation):
    """Augment corpus with punctuation removed."""
    def augment(self, corpus: List[int]) -> List[int]:
        try:
            text = bytes(corpus).decode('utf-8')
            # Remove all ASCII punctuation (string.punctuation covers ASCII only)
            no_punct = text.translate(str.maketrans('', '', string.punctuation))
            return corpus + list(no_punct.encode('utf-8'))
        except UnicodeDecodeError:
            return corpus
Example:
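An illustrative snippet (expected result shown in the assertion):
corpus = list("Hello, World!".encode('utf-8'))
augmented = NoPunctuationAugmentation().augment(corpus)
# Original bytes followed by the punctuation-free variant
assert bytes(augmented).decode('utf-8') == "Hello, World!Hello World"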
4.2 Punctuation Normalization¶
Mathematical Definition: \(\alpha_{\text{punct}}(C) = C \cup \{\text{normalize\_punct}(C)\}\)
where normalization converts fancy quotes/dashes to ASCII equivalents.
Implementation:
class PunctuationAugmentation(Augmentation):
    """Augment corpus with normalized punctuation."""
    PUNCT_MAP = {
        '\u2018': "'",    # Left single quote
        '\u2019': "'",    # Right single quote
        '\u201C': '"',    # Left double quote
        '\u201D': '"',    # Right double quote
        '\u2013': '-',    # En dash
        '\u2014': '-',    # Em dash
        '\u2026': '...',  # Ellipsis
    }

    def augment(self, corpus: List[int]) -> List[int]:
        try:
            text = bytes(corpus).decode('utf-8')
            for fancy, simple in self.PUNCT_MAP.items():
                text = text.replace(fancy, simple)
            return corpus + list(text.encode('utf-8'))
        except UnicodeDecodeError:
            return corpus
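An illustrative usage snippet (curly quotes and an ellipsis mapped to their ASCII equivalents):
corpus = list("\u201CHello\u201D \u2026".encode('utf-8'))
augmented = PunctuationAugmentation().augment(corpus)
# The appended variant uses ASCII punctuation throughout
assert bytes(augmented).decode('utf-8').endswith('"Hello" ...')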
5. Composite Augmentations¶
5.1 Standard Normalization¶
Mathematical Definition: \(\alpha_{\text{std}} = \alpha_{\text{case}} + \alpha_{\text{ws}} + \alpha_{\text{NFC}}\)
Purpose: Common baseline normalization (case + whitespace + Unicode).
Implementation:
class StandardAugmentation(Augmentation):
    """Standard normalization: case + whitespace + Unicode NFC."""
    def __init__(self):
        self.augmentations = [
            CaseAugmentation(),
            WhitespaceAugmentation(),
            NFCAugmentation(),
        ]

    def augment(self, corpus: List[int]) -> List[int]:
        result = corpus
        for aug in self.augmentations:
            result = aug.augment(result)
        return result
Effect: Significantly larger corpus but handles most common variations.
5.2 Aggressive Normalization¶
Mathematical Definition: \(\alpha_{\text{aggressive}} = \alpha_{\text{case}} + \alpha_{\text{ws}} + \alpha_{\text{unicode}} + \alpha_{\text{nopunct}}\)
Purpose: Maximum robustness to formatting differences.
Warning: Very large corpus expansion.
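No reference implementation is listed here; the following is a sketch mirroring StandardAugmentation, assuming the same sequential chaining (the exact composition is not specified above):
class AggressiveAugmentation(Augmentation):
    """Aggressive normalization: case + whitespace + full Unicode + punctuation removal."""
    def __init__(self):
        self.augmentations = [
            CaseAugmentation(),
            WhitespaceAugmentation(),
            UnicodeAugmentation(),
            NoPunctuationAugmentation(),
        ]

    def augment(self, corpus: List[int]) -> List[int]:
        result = corpus
        for aug in self.augmentations:
            result = aug.augment(result)
        return result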
6. Language-Specific Augmentations¶
6.1 ASCII Folding¶
Mathematical Definition: \(\alpha_{\text{ascii}}(C) = C \cup \{\text{to\_ascii}(C)\}\)
where accented characters are converted to ASCII equivalents (é → e).
Purpose: Match across accented/unaccented variants.
Implementation:
class ASCIIFoldingAugmentation(Augmentation):
    """Augment corpus with ASCII-folded variant."""
    def augment(self, corpus: List[int]) -> List[int]:
        try:
            text = bytes(corpus).decode('utf-8')
            # Decompose to NFD and remove combining marks
            nfd = unicodedata.normalize('NFD', text)
            ascii_text = ''.join(
                char for char in nfd
                if unicodedata.category(char) != 'Mn'  # Mn = Mark, Nonspacing
            )
            return corpus + list(ascii_text.encode('utf-8'))
        except UnicodeDecodeError:
            return corpus
Example:
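An illustrative snippet:
corpus = list("café".encode('utf-8'))
augmented = ASCIIFoldingAugmentation().augment(corpus)
# Original followed by the folded variant 'cafe'
assert bytes(augmented).decode('utf-8') == "cafécafe"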
7. Augmentation Composition¶
7.1 Sequential Composition¶
Apply augmentations in sequence:
class SequentialAugmentation(Augmentation):
    """Compose augmentations sequentially."""
    def __init__(self, *augmentations: Augmentation):
        self.augmentations = augmentations

    def augment(self, corpus: List[int]) -> List[int]:
        result = corpus
        for aug in self.augmentations:
            result = aug.augment(result)
        return result

# Usage
aug = SequentialAugmentation(
    CaseAugmentation(),
    WhitespaceAugmentation(),
    NFCAugmentation(),
)
7.2 Parallel Composition¶
Apply augmentations independently and concatenate:
class ParallelAugmentation(Augmentation):
    """Compose augmentations in parallel."""
    def __init__(self, *augmentations: Augmentation):
        self.augmentations = augmentations

    def augment(self, corpus: List[int]) -> List[int]:
        # Start with a copy so we never mutate the caller's corpus in place
        result = list(corpus)
        # Add each augmentation's output
        for aug in self.augmentations:
            augmented = aug.augment(corpus)  # Apply to the original
            # Add only the new variants (skip the original prefix)
            result.extend(augmented[len(corpus):])
        return result
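Usage mirrors the sequential case, but each augmentation sees only the original corpus, so the variants do not compound (illustrative):
# Result: original + lowercase variant + stripped variant
aug = ParallelAugmentation(
    LowercaseAugmentation(),
    StripAugmentation(),
)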
8. Recommended Augmentation Sets¶
8.1 Minimal (2× corpus)¶
Use case: Small corpora, case-insensitive matching.
# Lowercase only
aug = LowercaseAugmentation()
8.2 Standard (≈8× corpus)¶
# Case + whitespace + Unicode NFC
aug = SequentialAugmentation(
    CaseAugmentation(),          # 4×
    WhitespaceAugmentation(),    # 2×
    NFCAugmentation(),           # 2×
)
Note: under sequential composition the factors multiply, so the worst case here is 4 × 2 × 2 = 16×; the approximate sizes in these headings assume substantial overlap between variants.
8.3 Aggressive (≈20× corpus)¶
# Everything
aug = SequentialAugmentation(
    CaseAugmentation(),             # 4×
    WhitespaceAugmentation(),       # 2×
    UnicodeAugmentation(),          # 5×
    NoPunctuationAugmentation(),    # 2×
)
8.4 Web Text (≈10× corpus)¶
# Case + whitespace + punctuation + Unicode
aug = SequentialAugmentation(
    CaseAugmentation(),
    WhitespaceAugmentation(),
    PunctuationAugmentation(),
    NFCAugmentation(),
)
9. Space-Time Tradeoffs¶
| Augmentation | Space Multiplier | Query Time Saved | When to Use |
|---|---|---|---|
| Lowercase | 2× | Significant | Almost always |
| Full Case | 4× | Significant | Case-insensitive search |
| Whitespace | 2× | Moderate | Mixed formatting |
| Unicode NFC | 2× | Significant | International text |
| Full Unicode | 5× | Significant | Maximum compatibility |
| No Punctuation | 2× | Moderate | Content-focused matching |
| Standard | ≈8× | High | General purpose |
| Aggressive | ≈20× | Very High | Large corpora only |
Rule of thumb: If you have memory for \(k\times\) corpus expansion, use augmentation. Otherwise, use query-time projection.
10. Implementation Checklist¶
Priority 1 (Must Have)¶
- LowercaseAugmentation - Case-insensitive matching
- WhitespaceAugmentation - Format robustness
- NFCAugmentation - Unicode handling
Priority 2 (Should Have)¶
- CaseAugmentation - Full case coverage
- StripAugmentation - Trim whitespace
- ASCIIFoldingAugmentation - ASCII compatibility
Priority 3 (Nice to Have)¶
- UnicodeAugmentation - Full Unicode coverage
- PunctuationAugmentation - Punctuation normalization
- NoPunctuationAugmentation - Content matching
Composition¶
- SequentialAugmentation - Chain augmentations
- ParallelAugmentation - Independent augmentations
Presets¶
- StandardAugmentation - Recommended default
- MinimalAugmentation - Space-efficient
- AggressiveAugmentation - Maximum robustness
11. Testing Strategy¶
Each augmentation should be tested for:
- Correctness: Augmented corpus contains expected variants
- UTF-8 safety: Handles invalid UTF-8 gracefully
- Idempotency: aug.augment(aug.augment(corpus)) behaves predictably
- Composition: Sequential/parallel composition works correctly
- Edge cases: Empty corpus, non-text data, special characters
Example test:
def test_lowercase_augmentation():
    corpus = list("Hello World".encode('utf-8'))
    aug = LowercaseAugmentation()
    result = aug.augment(corpus)
    # Should contain original + lowercase
    text = bytes(result).decode('utf-8')
    assert "Hello World" in text
    assert "hello world" in text
    # Should be exactly 2x the original length (holds here because
    # lowercasing ASCII text preserves byte length)
    assert len(result) == 2 * len(corpus)
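A companion edge-case test in the same style (illustrative), covering the empty-corpus item from the list above:
def test_empty_corpus():
    # An empty corpus should pass through unchanged
    assert LowercaseAugmentation().augment([]) == []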
Conclusion¶
These canonical augmentations implement common normalization needs efficiently. The key insight from the projection-augmentation duality is:
For simple transformations, pay space (augmentation) to save time (per-query projection).
This catalog provides a reference implementation for the most common use cases.