# Infinigram Benchmarks and Evaluation

## Overview
This document presents benchmark results comparing vanilla Infinigram with RecursiveInfinigram on both in-distribution and out-of-distribution (OOD) test scenarios.
## Methodology

### Corpus
- Size: 50,021 bytes
- Type: Synthetic English text
- Content: Common English words in random sentences
### Models Compared
- Vanilla Infinigram: Standard n-gram language model with suffix array
- RecursiveInfinigram: Enhanced model with transformation-based OOD handling (configuration sketched after this list):
  - CaseNormalizer
  - EditDistanceTransformer (max_distance=1)
  - SynonymTransformer (WordNet-based)
  - TransformationScorer (multi-factor weighting)
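
As a reference point, a configuration like the one benchmarked here could be assembled as follows. This is a minimal sketch: `Infinigram`, `RecursiveInfinigram`, and `CaseNormalizer` match the Running Benchmarks section below, while the `EditDistanceTransformer` and `SynonymTransformer` constructor calls are assumptions based on the list above (the scorer wiring is omitted since its constructor is not shown in this document).

```python
from infinigram.infinigram import Infinigram
from infinigram.recursive import (
    CaseNormalizer,
    EditDistanceTransformer,
    RecursiveInfinigram,
    SynonymTransformer,
)

corpus = b"the quick brown fox jumps over the lazy dog"

# Baseline: plain suffix-array n-gram model
vanilla = Infinigram(corpus)

# Enhanced model with the three transformers used in these benchmarks
# (constructor arguments are assumptions, per the lead-in above)
recursive = RecursiveInfinigram(
    corpus,
    transformers=[
        CaseNormalizer(),                         # case folding
        EditDistanceTransformer(max_distance=1),  # typo correction
        SynonymTransformer(),                     # WordNet-based synonyms
    ],
)
```
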
### Test Datasets
- In-Distribution (200 samples): Random excerpts from training corpus
- OOD: Case Changes (200 samples): Random case flips applied to 30% of letters
- OOD: Typos (200 samples): 1-2 character substitutions
- OOD: Case + Typos (200 samples): Combined case flips and typos (one possible generation recipe is sketched after this list)
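
For illustration, the perturbations could be generated roughly as below. These helpers (`flip_case`, `add_typos`) are hypothetical; the actual `BenchmarkSuite.create_ood_test` implementation may differ.

```python
import random

def flip_case(text: str, rate: float = 0.3) -> str:
    """Flip the case of roughly `rate` of the letters (OOD: Case Changes)."""
    return "".join(
        c.swapcase() if c.isalpha() and random.random() < rate else c
        for c in text
    )

def add_typos(text: str) -> str:
    """Substitute 1-2 random letters with random characters (OOD: Typos)."""
    chars = list(text)
    positions = [i for i, c in enumerate(chars) if c.isalpha()]
    for i in random.sample(positions, min(random.randint(1, 2), len(positions))):
        chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

print(flip_case("the quick brown fox"))  # e.g. "ThE quIck bRown fox"
print(add_typos("the quick brown fox"))  # e.g. "the qwick brown fox"
```
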
### Metrics
- Accuracy: % of correct top-1 predictions
- Top-k Accuracy: % where correct prediction is in top-k
- Coverage: % of contexts that have predictions
- Perplexity: Language model quality metric (lower = better; computed as sketched after this list)
- Mean Probability: Average probability assigned to correct token
- Mean Rank: Average rank of correct prediction
- Time: Average prediction time in milliseconds
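
The perplexity values reported below follow the standard definition: the exponential of the mean negative log-probability of the correct tokens. A minimal sketch (the probability floor for zero-probability tokens is an assumption):

```python
import math

def perplexity(probs: list[float], floor: float = 1e-10) -> float:
    """exp of the mean negative log-probability of the correct tokens.

    Lower is better; 1.00 means every correct token received
    probability 1, as in the in-distribution rows below.
    """
    logs = [math.log(max(p, floor)) for p in probs]  # floor guards log(0)
    return math.exp(-sum(logs) / len(logs))

print(perplexity([1.0, 1.0, 1.0]))   # 1.0
print(perplexity([0.5, 0.25, 0.5]))  # ~2.52
```
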
## Results Summary

### Comparison Table

| Dataset | Model | Accuracy | Top-3 Acc | Perplexity | Time (ms) |
|---|---|---|---|---|---|
| In-Distribution | Vanilla | 100.0% | 100.0% | 1.00 | 1.46 |
| In-Distribution | Recursive | 100.0% | 100.0% | 1.00 | 3.92 |
| OOD: Case Changes | Vanilla | 52.5% | 67.0% | 3.05 | 1.23 |
| OOD: Case Changes | Recursive | 74.5% | 81.5% | 1.95 | 96.12 |
| OOD: Typos | Vanilla | 76.5% | 86.0% | 1.59 | 0.99 |
| OOD: Typos | Recursive | 82.5% | 88.5% | 1.43 | 20.44 |
| OOD: Case + Typos | Vanilla | 40.5% | 63.0% | 3.85 | 1.17 |
| OOD: Case + Typos | Recursive | 61.0% | 73.0% | 2.31 | 71.33 |
### Improvement Over Vanilla

| Dataset | Accuracy Δ (pp) | Perplexity Reduction |
|---|---|---|
| In-Distribution | +0.0 | 0.0% |
| OOD: Case Changes | +22.0 | 36.2% |
| OOD: Typos | +6.0 | 10.2% |
| OOD: Case + Typos | +20.5 | 40.0% |
## Detailed Analysis

### In-Distribution Performance
Both models achieve perfect performance (100% accuracy, perplexity 1.00) on in-distribution data, as expected. RecursiveInfinigram is about 2.7x slower (3.92 ms vs 1.46 ms) due to transformation overhead, but the absolute cost is negligible in practice.
### OOD: Case Changes
Scenario: Input has random case changes (e.g., "ThE QuIcK BroWN")
Results:

- Vanilla struggles with case variations (52.5% accuracy)
- RecursiveInfinigram handles them well via CaseNormalizer (74.5% accuracy)
- 22 percentage point improvement in accuracy
- 36% reduction in perplexity
Why it works: CaseNormalizer transforms "ThE QuIcK" → "the quick" before querying, finding matches in the original corpus.
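
Conceptually, the transformation is just case folding of the query before the suffix-array lookup. A minimal sketch, not the actual CaseNormalizer interface:

```python
def normalize_case(context: bytes) -> bytes:
    """Lowercase the query so it can match the lowercase training corpus."""
    return context.lower()

print(normalize_case(b"ThE QuIcK"))  # b'the quick'
```
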
### OOD: Typos
Scenario: Input has 1-2 character typos (e.g., "the qwick brown")
Results:

- Vanilla has some resilience due to partial matches (76.5% accuracy)
- RecursiveInfinigram improves via EditDistanceTransformer (82.5% accuracy)
- 6 percentage point improvement in accuracy
- 10% reduction in perplexity
Why it works: EditDistanceTransformer finds corpus words within edit distance 1 of a query word, suggesting the "qwick" → "quick" correction.
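
The core idea can be sketched with a plain Levenshtein check against the corpus vocabulary. The `corrections` helper is hypothetical; the real EditDistanceTransformer consults the corpus directly:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def corrections(word: str, vocab: set[str], max_distance: int = 1) -> list[str]:
    """Corpus words within max_distance edits of the query word."""
    return [w for w in vocab if edit_distance(word, w) <= max_distance]

vocab = {"the", "quick", "brown", "fox"}
print(corrections("qwick", vocab))  # ['quick']
```
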
### OOD: Case + Typos (Hardest)
Scenario: Combined case changes and typos (e.g., "ThE qwIcK BroWN")
Results:

- Vanilla struggles significantly (40.5% accuracy)
- RecursiveInfinigram shows strong robustness (61.0% accuracy)
- 20.5 percentage point improvement in accuracy
- 40% reduction in perplexity
Why it works: Multiple transformers can chain together (sketched below):

1. CaseNormalizer: "ThE qwIcK" → "the qwick"
2. EditDistanceTransformer: "the qwick" → "the quick"
3. Result: successfully recovers the original context
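
In sketch form, reusing `normalize_case` from above (the typo correction is hardcoded here for brevity; in the real model it comes from EditDistanceTransformer, and each intermediate context is scored):

```python
context = b"ThE qwIcK"

step1 = normalize_case(context)            # b'the qwick'  (CaseNormalizer)
step2 = step1.replace(b"qwick", b"quick")  # b'the quick'  (typo correction)
assert step2 == b"the quick"               # original context recovered
```
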
## Performance Trade-offs

### Speed
RecursiveInfinigram is slower than vanilla due to transformation overhead:
- In-Distribution: 2.7x slower (3.92ms vs 1.46ms)
- OOD: Case Changes: 78x slower (96.12ms vs 1.23ms)
- OOD: Typos: 21x slower (20.44ms vs 0.99ms)
- OOD: Case + Typos: 61x slower (71.33ms vs 1.17ms)
Why: RecursiveInfinigram tries multiple transformations, each requiring:

- Transformation generation (corpus inspection)
- New suffix array queries
- Scoring and ranking
Mitigation strategies:

- Cache transformation results (see the sketch below)
- Limit transformation depth
- Use a conservative scorer when corpus coverage is high
- Prune low-scoring transformations early
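
Caching in particular can be as simple as memoizing the transformation pass per query context, since the transformations are deterministic. A hypothetical wrapper (the library may already cache internally):

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_transform(context: bytes) -> bytes:
    """Memoize the transformation of a query context.

    bytes is hashable, so repeated contexts skip the transformation
    pass (and its corpus inspection) entirely.
    """
    return normalize_case(context)  # stand-in for the full transformation pass
```
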
### Memory

RecursiveInfinigram has a memory footprint similar to vanilla:

- Same suffix array
- Additional: transformation cache (~KB)
- Additional: scorer state (~bytes)
## When to Use Each Model

### Use Vanilla Infinigram When:
- Input is in-distribution (matches training corpus well)
- Speed is critical
- Corpus coverage is very high
- You don't expect typos, case variations, or paraphrasing
### Use RecursiveInfinigram When:
- Input may have typos or spelling errors
- Input may have case variations
- Input may use synonyms not in corpus
- Robustness is more important than speed
- Working with OOD data (new domains, user-generated content)
## Scoring System Impact
The TransformationScorer uses multi-factor weighting:
- Match Length (40%): Longer suffix matches are better
- Match Frequency (20%): More common patterns are more confident
- Transformation Quality (30%): Different transformers have different reliability
- Depth Penalty (10%): Fewer transformations are better
### Transformer Reliability Weights
- Case normalization: 0.99 (very reliable)
- Typo correction: 0.95 (reliable)
- Synonym replacement: 0.85 (less reliable, meaning can shift)
This weighting ensures that:

- The original context (no transformations) scores highest when it has good matches
- Case-normalized contexts score almost as high as originals
- Synonym-based transformations are used cautiously
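
A minimal sketch of how the four factors and the reliability weights could combine. The normalization of the length and frequency terms is an assumption; the actual TransformationScorer may differ:

```python
RELIABILITY = {
    "case": 0.99,     # case normalization: very reliable
    "typo": 0.95,     # typo correction: reliable
    "synonym": 0.85,  # synonym replacement: meaning can shift
}

def score(match_len: int, max_len: int,
          freq: int, max_freq: int,
          transforms: list[str]) -> float:
    """Combine the four factors with the 40/20/30/10 weights above."""
    length_score = match_len / max_len if max_len else 0.0  # match length
    freq_score = freq / max_freq if max_freq else 0.0       # match frequency
    quality = 1.0
    for t in transforms:
        quality *= RELIABILITY[t]        # each transformation compounds risk
    depth_score = 1.0 / (1 + len(transforms))  # fewer transformations = better
    return (0.4 * length_score + 0.2 * freq_score
            + 0.3 * quality + 0.1 * depth_score)

# An untransformed context with good matches outranks a synonym-based one
print(score(8, 10, 5, 10, []))           # 0.82
print(score(8, 10, 5, 10, ["synonym"]))  # 0.7255
```
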
## Conclusion
RecursiveInfinigram demonstrates significant improvements on out-of-distribution data:
- +22.0 percentage points accuracy on case variations
- +6.0 percentage points accuracy on typos
- +20.5 percentage points accuracy on combined case changes and typos
- Up to 40% perplexity reduction on the hardest OOD scenario
The trade-off is higher latency (roughly 2.7-78x slower), but this is acceptable for many applications where robustness is critical.
The transformation-based approach successfully handles:

- ✅ Case variations
- ✅ Typos and spelling errors
- ✅ Synonyms (via WordNet)
- ✅ Multiple simultaneous transformations
## Running Benchmarks

To reproduce these results, or to run the suite on your own corpus:
```python
from infinigram.evaluation import BenchmarkSuite, print_comparison_table
from infinigram.infinigram import Infinigram
from infinigram.recursive import RecursiveInfinigram, CaseNormalizer

# Your corpus
corpus = b"your training data here"

# Create models
vanilla = Infinigram(corpus)
recursive = RecursiveInfinigram(corpus, transformers=[CaseNormalizer()])

# Create benchmark suite
suite = BenchmarkSuite(corpus)

# Create test datasets
in_dist = suite.create_in_distribution_test(num_samples=100)
ood_case = suite.create_ood_test(['case'], num_samples=100)

# Compare both models on both datasets
results = suite.compare_models(
    models={"Vanilla": vanilla, "Recursive": recursive},
    test_datasets={"In-Dist": in_dist, "OOD": ood_case},
)

# Print results
print_comparison_table(results)
```
## Test Coverage
The evaluation framework has comprehensive test coverage:
- 20 evaluation tests (test_evaluation.py)
- 33 scoring tests (test_scoring.py)
- 43 recursive tests (test_recursive.py + others)
- Total: 96 tests passing
## Future Work
Potential improvements:
- Faster transformations: Cache corpus-guided transformations
- Better scoring: Learn weights from validation data
- More transformers: Stemming, lemmatization, abbreviations
- Beam pruning: Prune low-scoring transformations early
- Parallel transformations: Try transformations in parallel
- Context-aware synonyms: Use context to disambiguate word senses