Infinigram Benchmarks and Evaluation¶

Overview¶

This document presents benchmark results comparing vanilla Infinigram with RecursiveInfinigram on both in-distribution and out-of-distribution (OOD) test scenarios.

Methodology¶

Corpus¶

Size: 50,021 bytes
Type: Synthetic English text
Content: Common English words in random sentences

Models Compared¶

Vanilla Infinigram: Standard n-gram language model with suffix array
RecursiveInfinigram: Enhanced model with transformation-based OOD handling
CaseNormalizer
EditDistanceTransformer (max_distance=1)
SynonymTransformer (WordNet-based)
TransformationScorer (multi-factor weighting)

Test Datasets¶

In-Distribution (200 samples): Random excerpts from training corpus
OOD: Case Changes (200 samples): Random case flips (30% of letters)
OOD: Typos (200 samples): 1-2 character substitutions
OOD: Case + Typos (200 samples): Combined transformations

Metrics¶

Accuracy: % of correct top-1 predictions
Top-k Accuracy: % where correct prediction is in top-k
Coverage: % of contexts that have predictions
Perplexity: Language model quality metric (lower = better)
Mean Probability: Average probability assigned to correct token
Mean Rank: Average rank of correct prediction
Time: Average prediction time in milliseconds

Results Summary¶

Comparison Table¶

Dataset	Model	Accuracy	Top-3 Acc	Perplexity	Time (ms)
In-Distribution	Vanilla	100.0%	100.0%	1.00	1.46
	Recursive	100.0%	100.0%	1.00	3.92
OOD: Case Changes	Vanilla	52.5%	67.0%	3.05	1.23
	Recursive	74.5%	81.5%	1.95	96.12
OOD: Typos	Vanilla	76.5%	86.0%	1.59	0.99
	Recursive	82.5%	88.5%	1.43	20.44
OOD: Case + Typos	Vanilla	40.5%	63.0%	3.85	1.17
	Recursive	61.0%	73.0%	2.31	71.33

Improvement Over Vanilla¶

Dataset	Accuracy Δ	Perplexity Δ
In-Distribution	+0.0%	+0.0%
OOD: Case Changes	+22.0%	+36.2%
OOD: Typos	+6.0%	+10.2%
OOD: Case + Typos	+20.5%	+40.0%

Detailed Analysis¶

In-Distribution Performance¶

Both models achieve perfect performance (100% accuracy, perplexity 1.00) on in-distribution data, as expected. RecursiveInfinigram is slightly slower (3.92ms vs 1.46ms) due to transformation overhead, but this is negligible in practice.

OOD: Case Changes¶

Scenario: Input has random case changes (e.g., "ThE QuIcK BroWN")

Results: - Vanilla struggles with case variations (52.5% accuracy) - RecursiveInfinigram handles them well via CaseNormalizer (74.5% accuracy) - 22 percentage point improvement in accuracy - 36% reduction in perplexity

Why it works: CaseNormalizer transforms "ThE QuIcK" → "the quick" before querying, finding matches in the original corpus.

OOD: Typos¶

Scenario: Input has 1-2 character typos (e.g., "the qwick brown")

Results: - Vanilla has some resilience due to partial matches (76.5% accuracy) - RecursiveInfinigram improves via EditDistanceTransformer (82.5% accuracy) - 6 percentage point improvement in accuracy - 10% reduction in perplexity

Why it works: EditDistanceTransformer detects corpus words within edit distance 1, suggesting "qwick" → "quick" correction.

OOD: Case + Typos (Hardest)¶

Scenario: Combined case changes and typos (e.g., "ThE qwIcK BroWN")

Results: - Vanilla struggles significantly (40.5% accuracy) - RecursiveInfinigram shows strong robustness (61.0% accuracy) - 20.5 percentage point improvement in accuracy - 40% reduction in perplexity

Why it works: Multiple transformers can chain together: 1. CaseNormalizer: "ThE qwIcK" → "the qwick" 2. EditDistanceTransformer: "the qwick" → "the quick" 3. Result: Successfully recovers original context

Performance Trade-offs¶

Speed¶

RecursiveInfinigram is slower than vanilla due to transformation overhead:

In-Distribution: 2.7x slower (3.92ms vs 1.46ms)
OOD: Case Changes: 78x slower (96.12ms vs 1.23ms)
OOD: Typos: 21x slower (20.44ms vs 0.99ms)
OOD: Case + Typos: 61x slower (71.33ms vs 1.17ms)

Why: RecursiveInfinigram tries multiple transformations, each requiring: - Transformation generation (corpus inspection) - New suffix array queries - Scoring and ranking

Mitigation strategies: - Cache transformation results - Limit transformation depth - Use conservative scorer when corpus coverage is high - Prune low-scoring transformations early

Memory¶

RecursiveInfinigram has similar memory footprint to vanilla: - Same suffix array - Additional: transformation cache (~KB) - Additional: scorer state (~bytes)

When to Use Each Model¶

Use Vanilla Infinigram When:¶

Input is in-distribution (matches training corpus well)
Speed is critical
Corpus coverage is very high
You don't expect typos, case variations, or paraphrasing

Use RecursiveInfinigram When:¶

Input may have typos or spelling errors
Input may have case variations
Input may use synonyms not in corpus
Robustness is more important than speed
Working with OOD data (new domains, user-generated content)

Scoring System Impact¶

The TransformationScorer uses multi-factor weighting:

Match Length (40%): Longer suffix matches are better
Match Frequency (20%): More common patterns are more confident
Transformation Quality (30%): Different transformers have different reliability
Depth Penalty (10%): Fewer transformations are better

Transformer Reliability Weights¶

Case normalization: 0.99 (very reliable)
Typo correction: 0.95 (reliable)
Synonym replacement: 0.85 (less reliable, meaning can shift)

This weighting ensures that: - Original context (no transformations) scores highest when it has good matches - Case-normalized contexts score almost as high as originals - Synonym-based transformations are used cautiously

Conclusion¶

RecursiveInfinigram demonstrates significant improvements on out-of-distribution data:

+22% accuracy on case variations
+6% accuracy on typos
+20.5% accuracy on combined challenges
Up to 40% perplexity reduction on hard OOD scenarios

The trade-off is higher latency (2-78x slower), but this is acceptable for many applications where robustness is critical.

The transformation-based approach successfully handles: - ✅ Case variations - ✅ Typos and spelling errors - ✅ Synonyms (via WordNet) - ✅ Multiple simultaneous transformations

Running Benchmarks¶

To reproduce these results:

python3 examples/run_benchmark.py

To run on your own corpus:

from infinigram.evaluation import BenchmarkSuite
from infinigram.infinigram import Infinigram
from infinigram.recursive import RecursiveInfinigram, CaseNormalizer

# Your corpus
corpus = b"your training data here"

# Create models
vanilla = Infinigram(corpus)
recursive = RecursiveInfinigram(corpus, transformers=[CaseNormalizer()])

# Create benchmark suite
suite = BenchmarkSuite(corpus)

# Create test datasets
in_dist = suite.create_in_distribution_test(num_samples=100)
ood_case = suite.create_ood_test(['case'], num_samples=100)

# Compare
results = suite.compare_models(
    models={"Vanilla": vanilla, "Recursive": recursive},
    test_datasets={"In-Dist": in_dist, "OOD": ood_case}
)

# Print results
from infinigram.evaluation import print_comparison_table
print_comparison_table(results)

Test Coverage¶

The evaluation framework has comprehensive test coverage:

20 evaluation tests (test_evaluation.py)
33 scoring tests (test_scoring.py)
43 recursive tests (test_recursive.py + others)
Total: 96 tests passing

Future Work¶

Potential improvements:

Faster transformations: Cache corpus-guided transformations
Better scoring: Learn weights from validation data
More transformers: Stemming, lemmatization, abbreviations
Beam pruning: Prune low-scoring transformations early
Parallel transformations: Try transformations in parallel
Context-aware synonyms: Use context to disambiguate word senses