Skip to content

Infinigram Benchmarks and Evaluation

Overview

This document presents benchmark results comparing vanilla Infinigram with RecursiveInfinigram on both in-distribution and out-of-distribution (OOD) test scenarios.

Methodology

Corpus

  • Size: 50,021 bytes
  • Type: Synthetic English text
  • Content: Common English words in random sentences

Models Compared

  1. Vanilla Infinigram: Standard n-gram language model with suffix array
  2. RecursiveInfinigram: Enhanced model with transformation-based OOD handling
  3. CaseNormalizer
  4. EditDistanceTransformer (max_distance=1)
  5. SynonymTransformer (WordNet-based)
  6. TransformationScorer (multi-factor weighting)

Test Datasets

  1. In-Distribution (200 samples): Random excerpts from training corpus
  2. OOD: Case Changes (200 samples): Random case flips (30% of letters)
  3. OOD: Typos (200 samples): 1-2 character substitutions
  4. OOD: Case + Typos (200 samples): Combined transformations

Metrics

  • Accuracy: % of correct top-1 predictions
  • Top-k Accuracy: % where correct prediction is in top-k
  • Coverage: % of contexts that have predictions
  • Perplexity: Language model quality metric (lower = better)
  • Mean Probability: Average probability assigned to correct token
  • Mean Rank: Average rank of correct prediction
  • Time: Average prediction time in milliseconds

Results Summary

Comparison Table

Dataset Model Accuracy Top-3 Acc Perplexity Time (ms)
In-Distribution Vanilla 100.0% 100.0% 1.00 1.46
Recursive 100.0% 100.0% 1.00 3.92
OOD: Case Changes Vanilla 52.5% 67.0% 3.05 1.23
Recursive 74.5% 81.5% 1.95 96.12
OOD: Typos Vanilla 76.5% 86.0% 1.59 0.99
Recursive 82.5% 88.5% 1.43 20.44
OOD: Case + Typos Vanilla 40.5% 63.0% 3.85 1.17
Recursive 61.0% 73.0% 2.31 71.33

Improvement Over Vanilla

Dataset Accuracy Δ Perplexity Δ
In-Distribution +0.0% +0.0%
OOD: Case Changes +22.0% +36.2%
OOD: Typos +6.0% +10.2%
OOD: Case + Typos +20.5% +40.0%

Detailed Analysis

In-Distribution Performance

Both models achieve perfect performance (100% accuracy, perplexity 1.00) on in-distribution data, as expected. RecursiveInfinigram is slightly slower (3.92ms vs 1.46ms) due to transformation overhead, but this is negligible in practice.

OOD: Case Changes

Scenario: Input has random case changes (e.g., "ThE QuIcK BroWN")

Results: - Vanilla struggles with case variations (52.5% accuracy) - RecursiveInfinigram handles them well via CaseNormalizer (74.5% accuracy) - 22 percentage point improvement in accuracy - 36% reduction in perplexity

Why it works: CaseNormalizer transforms "ThE QuIcK" → "the quick" before querying, finding matches in the original corpus.

OOD: Typos

Scenario: Input has 1-2 character typos (e.g., "the qwick brown")

Results: - Vanilla has some resilience due to partial matches (76.5% accuracy) - RecursiveInfinigram improves via EditDistanceTransformer (82.5% accuracy) - 6 percentage point improvement in accuracy - 10% reduction in perplexity

Why it works: EditDistanceTransformer detects corpus words within edit distance 1, suggesting "qwick" → "quick" correction.

OOD: Case + Typos (Hardest)

Scenario: Combined case changes and typos (e.g., "ThE qwIcK BroWN")

Results: - Vanilla struggles significantly (40.5% accuracy) - RecursiveInfinigram shows strong robustness (61.0% accuracy) - 20.5 percentage point improvement in accuracy - 40% reduction in perplexity

Why it works: Multiple transformers can chain together: 1. CaseNormalizer: "ThE qwIcK" → "the qwick" 2. EditDistanceTransformer: "the qwick" → "the quick" 3. Result: Successfully recovers original context

Performance Trade-offs

Speed

RecursiveInfinigram is slower than vanilla due to transformation overhead:

  • In-Distribution: 2.7x slower (3.92ms vs 1.46ms)
  • OOD: Case Changes: 78x slower (96.12ms vs 1.23ms)
  • OOD: Typos: 21x slower (20.44ms vs 0.99ms)
  • OOD: Case + Typos: 61x slower (71.33ms vs 1.17ms)

Why: RecursiveInfinigram tries multiple transformations, each requiring: - Transformation generation (corpus inspection) - New suffix array queries - Scoring and ranking

Mitigation strategies: - Cache transformation results - Limit transformation depth - Use conservative scorer when corpus coverage is high - Prune low-scoring transformations early

Memory

RecursiveInfinigram has similar memory footprint to vanilla: - Same suffix array - Additional: transformation cache (~KB) - Additional: scorer state (~bytes)

When to Use Each Model

Use Vanilla Infinigram When:

  • Input is in-distribution (matches training corpus well)
  • Speed is critical
  • Corpus coverage is very high
  • You don't expect typos, case variations, or paraphrasing

Use RecursiveInfinigram When:

  • Input may have typos or spelling errors
  • Input may have case variations
  • Input may use synonyms not in corpus
  • Robustness is more important than speed
  • Working with OOD data (new domains, user-generated content)

Scoring System Impact

The TransformationScorer uses multi-factor weighting:

  1. Match Length (40%): Longer suffix matches are better
  2. Match Frequency (20%): More common patterns are more confident
  3. Transformation Quality (30%): Different transformers have different reliability
  4. Depth Penalty (10%): Fewer transformations are better

Transformer Reliability Weights

  • Case normalization: 0.99 (very reliable)
  • Typo correction: 0.95 (reliable)
  • Synonym replacement: 0.85 (less reliable, meaning can shift)

This weighting ensures that: - Original context (no transformations) scores highest when it has good matches - Case-normalized contexts score almost as high as originals - Synonym-based transformations are used cautiously

Conclusion

RecursiveInfinigram demonstrates significant improvements on out-of-distribution data:

  • +22% accuracy on case variations
  • +6% accuracy on typos
  • +20.5% accuracy on combined challenges
  • Up to 40% perplexity reduction on hard OOD scenarios

The trade-off is higher latency (2-78x slower), but this is acceptable for many applications where robustness is critical.

The transformation-based approach successfully handles: - ✅ Case variations - ✅ Typos and spelling errors - ✅ Synonyms (via WordNet) - ✅ Multiple simultaneous transformations

Running Benchmarks

To reproduce these results:

python3 examples/run_benchmark.py

To run on your own corpus:

from infinigram.evaluation import BenchmarkSuite
from infinigram.infinigram import Infinigram
from infinigram.recursive import RecursiveInfinigram, CaseNormalizer

# Your corpus
corpus = b"your training data here"

# Create models
vanilla = Infinigram(corpus)
recursive = RecursiveInfinigram(corpus, transformers=[CaseNormalizer()])

# Create benchmark suite
suite = BenchmarkSuite(corpus)

# Create test datasets
in_dist = suite.create_in_distribution_test(num_samples=100)
ood_case = suite.create_ood_test(['case'], num_samples=100)

# Compare
results = suite.compare_models(
    models={"Vanilla": vanilla, "Recursive": recursive},
    test_datasets={"In-Dist": in_dist, "OOD": ood_case}
)

# Print results
from infinigram.evaluation import print_comparison_table
print_comparison_table(results)

Test Coverage

The evaluation framework has comprehensive test coverage:

  • 20 evaluation tests (test_evaluation.py)
  • 33 scoring tests (test_scoring.py)
  • 43 recursive tests (test_recursive.py + others)
  • Total: 96 tests passing

Future Work

Potential improvements:

  1. Faster transformations: Cache corpus-guided transformations
  2. Better scoring: Learn weights from validation data
  3. More transformers: Stemming, lemmatization, abbreviations
  4. Beam pruning: Prune low-scoring transformations early
  5. Parallel transformations: Try transformations in parallel
  6. Context-aware synonyms: Use context to disambiguate word senses