Transformation Scoring and Evaluation Framework¶
Overview¶
This document describes the transformation scoring system and evaluation framework for Infinigram's OOD generalization features.
Transformation Scoring System¶
Motivation¶
When using predict_search() to beam search over multiple transform combinations, each transformed context may produce different predictions. We need to weight these predictions based on:
- How good the suffix match is
- How frequently the pattern appears in the corpus
- How reliable the transformations are
- How many transformations were applied
Architecture¶
TransformationScorer Class¶
Location: infinigram/scoring.py
```python
class TransformationScorer:
    """
    Scores transformed contexts for weighted prediction combining.

    Considers multiple factors:
    - Match length (longer = better)
    - Match frequency (more occurrences = more confident)
    - Transformation depth (fewer transformations = better)
    - Transformation type (some transformers more reliable)
    """
```
Scoring Components¶
The scorer computes a final score in [0, 1] as a weighted combination of four components:
1. Match Length Score (default weight: 0.4)
   - Longer matches = more confident predictions
   - Uses sqrt for diminishing returns
2. Match Frequency Score (default weight: 0.2)
   - More occurrences = more confident
   - Uses logarithmic scaling
3. Transformation Quality Score (default weight: 0.3)
   - Different transformers have different reliability
   - Current transform reliability weights:
     - case: 0.99 (case normalization is very safe)
     - lowercase/uppercase/casefold: 0.99
     - strip/normalize_whitespace: 0.99
4. Depth Penalty Score (default weight: 0.1)
   - Fewer transformations = better (closer to original)
   - Uses exponential decay
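To make the combination concrete, here is a minimal sketch of how these four components could be blended into a single score. The weights and the sqrt/log/exponential-decay shapes follow the descriptions above, but the function name, normalization caps, and decay base are illustrative assumptions rather than the actual infinigram/scoring.py implementation.

```python
import math

# Illustrative defaults taken from the component descriptions above.
WEIGHTS = {"length": 0.4, "frequency": 0.2, "quality": 0.3, "depth": 0.1}

def combined_score(match_length: int, match_count: int,
                   transform_reliabilities: list[float]) -> float:
    """Weighted combination of the four scoring components, each kept in [0, 1]."""
    # Match length: sqrt for diminishing returns (cap of 32 bytes is an assumption).
    length_score = min(1.0, math.sqrt(match_length / 32))
    # Match frequency: logarithmic scaling (cap of 1000 occurrences is an assumption).
    freq_score = min(1.0, math.log1p(match_count) / math.log1p(1000))
    # Transformation quality: product of per-transform reliability weights.
    quality_score = math.prod(transform_reliabilities) if transform_reliabilities else 1.0
    # Depth penalty: exponential decay in the number of transforms applied (base assumed).
    depth_score = 0.8 ** len(transform_reliabilities)
    return (WEIGHTS["length"] * length_score
            + WEIGHTS["frequency"] * freq_score
            + WEIGHTS["quality"] * quality_score
            + WEIGHTS["depth"] * depth_score)
```

With these defaults an untransformed match (empty transform list) gets quality and depth components of 1.0, so an exact match is never outscored by a transformed match of the same length and frequency.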
Scorer Presets¶
Three preset configurations are provided:
1. Default Scorer
2. Conservative Scorer: use when corpus coverage is high.
3. Aggressive Scorer: use for OOD scenarios where corpus coverage is low.
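As a rough sketch of what the presets vary, the snippet below constructs scorers that emphasize different components. The keyword arguments and the specific weight values are assumptions for illustration; the actual preset weights and factory functions live in infinigram/scoring.py.

```python
from infinigram.scoring import TransformationScorer

# NOTE: keyword names and weight values are illustrative assumptions, not the real presets.
# Conservative: lean on long, frequent exact matches; penalize transformations harder.
conservative = TransformationScorer(length_weight=0.5, frequency_weight=0.25,
                                    quality_weight=0.15, depth_weight=0.1)

# Aggressive: lean on transformation quality so OOD contexts with short matches still score.
aggressive = TransformationScorer(length_weight=0.25, frequency_weight=0.1,
                                  quality_weight=0.5, depth_weight=0.15)
```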
Evaluation Framework¶
Architecture¶
Location: infinigram/evaluation.py
The evaluation framework provides tools to:

1. Evaluate models on test data
2. Create in-distribution and OOD test sets
3. Compare multiple models
4. Generate comprehensive metrics
Components¶
1. Evaluator¶
Evaluates a single model on test data.
```python
from infinigram.evaluation import Evaluator

evaluator = Evaluator(model, model_name="My Model")
metrics, results = evaluator.evaluate(test_data, top_k=10)
```
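The returned metrics object carries the fields listed under Metrics below, so results can be inspected directly; the exact shape of test_data (assumed here to be context/expected-token pairs) should be checked against infinigram/evaluation.py.

```python
# Inspect a few of the EvaluationMetrics fields (see the Metrics section below).
print(f"accuracy:   {metrics.accuracy:.2%}")
print(f"coverage:   {metrics.coverage:.2%}")
print(f"perplexity: {metrics.perplexity:.2f}")
print(f"avg time:   {metrics.mean_time_ms:.1f} ms")
```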
2. BenchmarkSuite¶
Creates test datasets and compares models.
```python
suite = BenchmarkSuite(corpus)

# Create test sets
in_dist = suite.create_in_distribution_test(num_samples=200)
ood_case = suite.create_ood_test(['case'], num_samples=200)

# Compare models
vanilla = Infinigram(corpus)
with_transforms = Infinigram(corpus, default_transforms=['lowercase'])

results = suite.compare_models(
    models={"Vanilla": vanilla, "WithTransforms": with_transforms},
    test_datasets={"In-Dist": in_dist, "OOD-Case": ood_case}
)
```
3. Metrics¶
```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class EvaluationMetrics:
    # Accuracy metrics
    accuracy: float                   # % of correct predictions
    top_k_accuracy: Dict[int, float]  # Top-k accuracy
    mean_rank: float                  # Average rank of correct token

    # Coverage metrics
    coverage: float                   # % with predictions
    no_match_rate: float              # % with no match

    # Quality metrics
    perplexity: float                 # Lower = better
    mean_probability: float           # Avg prob of correct token

    # Performance metrics
    mean_time_ms: float               # Avg prediction time
    total_time_s: float               # Total evaluation time
```
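For reference, perplexity here is the conventional exponential of the average negative log-probability of the correct tokens; whether infinigram/evaluation.py uses exactly this formula (and how it floors zero probabilities) is an assumption to verify against the source. A minimal sketch:

```python
import math

def perplexity(correct_token_probs: list[float], floor: float = 1e-10) -> float:
    """exp of the mean negative log-probability; `floor` is an assumed guard against log(0)."""
    logs = [math.log(max(p, floor)) for p in correct_token_probs]
    return math.exp(-sum(logs) / len(logs))
```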
OOD Test Generation¶
The framework can automatically create OOD test data:
1. Case Variations
```python
ood_case = suite.create_ood_test(['case'], num_samples=100)
# "the quick brown" -> "ThE QuIcK BroWN"
```
2. Typos (for testing purposes)
```python
ood_typo = suite.create_ood_test(['typo'], num_samples=100)
# "the quick brown" -> "teh qwick brown"
```
3. Combined
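Presumably, several transform names can be passed in one call to combine perturbations; this usage is an assumption based on the list-valued argument rather than documented behaviour:

```python
# Assumed usage: combine case and typo perturbations in a single OOD test set.
ood_combined = suite.create_ood_test(['case', 'typo'], num_samples=100)
```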
Running Benchmarks¶
```python
from infinigram.infinigram import Infinigram
from infinigram.evaluation import BenchmarkSuite, print_comparison_table

# Create models with different configurations
corpus = b"your training data"
vanilla = Infinigram(corpus)
with_transforms = Infinigram(corpus, default_transforms=['lowercase'])

# Create benchmark suite
suite = BenchmarkSuite(corpus)

# Create test datasets
test_datasets = {
    "In-Dist": suite.create_in_distribution_test(100),
    "OOD-Case": suite.create_ood_test(['case'], 100),
}

# Compare
results = suite.compare_models(
    models={"Vanilla": vanilla, "WithTransforms": with_transforms},
    test_datasets=test_datasets
)

# Print results
print_comparison_table(results)
```
Test Coverage¶
Scoring Tests¶
Location: tests/test_scoring.py
- Scores stay within [0, 1]
- Longer matches score higher
- More frequent patterns score higher
- Fewer transformations score higher
- Factory functions (default, conservative, aggressive)
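As an illustration of how one of these properties might be asserted, the sketch below checks that longer matches score higher; the score() signature and default construction are hypothetical and should be checked against infinigram/scoring.py and the actual tests.

```python
from infinigram.scoring import TransformationScorer

def test_longer_matches_score_higher():
    scorer = TransformationScorer()  # assumed: default weights
    # Hypothetical signature: score(match_length, match_count, transforms_applied)
    short = scorer.score(match_length=3, match_count=5, transforms_applied=[])
    long_ = scorer.score(match_length=12, match_count=5, transforms_applied=[])
    assert long_ > short
```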
Evaluation Tests¶
Location: tests/test_evaluation.py
- Evaluator initialization and basic evaluation
- Accuracy and coverage calculation
- Top-k accuracy and perplexity calculation
- In-distribution and OOD test creation
- Model comparison
Future Enhancements¶
Planned OOD Features (Deferred)¶
The following features are planned but deferred due to runtime performance concerns:
- Synonym transforms: Corpus-guided word replacement
  - Would require WordNet integration or embedding similarity
  - Runtime cost could be significant
- Typo correction: Edit-distance based transforms
  - Would need fuzzy suffix arrays or BK-trees for efficiency
  - Current implementation only for test data generation
Scoring System Improvements¶
- Adaptive Weights: Learn optimal weights from validation data
- Context-Aware Scoring: Use context length and complexity
- Confidence Intervals: Provide uncertainty estimates
Evaluation Framework Improvements¶
- Cross-Validation: k-fold evaluation
- Statistical Significance: Hypothesis testing
- Error Analysis: Categorize and analyze errors
- Visualization: Plot ROC curves, confusion matrices
References¶
- Source code: infinigram/scoring.py, infinigram/evaluation.py
- Tests: tests/test_scoring.py, tests/test_evaluation.py