Test Strategy Review: Infinigram RecursiveInfinigram System¶
Date: 2025-10-22
Modules Reviewed: infinigram/recursive.py, infinigram/scoring.py, infinigram/evaluation.py
Current Test Coverage:
- scoring.py: 100% (82/82 statements) ✅
- evaluation.py: 93% (186/201 statements) ✅
- recursive.py: 41% (110/271 statements) ⚠️
Executive Summary¶
The test suite demonstrates excellent behavioral testing for the scoring and evaluation components (100% and 93% coverage respectively), with well-structured tests that focus on contracts rather than implementation. However, the recursive transformation system has significant coverage gaps (41%), particularly around the core transformation generation logic and edge case handling.
Key Strengths¶
- Scoring tests are exemplary - Full coverage with focused, behavioral tests
- Strong property-based assertions - Tests verify mathematical properties (monotonicity, ranges, etc.)
- Good separation of concerns - Component tests are isolated and clear
- Minimal implementation coupling - Tests focus on observable behaviors
Critical Gaps¶
- SynonymTransformer core logic untested - Only 41% coverage in recursive.py
- EditDistanceTransformer transformation generation - Core algorithm not exercised
- Edge cases in prediction combining - Empty predictions, weight normalization
- Integration paths incomplete - Transformer → Scorer → Predictor flow not fully tested
- Error handling paths - Unicode errors, corpus size edge cases
Detailed Analysis by Module¶
1. Scoring Module (infinigram/scoring.py) - 100% Coverage ✅¶
Test Quality: EXCELLENT¶
What's Working Well:
- Comprehensive component testing - Each scoring component tested in isolation
- Mathematical properties verified - Ranges, monotonicity, scaling behavior
- Edge cases covered - Zero values, empty inputs, boundary conditions
- Factory pattern tested - Default, conservative, aggressive scorer variants
- Adaptive scorer - Performance tracking and analysis tested
Test Structure:
TestTransformationScorer (10 tests)
├── Behavioral properties (score ranges, ordering)
├── Component interactions (combining scores)
└── Edge cases (zero length, empty matches)
TestMatchLengthScoring (4 tests)
TestMatchFrequencyScoring (4 tests)
TestTransformationQualityScoring (5 tests)
TestDepthScoring (4 tests)
TestAdaptiveScorer (3 tests)
TestScorerFactories (4 tests)
Excellent Examples:
def test_longer_match_higher_score(self):
    """Longer matches should score higher."""
    # Tests the BEHAVIOR: longer matches → higher scores
    # NOT testing HOW the score is calculated
    assert score_long > score_short

def test_sqrt_scaling(self):
    """Should use sqrt for diminishing returns."""
    # Tests a CONTRACT: the scaling function must be sqrt
    # This is a specification, not implementation detail
    score = scorer._score_match_length(match_length=50, context_length=100)
    expected = math.sqrt(0.5)
    assert abs(score - expected) < 1e-6
No Gaps Identified - This module's tests are a model for the rest of the codebase.
2. Evaluation Module (infinigram/evaluation.py) - 93% Coverage ✅¶
Test Quality: VERY GOOD¶
What's Working Well:
- End-to-end evaluation flow tested - Evaluator, BenchmarkSuite, metrics
- Metrics calculations verified - Accuracy, coverage, perplexity, ranks
- Test data generation tested - In-distribution and OOD creation
- Model comparison framework - Multi-model, multi-dataset testing
- Practical integration - RecursiveInfinigram vs vanilla Infinigram comparison
Test Structure:
TestEvaluator (5 tests) - Core evaluation logic
TestBenchmarkSuite (7 tests) - Test generation and comparison
TestSyntheticCorpus (2 tests) - Corpus generation
TestTransformations (3 tests) - OOD transformations
TestMetrics (2 tests) - Metric calculations
TestPrintComparisonTable (1 test) - Output formatting
Missing Coverage (14 lines, 7%):¶
Lines 97, 112, 131, 138-142 - Verbose progress printing:

if verbose and i % 100 == 0:
    print(f"Evaluating {i}/{len(test_data)}...")  # Line 97
# ... more verbose prints at 112, 131, 138-142

Impact: Low - Logging only
Recommendation: Test with verbose=True or mark as excluded from coverage

Lines 204-205 - Edge case in perplexity calculation:

Impact: Medium - This handles the case where NO predictions have probability > 0
Recommendation: Add a test case with a model that never returns predictions

Lines 392, 399, 405-407 - Verbose comparison printing:

Impact: Low - Logging only
Recommendation: Test with verbose=True or exclude from coverage
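For context on why lines 204-205 exist, here is the degenerate case sketched in isolation (assuming the standard perplexity formulation, exp of the mean negative log probability; the exact code in evaluation.py may differ):

import math

# Assumed formulation: perplexity = exp(mean negative log-probability).
# One zero-probability prediction drives the mean NLL - and therefore
# the perplexity - to infinity, which is the branch lines 204-205 guard.
probs = [0.5, 0.0]
nlls = [-math.log(p) if p > 0 else math.inf for p in probs]
perplexity = math.exp(sum(nlls) / len(nlls))  # math.exp(inf) == inf
assert perplexity == math.inf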
3. Recursive Module (infinigram/recursive.py) - 41% Coverage ⚠️¶
Test Quality: NEEDS SIGNIFICANT IMPROVEMENT¶
What's Working Well:
- Basic initialization tested - RecursiveInfinigram constructor
- Cycle detection tested - Prevents infinite loops
- Max depth limiting tested - Recursion bounds respected
- Case normalizer tested - Simple transformer works
Critical Gaps (161/271 statements untested):
Gap 1: SynonymTransformer Core Logic (Lines 70-140, 146-184)¶
Untested:
- generate_transformations() - The main transformation generation logic
- Corpus inspection at match positions
- Word tokenization and comparison
- Synonym detection via WordNet
- Transformation deduplication
- Word replacement in context
Impact: HIGH - This is core OOD handling functionality
Current Test Limitation:
def test_edit_distance_transformer(self):
    transformations = transformer.generate_transformations(...)
    # Should work without errors
    assert isinstance(transformations, list)  # Too weak!
What's Missing:
# MISSING: Test actual transformation generation
def test_synonym_transformer_generates_transformations(self):
    """Test that synonyms are detected from corpus inspection."""
    corpus = b"the feline sat on the mat"
    transformer = SynonymTransformer()

    # Context: "the cat sat" (cat → feline is in corpus)
    context = b"the cat sat"
    suffix = b"sat"
    positions = find_positions_in_corpus(corpus, suffix)  # helper returning all match offsets

    transformations = transformer.generate_transformations(
        context=context,
        suffix=suffix,
        corpus=corpus,
        match_positions=positions
    )

    # BEHAVIOR: Should generate cat→feline transformation
    assert len(transformations) > 0
    new_context, desc = transformations[0]
    assert b"feline" in new_context
    assert "synonym" in desc
Gap 2: EditDistanceTransformer (Lines 284-346)¶
Untested:
- Typo detection and correction
- Edit distance calculation
- Transformation generation from corpus typos
Impact: HIGH - Another core transformation strategy
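What's Missing (a sketch, not a prescription - it assumes the same generate_transformations() signature used throughout this review, and the max_distance value and typo are illustrative):

# MISSING: Test actual typo correction via corpus inspection
def test_edit_distance_transformer_corrects_typo(self):
    """A context word close to a corpus word gets replaced."""
    corpus = b"the quick brown fox"
    # "quikc" <-> "quick" is Levenshtein distance 2 (two substitutions)
    transformer = EditDistanceTransformer(max_distance=2)

    context = b"the quikc brown"
    suffix = b"brown"
    positions = [corpus.find(suffix)]

    transformations = transformer.generate_transformations(
        context=context,
        suffix=suffix,
        corpus=corpus,
        match_positions=positions,
    )

    # BEHAVIOR: at least one transformation should restore "quick"
    assert any(b"quick" in new_context for new_context, _ in transformations)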
Gap 3: Prediction Combining Logic (Lines 645-669)¶
Untested:
- _combine_predictions() - Weighted combination of multiple predictions
- Weight normalization
- Handling empty predictions
- Combining overlapping byte predictions
Impact: HIGH - This is how recursive predictions are merged
Missing Test:
def test_combine_predictions_with_weights(self):
    """Test weighted combination of predictions."""
    model = RecursiveInfinigram(corpus)

    # Two predictions for same byte with different weights
    weighted_predictions = [
        ({ord('a'): 0.8, ord('b'): 0.2}, 0.7),  # High weight
        ({ord('a'): 0.3, ord('b'): 0.7}, 0.3),  # Low weight
    ]

    combined = model._combine_predictions(weighted_predictions)

    # Should weight towards first prediction
    assert combined[ord('a')] > combined[ord('b')]
    # Should normalize to sum to 1.0
    assert abs(sum(combined.values()) - 1.0) < 1e-6
Gap 4: Edge Cases¶
Untested scenarios:
- Empty context (len=0)
- No suffix matches found
- All transformers return empty lists
- Unicode decode errors in transformers
- Very deep recursion (depth=10+)
- Beam width = 1 (minimal beam)
- Corpus smaller than context
- Context not in corpus at all
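Two of these, sketched as tests (a sketch assuming predict() returns a possibly-empty dict rather than raising when nothing matches; the corpora are illustrative):

def test_empty_context(self):
    """Empty context must not crash; any returned distribution is valid."""
    model = RecursiveInfinigram(b"some corpus text")
    probs = model.predict(b"", max_depth=1)
    assert isinstance(probs, dict)
    if probs:  # if a fallback distribution is returned, it must be valid
        assert abs(sum(probs.values()) - 1.0) < 1e-6

def test_context_not_in_corpus(self):
    """A context sharing no suffix with the corpus must not raise."""
    model = RecursiveInfinigram(b"aaaa bbbb cccc")
    probs = model.predict(b"zzzz", max_depth=2)
    assert isinstance(probs, dict)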
Strategic Recommendations¶
Priority 1: Critical Gaps (Complete First)¶
1.1 SynonymTransformer Full Coverage¶
File: tests/test_recursive.py or create tests/test_transformers.py
class TestSynonymTransformerBehavior:
    """Test SynonymTransformer contract and behavior."""

    def test_generates_transformation_from_corpus_patterns(self):
        """Given corpus with synonym pattern, generates transformation."""
        # Test BEHAVIOR: corpus inspection → transformation generation

    def test_respects_word_boundaries(self):
        """Transformations preserve word boundaries and spacing."""
        # Test BEHAVIOR: whitespace handling is correct

    def test_deduplicates_transformations(self):
        """Multiple matches don't create duplicate transformations."""
        # Test BEHAVIOR: deduplication works

    def test_limits_transformations_per_match(self):
        """Only generates one transformation per match position."""
        # Test BEHAVIOR: prevents explosion
1.2 Prediction Combining Edge Cases¶
def test_combine_predictions_empty_list(self):
    """Empty prediction list returns empty dict."""

def test_combine_predictions_zero_total_weight(self):
    """Handles case where all weights sum to zero."""

def test_combine_predictions_overlapping_bytes(self):
    """Multiple predictions for same byte are correctly weighted."""
1.3 EditDistanceTransformer Coverage¶
class TestEditDistanceTransformerBehavior:
    def test_detects_single_char_typos(self):
        """Detects and corrects single-character substitutions."""

    def test_respects_max_distance(self):
        """Only corrects typos within max_distance."""

    def test_edit_distance_calculation_accuracy(self):
        """Levenshtein distance calculation is correct."""
Priority 2: Integration Tests (Add After P1)¶
2.1 End-to-End Transformation Flow¶
def test_recursive_prediction_with_typo_corpus_mismatch(self):
    """
    Given: Corpus with correct spelling
    When: Context has typo
    Then: RecursiveInfinigram corrects typo and makes good prediction
    """
    corpus = b"the quick brown fox jumps over the lazy dog"
    model = RecursiveInfinigram(corpus)

    # Typo: "quikc" instead of "quick"
    context = b"the quikc brown"
    probs = model.predict(context, max_depth=2)

    # Should predict ' ' (space) after "brown"
    assert ord(' ') in probs
    assert probs[ord(' ')] > 0.5

def test_recursive_prediction_with_synonym_corpus_mismatch(self):
    """Tests synonym transformation enables prediction."""
    # Similar end-to-end test with synonyms
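A minimal sketch of that synonym counterpart (it assumes WordNet is available so cat ↔ feline is recoverable, which makes this an integration test rather than a unit test):

def test_recursive_prediction_with_synonym_corpus_mismatch(self):
    """Synonym transformation enables prediction despite vocabulary mismatch."""
    corpus = b"the feline sat on the mat"
    model = RecursiveInfinigram(corpus)

    # "cat" never appears in the corpus, but "feline" does
    context = b"the cat sat on the"
    probs = model.predict(context, max_depth=2)

    # After a cat -> feline transformation, ' ' (then "mat") should follow
    assert ord(' ') in probs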
2.2 Scorer Integration¶
def test_scorer_weights_affect_prediction_ranking(self):
    """Conservative vs aggressive scorer changes prediction distribution."""
    corpus = b"test corpus"
    conservative = RecursiveInfinigram(corpus, scorer=create_conservative_scorer())
    aggressive = RecursiveInfinigram(corpus, scorer=create_aggressive_scorer())

    context = b"transformed context"
    probs_conservative = conservative.predict(context)
    probs_aggressive = aggressive.predict(context)

    # Distributions should differ based on scorer
    assert probs_conservative != probs_aggressive
Priority 3: Robustness Tests (Add After P1 & P2)¶
3.1 Error Handling¶
def test_unicode_decode_error_in_synonym_detection(self):
    """Invalid UTF-8 bytes don't crash synonym detection."""

def test_empty_corpus_handling(self):
    """Empty corpus doesn't crash initialization."""

def test_context_longer_than_corpus(self):
    """Context longer than corpus handled gracefully."""
3.2 Performance Edge Cases¶
def test_deep_recursion_performance(self):
    """Very deep recursion doesn't cause stack overflow."""

def test_large_beam_width_manageable(self):
    """Large beam widths don't cause memory explosion."""
Test Organization Assessment¶
Current Structure: GOOD¶
tests/
├── test_recursive.py # 10 tests - Basic structure only
├── test_scoring.py # 33 tests - EXCELLENT
├── test_evaluation.py # 20 tests - VERY GOOD
├── test_wordnet_integration.py # 14 tests - Failing (numpy issue)
└── test_corpus_guided_transformations.py # 11 tests - Failing
Recommended Structure:¶
tests/
├── test_recursive.py # Keep integration tests here
├── test_transformers.py # NEW: Dedicated transformer tests
│ ├── TestSynonymTransformerBehavior
│ ├── TestEditDistanceTransformerBehavior
│ └── TestCaseNormalizerBehavior
├── test_scoring.py # Keep as-is (100% coverage)
├── test_evaluation.py # Keep as-is (93% coverage)
├── test_recursive_integration.py # NEW: End-to-end workflows
│ ├── TestTypoCorrectionFlow
│ ├── TestSynonymHandlingFlow
│ └── TestScorerIntegration
└── test_edge_cases.py # NEW: Robustness and error handling
Test Quality Anti-Patterns Found¶
❌ Anti-Pattern 1: Too-Weak Assertions (test_recursive.py)¶
# BAD: Only checks type, not behavior
assert isinstance(transformations, list)
# GOOD: Checks behavior
assert len(transformations) > 0
assert any("synonym" in desc for _, desc in transformations)
❌ Anti-Pattern 2: No Assertions on Core Logic¶
def test_basic_prediction(self, simple_corpus):
    probs = model.predict(context, max_depth=1)
    assert isinstance(probs, dict)
    # May be empty if no matches, that's ok for now  # ← This is a gap!
✅ The Pattern Done Right: Excellent Behavioral Tests (test_scoring.py)¶
# EXCELLENT: Tests observable property
def test_longer_match_higher_score(self):
    score_long = scorer.score(..., match_length=15, ...)
    score_short = scorer.score(..., match_length=5, ...)
    assert score_long > score_short
Specific Test Recommendations¶
New Tests to Add to test_recursive.py¶
class TestRecursiveInfinigramPredictionCombining:
    """Test prediction combining logic."""

    def test_combine_empty_predictions_returns_empty(self):
        """Empty prediction list returns empty dict."""
        model = RecursiveInfinigram(b"test corpus")
        result = model._combine_predictions([])
        assert result == {}

    def test_combine_single_prediction_normalizes(self):
        """Single prediction is normalized to sum to 1.0."""
        model = RecursiveInfinigram(b"test corpus")
        weighted_preds = [({65: 0.3, 66: 0.7}, 1.0)]
        result = model._combine_predictions(weighted_preds)
        assert abs(sum(result.values()) - 1.0) < 1e-9

    def test_combine_respects_weights(self):
        """Higher weight predictions contribute more."""
        model = RecursiveInfinigram(b"test corpus")
        weighted_preds = [
            ({65: 1.0}, 0.9),  # High weight for 'A'
            ({66: 1.0}, 0.1),  # Low weight for 'B'
        ]
        result = model._combine_predictions(weighted_preds)
        assert result[65] > result[66]

    def test_combine_overlapping_predictions_sum(self):
        """Overlapping byte predictions are summed."""
        model = RecursiveInfinigram(b"test corpus")
        weighted_preds = [
            ({65: 0.5}, 0.5),
            ({65: 0.8}, 0.5),
        ]
        result = model._combine_predictions(weighted_preds)
        # (0.5*0.5 + 0.8*0.5) / (0.5*0.5 + 0.8*0.5) = 1.0
        assert abs(result[65] - 1.0) < 1e-9
class TestSynonymTransformerCorpusInspection:
    """Test corpus inspection and transformation generation."""

    def test_inspects_corpus_at_match_positions(self):
        """Transformer looks at corpus before match positions."""
        corpus = b"the big cat sat. the large cat stood."
        transformer = SynonymTransformer(use_wordnet=False)  # Avoid nltk

        context = b"the small cat sat"
        suffix = b"sat"
        # Find where "sat" appears in corpus
        positions = [i for i in range(len(corpus)) if corpus[i:i+3] == suffix]

        transformations = transformer.generate_transformations(
            context=context,
            suffix=suffix,
            corpus=corpus,
            match_positions=positions
        )

        # Should inspect corpus and see words differ
        # (Actual synonym detection depends on WordNet)
        assert isinstance(transformations, list)

    def test_preserves_suffix_in_transformation(self):
        """Generated transformation preserves the matched suffix."""
        corpus = b"test suffix match"
        transformer = SynonymTransformer(use_wordnet=False)

        context = b"other suffix"
        suffix = b"suffix"
        positions = [5]  # "suffix" at position 5 in corpus

        transformations = transformer.generate_transformations(
            context=context,
            suffix=suffix,
            corpus=corpus,
            match_positions=positions
        )

        # All transformations should preserve suffix
        for new_context, desc in transformations:
            assert new_context.endswith(suffix)
class TestEditDistanceTransformerCorrectness:
    """Test edit distance transformer produces correct transformations."""

    def test_edit_distance_calculation_is_accurate(self):
        """Levenshtein distance calculation matches expected values."""
        transformer = EditDistanceTransformer()

        # Known edit distances
        assert transformer._edit_distance(b"cat", b"cat") == 0
        assert transformer._edit_distance(b"cat", b"bat") == 1
        assert transformer._edit_distance(b"cat", b"dog") == 3
        assert transformer._edit_distance(b"sitting", b"kitten") == 3

    def test_only_corrects_within_max_distance(self):
        """Respects max_distance parameter."""
        transformer = EditDistanceTransformer(max_distance=1)
        corpus = b"the cat sat on the mat"

        context = b"the dog sat"  # "dog" vs "cat" = distance 3
        suffix = b"sat"
        positions = [8]

        transformations = transformer.generate_transformations(
            context=context,
            suffix=suffix,
            corpus=corpus,
            match_positions=positions
        )

        # Should NOT generate transformation (distance too large)
        # (This depends on the words lining up correctly)
        assert isinstance(transformations, list)
New Tests to Add to test_evaluation.py¶
class TestEvaluatorEdgeCases:
    """Test evaluator edge cases."""

    def test_evaluate_with_no_predictions(self):
        """Handles model that never returns predictions."""
        # Create a model that always returns empty dict
        class NoOpModel:
            def predict(self, context, top_k=10):
                return {}

        model = NoOpModel()
        evaluator = Evaluator(model, "NoOp")
        test_data = [(b"test", b"x")]

        metrics, results = evaluator.evaluate(test_data)

        # Coverage should be 0%
        assert metrics.coverage == 0.0
        # Perplexity should be infinity
        assert metrics.perplexity == float('inf')
        # Mean probability should be 0
        assert metrics.mean_probability == 0.0

    def test_evaluate_with_verbose_output(self):
        """Verbose mode prints progress (coverage for logging)."""
        corpus = b"test corpus"
        model = Infinigram(corpus)
        evaluator = Evaluator(model, "Test")
        test_data = [(b"te", b"s")] * 100  # 100 samples

        # Exercises the progress-printing branch (i % 100 == 0)
        metrics, results = evaluator.evaluate(test_data, verbose=True)
        assert len(results) == 100
Missing Test Scenarios by Feature¶
RecursiveInfinigram Core Functionality¶
| Feature | Current Coverage | Missing Tests |
|---|---|---|
| Transformation generation | 20% | Corpus inspection logic, word comparison |
| Synonym detection | 0% | WordNet integration, similarity thresholds |
| Typo correction | 10% | Edit distance, max_distance enforcement |
| Prediction combining | 0% | Weight normalization, empty predictions |
| Beam search | 40% | Beam width limiting, scoring cutoff |
| Cycle detection | 100% ✅ | None |
| Max depth | 100% ✅ | None |
Edge Cases and Error Handling¶
| Scenario | Tested? | Priority |
|---|---|---|
| Empty corpus | ❌ | High |
| Empty context | ❌ | High |
| No suffix matches | ❌ | High |
| Unicode decode errors | ❌ | Medium |
| Context longer than corpus | ❌ | Medium |
| Very deep recursion (10+) | ❌ | Low |
| Large beam width (100+) | ❌ | Low |
| Zero probability predictions | ❌ | Medium |
Integration Scenarios¶
| Integration Path | Tested? | Priority |
|---|---|---|
| Transformer → Scorer → Predictor | ❌ | High |
| Conservative vs Aggressive scorer impact | ❌ | High |
| Multiple transformations in sequence | ❌ | Medium |
| Transformation + prediction explanation | Partial | Medium |
| Benchmark suite with real OOD data | Partial | Low |
Recommendations Summary¶
Immediate Actions (Complete in 1-2 days)¶
- Add transformation generation tests (Priority 1.1)
  - Test SynonymTransformer.generate_transformations() with real examples
  - Test EditDistanceTransformer.generate_transformations() with typos
  - Verify corpus inspection logic works correctly
- Add prediction combining tests (Priority 1.2)
  - Test empty predictions, zero weights, normalization
  - Test overlapping byte predictions are summed correctly
  - Test weight distribution affects final predictions
- Add EditDistanceTransformer unit tests (Priority 1.3)
  - Test edit distance calculation accuracy
  - Test max_distance parameter enforcement
  - Test typo detection and correction logic
Short-term Improvements (Complete in 1 week)¶
- Add end-to-end integration tests (Priority 2.1)
  - Test full typo correction → prediction flow
  - Test full synonym handling → prediction flow
  - Verify explanations are generated correctly
- Add scorer integration tests (Priority 2.2)
  - Test conservative vs aggressive scorer impact on predictions
  - Verify scorer weights affect transformation selection
- Fix evaluation.py coverage gaps (small task)
  - Add test with verbose=True to cover logging
  - Add test for "no predictions" edge case (lines 204-205)
Long-term Quality Improvements (Complete in 2-3 weeks)¶
- Add robustness tests (Priority 3)
  - Test error handling (Unicode, empty inputs)
  - Test performance edge cases (deep recursion, large beam)
  - Add property-based tests with Hypothesis
- Refactor test organization
  - Create test_transformers.py for dedicated transformer tests
  - Create test_recursive_integration.py for end-to-end tests
  - Create test_edge_cases.py for robustness tests
Coverage Goals¶
Target Coverage by Module (3-month timeline)¶
| Module | Current | Target | Priority |
|---|---|---|---|
| scoring.py | 100% ✅ | 100% | Maintain |
| evaluation.py | 93% | 98% | Low (close logging gaps) |
| recursive.py | 41% ⚠️ | 85% | HIGH |
Lines to Focus On (recursive.py)¶
High Value (Core Logic):
- Lines 70-140: SynonymTransformer.generate_transformations()
- Lines 284-346: EditDistanceTransformer.generate_transformations()
- Lines 645-669: _combine_predictions()
- Lines 536-607: _recursive_transform()

Medium Value (Supporting Logic):
- Lines 146-184: _are_synonyms() and WordNet integration
- Lines 233-266: _replace_word_in_context()
- Lines 352-372: _edit_distance()

Lower Value (Helper Methods):
- Lines 609-643: _find_best_suffix_match(), _find_all_suffix_matches()
- Lines 671-731: predict_with_explanation()
Conclusion¶
The Infinigram test suite demonstrates strong test engineering practices in the scoring and evaluation modules, with behavioral tests that would remain valid even after significant refactoring. The scoring module in particular is an excellent example of TDD done right.
However, the recursive transformation system needs significant test attention. The 41% coverage represents untested core logic that handles OOD generalization - arguably the most important innovation in the system.
Key Action Items:
1. Add 15-20 focused tests for transformer generation logic (P1)
2. Add 5-10 tests for prediction combining (P1)
3. Add 10-15 integration tests for end-to-end flows (P2)
4. Reach 85% coverage on recursive.py within 3 months
The existing test structure is sound and can accommodate these additions with minimal refactoring. The scoring tests provide an excellent template for how to write resilient, behavioral tests.