active library

alga

Started 2021 C++

Resources & Distribution

Source Code

Package Registries

Alga

A mathematically elegant C++20 library for algebraic text processing and compositional parsing with fuzzy matching. Built on rigorous algebraic foundations with monoids, functors, and extended operators, Alga transforms text manipulation from imperative string processing into declarative mathematical expressions.

✨ Key Features

Mathematical Foundation: Built on rigorous algebraic structures (monoids, functors) with verified laws and properties
Fuzzy Parsing: Phonetic matching (Soundex, Metaphone), edit distance, and similarity-based parsing for noisy input
Unicode Support: Full UTF-8 encoding/decoding with multi-script alphabetic parsing (Latin, Greek, Cyrillic, Hebrew, Arabic, CJK)
Numeric Types: Type-safe parsers for integers, floats, and scientific notation with algebraic operations
Rich Error Reporting: Position tracking, context extraction, and detailed error diagnostics
High Performance: Header-only template library with zero-cost abstractions and C++20 concepts
Type Safety: Compile-time guarantees through concepts, strong typing, and mathematical invariants
Algebraic Operators: Extended operator set (*, |, ^, %, >>) for mathematical composition
Streaming Support: Memory-efficient parsing of large files with buffered I/O
Statistical Analysis: Frequency counting, entropy, diversity metrics for text analysis
Comprehensive Testing: 425 tests, 100% passing with property-based testing of algebraic laws
Production Ready: Includes industry-standard Porter2 stemmer and real-world examples

📦 Quick Start

Prerequisites

C++20 compatible compiler (GCC 10+, Clang 10+, MSVC 2019+)
CMake 3.14+
Optional: Google Test for running tests

Installation

git clone https://github.com/yourusername/alga.git
cd alga
mkdir build && cd build
cmake ..
make -j$(nproc)

# Run tests to verify installation
make test

Basic Example

#include "parsers/lc_alpha.hpp"
#include "parsers/porter2stemmer.hpp"
#include "parsers/algebraic_operators.hpp"

using namespace alga;

int main() {
    // Algebraic text processing with mathematical elegance
    auto word1 = make_lc_alpha("hello");
    auto word2 = make_lc_alpha("world");

    if (word1 && word2) {
        // Mathematical composition operators
        auto combined = *word1 * *word2;          // Monoid concatenation: "helloworld"
        auto choice = word1 | word2;              // Choice operator: first valid
        auto emphasis = *word1 ^ 3;               // Repetition: "hellohellohello"
        auto sequence = *word1 >> *word2;         // Sequential: vector["hello", "world"]

        // Porter2 stemming with algebraic composition
        auto stem = make_porter2_stem("running");
        if (stem) {
            auto repeated = *stem ^ 2;             // "runrun"
            std::cout << "Stem: " << std::string{*stem} << std::endl;
        }
    }

    return 0;
}

Fuzzy Parsing Example

#include "parsers/fuzzy_parsers.hpp"
#include "parsers/similarity.hpp"

using namespace alga::fuzzy;
using namespace alga::similarity;

int main() {
    // Accept "hello" with up to 2 typos
    auto greeting = fuzzy_match("hello", 2);
    greeting.parse("helo");    // ✓ Matches (1 edit)
    greeting.parse("heello");  // ✓ Matches (1 edit)
    greeting.parse("world");   // ✗ Fails (too different)

    // Sound-alike name matching
    auto name_parser = phonetic_match("Smith");
    name_parser.parse("Smyth");  // ✓ Matches (same Soundex)
    name_parser.parse("Jones");  // ✗ Fails (different sound)

    // Combined fuzzy: case + phonetic + edit distance
    auto flexible = combined_fuzzy("Python", 2);
    flexible.parse("python");  // ✓ Case-insensitive
    flexible.parse("Pyton");   // ✓ Fuzzy match (1 typo)

    // String similarity metrics
    auto dist = levenshtein_distance("kitten", "sitting");  // → 3
    auto sim = jaro_winkler_similarity("Martha", "Marhta"); // → 0.96

    return 0;
}

🎯 Feature Overview

Tier 1: Core Parsing Features (185 tests)

1. Numeric Parser Family (59 tests)

Type-safe numeric types with algebraic operations:

unsigned_int: Non-negative integers with overflow saturation
signed_int: Signed integers with +/- prefix support
floating_point: Decimal numbers with epsilon comparison
scientific_notation: Exponential format (e.g., “1.5e10”)

#include "parsers/numeric_parsers.hpp"

auto num = make_unsigned_int("42");
auto sci = make_scientific_notation("1.5e10");
auto sum = *num * *make_unsigned_int("10");  // Monoid addition

2. List Combinators (26 tests)

Parse separated lists and sequences:

sepBy: Zero or more elements separated by delimiter
sepBy1: One or more (fails on empty)
sepEndBy: Optional trailing separator
endBy: Separator required after each element

#include "parsers/list_combinators.hpp"

// CSV parsing
auto csv = sepBy(int_parser(), char_parser(','));
auto [pos, nums] = csv.parse("1,2,3");  // vector<int>{1,2,3}

3. Error Reporting (48 tests)

Rich error diagnostics with position tracking:

Line/column/offset position tracking
Expected vs. found reporting
Context extraction with visual markers
ParseResult<T> for detailed error info

#include "parsers/parse_error.hpp"

auto err = ParseError(position, "unexpected character")
    .expect("digit")
    .expect("letter")
    .but_found("!")
    .with_context(tracker.get_context(20, 20));

std::cout << err.format();
// error at line 5, column 12: unexpected character
//   expected: digit, letter
//   found: !

4. Unicode Support (52 tests)

UTF-8 aware text processing:

Full UTF-8 encoding/decoding
Multi-script alphabetic support (Latin, Greek, Cyrillic, Hebrew, Arabic, CJK)
Character counting vs. byte counting
Lowercase conversion for international text

#include "parsers/utf8_alpha.hpp"

auto french = make_utf8_alpha("Café");       // → "café"
auto greek = make_utf8_alpha("Αλφα");        // Greek
auto japanese = make_utf8_alpha("こんにちは"); // Hiragana

auto combined = *french * *greek;  // Concatenation works
size_t chars = french->char_count();  // 4 characters, 5 bytes

Tier 2: Convenience Features (119 tests)

1. Optional Combinators (25 tests)

Always-succeeding parsers:

optional: Makes parser always succeed (returns nullopt on failure)
many: Zero or more repetitions
many1: One or more (fails on zero)
skip: Parse but discard result

#include "parsers/optional_combinator.hpp"

auto opt_sign = optional(char_parser('+'));
auto digits = many1(digit_parser());

2. Count & Range Combinators (35 tests)

Precise repetition control:

count(n, p): Exactly n occurrences
between(m, n, p): Between m and n occurrences
atLeast(m, p): At least m occurrences
atMost(n, p): At most n occurrences

#include "parsers/count_combinators.hpp"

auto three_digits = count(3, digit_parser());
auto two_to_four = between(2, 4, letter_parser());

3. Streaming Parser (26 tests)

Memory-efficient large file processing:

BufferedStreamReader: Buffered I/O with look-ahead
by_line(): Line-by-line parsing
by_chunks(): Fixed-size chunk parsing
from_file(): Convenient file parsing

#include "parsers/streaming_parser.hpp"

auto parser = from_file("data.txt", word_parser());
parser.parse_by_line([](size_t line_num, const std::string& line, auto result) {
    if (result) {
        process(*result);
    }
});

4. Statistics Module (33 tests)

Text analysis and metrics:

Frequency counting with FrequencyCounter<T>
Shannon entropy, normalized entropy
Simpson diversity, Gini coefficient
Type-token ratio, hapax/dis legomena
Comprehensive distribution analysis

#include "parsers/statistics.hpp"

FrequencyCounter<std::string> counter;
counter.add_all(words);

double entropy = shannon_entropy(counter);
double diversity = simpson_diversity(counter);
auto top10 = counter.top_n(10);

Tier 3: Fuzzy Parsing (121 tests) 🆕

1. Phonetic Algorithms (27 tests)

Sound-alike word matching:

Soundex encoding (classic algorithm)
Metaphone encoding (more accurate)
Sound-alike comparison

#include "parsers/phonetic.hpp"

auto code1 = soundex("Smith");   // → "S530"
auto code2 = soundex("Smyth");   // → "S530" (same!)

bool alike = sounds_like_soundex("Robert", "Rupert");  // true

2. Similarity Metrics (39 tests)

String distance and similarity:

Levenshtein distance: Edit distance (insertions, deletions, substitutions)
Damerau-Levenshtein: Includes transpositions (better for typos)
Hamming distance: Position-by-position differences
Jaro similarity: Good for short strings
Jaro-Winkler: Favors common prefixes
LCS: Longest Common Subsequence

#include "parsers/similarity.hpp"

auto dist = levenshtein_distance("kitten", "sitting");  // → 3
auto sim = jaro_winkler_similarity("Martha", "Marhta"); // → 0.96

bool similar = are_similar("hello", "hallo", 0.8);
bool close = within_distance("test", "tset", 2);

3. Fuzzy Parser Combinators (36 tests)

Integrate fuzzy matching into grammars:

fuzzy_match(target, max_dist): Accept words within edit distance
phonetic_match(target): Accept sound-alike words
similarity_match(target, threshold): Accept similar words
fuzzy_choice(candidates, max_dist): Match closest candidate
case_insensitive(target): Ignore case
combined_fuzzy(target, max_dist): All strategies combined

#include "parsers/fuzzy_parsers.hpp"

// Tolerant CSV header parser
auto name_col = fuzzy_match("name", 2);
auto email_col = fuzzy_match("email", 2);
auto header = name_col >> char_parser(',') >> email_col;
header.parse("nme,emai");  // ✓ Accepts typos!

// Multi-strategy command parser
auto exit_cmd = combined_fuzzy("exit", 2);
exit_cmd.parse("Exit");    // ✓ Case-insensitive
exit_cmd.parse("exti");    // ✓ Fuzzy match
exit_cmd.parse("exyt");    // ✓ Phonetic match

4. Text Normalization (19 tests)

Text cleaning and standardization:

Case conversion: to_lowercase(), to_uppercase(), to_title_case()
Whitespace: trim(), normalize_whitespace(), remove_whitespace()
Filtering: keep_alpha(), keep_alnum(), remove_punctuation()
Transformation: to_slug(), replace_all(), collapse_repeated()
Comprehensive: normalize_text() (lowercase + whitespace normalization)

#include "parsers/normalization.hpp"

auto clean = normalize_text("  HELLO   WORLD  ");  // → "hello world"
auto slug = to_slug("Hello World!");                // → "hello-world"
auto title = to_title_case("hello world");          // → "Hello World"

🏗️ Architecture

include/parsers/
├── Core Types (Tier 0)
│   ├── lc_alpha.hpp              # Free monoid of lowercase strings
│   ├── porter2stemmer.hpp         # Porter2 stemming algorithm
│   ├── ngram_stemmer.hpp          # N-gram generation
│   ├── word_parser.hpp            # Word tokenization
│   └── fsm_string_rewriter.hpp    # FSM-based transformations
│
├── Tier 1: Core Parsing Features
│   ├── numeric_parsers.hpp        # unsigned_int, signed_int, floating_point, scientific_notation
│   ├── list_combinators.hpp       # sepBy, sepBy1, sepEndBy, endBy
│   ├── parse_error.hpp            # Rich error reporting with position tracking
│   └── utf8_alpha.hpp             # UTF-8 text processing, multi-script support
│
├── Tier 2: Convenience Features
│   ├── optional_combinator.hpp    # optional, many, many1, skip
│   ├── count_combinators.hpp      # count, between, atLeast, atMost
│   ├── streaming_parser.hpp       # BufferedStreamReader, by_line, by_chunks
│   └── statistics.hpp             # FrequencyCounter, entropy, diversity metrics
│
├── Tier 3: Fuzzy Parsing
│   ├── phonetic.hpp               # Soundex, Metaphone encoding
│   ├── similarity.hpp             # Levenshtein, Jaro-Winkler, LCS
│   ├── fuzzy_parsers.hpp          # fuzzy_match, phonetic_match, combined_fuzzy
│   └── normalization.hpp          # Text cleaning and standardization
│
└── Algebraic Operations
    ├── algebraic_operators.hpp    # |, ^, %, >> operators
    ├── combinatorial_parser_fixed.hpp  # Parser combinator framework
    └── monadic_combinators.hpp    # >>=, fmap

test/                              # Comprehensive test suite (425 tests)
├── Tier 1 Tests (185 tests)
│   ├── numeric_parsers_test.cpp   # 59 tests
│   ├── list_combinators_test.cpp  # 26 tests
│   ├── parse_error_test.cpp       # 48 tests
│   └── utf8_alpha_test.cpp        # 52 tests
│
├── Tier 2 Tests (119 tests)
│   ├── optional_combinator_test.cpp  # 25 tests
│   ├── count_combinators_test.cpp    # 35 tests
│   ├── streaming_parser_test.cpp     # 26 tests
│   └── statistics_test.cpp           # 33 tests
│
└── Tier 3 Tests (121 tests)
    ├── phonetic_test.cpp          # 27 tests
    ├── similarity_test.cpp        # 39 tests
    ├── fuzzy_parsers_test.cpp     # 36 tests
    └── normalization_test.cpp     # 19 tests

🔬 Mathematical Foundations

Algebraic Structures

The library implements several key mathematical concepts:

Free Monoids: The set of lowercase alphabetic strings forms a free monoid under concatenation
- Identity: Empty string
- Associativity: (a · b) · c = a · (b · c)
- Closure: Concatenation always produces valid strings

Functors: Structure-preserving transformations between categories

// fmap preserves composition: fmap(g ∘ f) = fmap(g) ∘ fmap(f)
auto parser = make_word_parser();
auto length_parser = fmap(parser, [](const std::string& s) { return s.length(); });

Monadic Composition: Error-handling composition with automatic failure propagation

auto result = make_lc_alpha("running")
    >>= [](const lc_alpha& word) { return stemmer(word); }
    >>= [](const porter2_stem& stem) { return make_unigram(stem); };

Verified Properties

All algebraic structures satisfy their mathematical laws:

// Monoid laws (automatically tested)
ASSERT_EQ((a * b) * c, a * (b * c));           // Associativity
ASSERT_EQ(empty * a, a);                        // Left identity
ASSERT_EQ(a * empty, a);                        // Right identity

// Functor laws (automatically tested)
ASSERT_EQ(fmap(id, parser), parser);            // Identity preservation
ASSERT_EQ(fmap(compose(g,f), parser), fmap(g, fmap(f, parser))); // Composition

📊 Testing and Quality

Comprehensive Test Suite

The library includes 425 tests across all tiers:

# Run all tests
make test

# Individual test suites
./test/numeric_parsers_test      # 59 tests
./test/list_combinators_test     # 26 tests
./test/parse_error_test          # 48 tests
./test/utf8_alpha_test           # 52 tests
./test/optional_combinator_test  # 25 tests
./test/count_combinators_test    # 35 tests
./test/streaming_parser_test     # 26 tests
./test/statistics_test           # 33 tests
./test/phonetic_test             # 27 tests
./test/similarity_test           # 39 tests
./test/fuzzy_parsers_test        # 36 tests
./test/normalization_test        # 19 tests

# Test coverage analysis
make coverage  # Requires CMAKE_BUILD_TYPE=Coverage

Property-Based Testing

Mathematical properties are automatically verified:

// Test algebraic laws for all monoid types
TEST_F(AlgebraicProperties, MonoidLaws) {
    auto a = *make_lc_alpha("hello");
    auto b = *make_lc_alpha("beautiful");
    auto c = *make_lc_alpha("world");

    EXPECT_EQ((a * b) * c, a * (b * c));  // Associativity
    EXPECT_EQ(lc_alpha{} * a, a);         // Left identity
    EXPECT_EQ(a * lc_alpha{}, a);         // Right identity
}

⚡ Performance

Zero-Cost Abstractions: Template metaprogramming eliminates runtime overhead
Memory Efficient: RAII and move semantics, no unnecessary copying
Parallelizable: Value semantics enable safe concurrent processing
Optimized Algorithms: Tuned implementations of classic algorithms
Streaming Support: Process large files without loading into memory

Benchmarks show comparable performance to hand-optimized C code while providing mathematical guarantees and composability.

🤝 Contributing

We welcome contributions that maintain the mathematical rigor and performance characteristics:

Mathematical Correctness: All algebraic structures must satisfy their laws
Type Safety: Use C++20 concepts and static assertions to prevent invalid usage
Performance: Maintain zero-cost abstraction principles
Testing: Include comprehensive tests with property-based verification
Documentation: Provide clear mathematical explanations

See CONTRIBUTING.md for detailed guidelines.

📄 License

This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.

🙏 Acknowledgments

Inspired by Haskell’s parser combinator libraries and category theory
Built using modern C++20 best practices and mathematical foundations
Implements classical algorithms (Porter2, Soundex) within a principled algebraic framework
Fuzzy parsing techniques adapted from information retrieval research

Mathematical parsing meets practical performance. This library demonstrates that rigorous mathematical foundations enhance rather than compromise real-world usability and performance—now with fuzzy matching for noisy real-world data!