Skip to main content

Alga: Algebraic Text Processing with Fuzzy Matching

Alga is a C++20 header-only library that transforms text manipulation from imperative string processing into declarative mathematical expressions. Built on rigorous algebraic foundations with monoids, functors, and extended operators, it provides compositional parsing with built-in fuzzy matching.

The Core Idea

Instead of treating strings as mutable buffers, Alga treats text as elements of algebraic structures:

#include "parsers/lc_alpha.hpp"
#include "parsers/porter2stemmer.hpp"
#include "parsers/algebraic_operators.hpp"

using namespace alga;

auto word1 = make_lc_alpha("hello");
auto word2 = make_lc_alpha("world");

if (word1 && word2) {
    // Monoid concatenation
    auto combined = *word1 * *word2;  // "helloworld"

    // Repetition
    auto emphasis = *word1 ^ 3;       // "hellohellohello"

    // Sequential composition (produces vector)
    auto sequence = *word1 >> *word2; // vector["hello", "world"]

    // Porter2 stemming with algebraic composition
    auto stem = make_porter2_stem("running");
    if (stem) {
        auto repeated = *stem ^ 2;    // "runrun"
    }
}

Algebraic Operators

Alga provides mathematically-grounded operators:

OperatorMeaningExample
*Monoid concatenation*word1 * *word2
|Choice (first valid)word1 | word2
^Repetition (n times)*word ^ 3
>>Sequential (to vector)*word1 >> *word2

List Combinators

Parse separated lists and sequences:

#include "parsers/list_combinators.hpp"

// CSV parsing
auto csv = sepBy(int_parser(), char_parser(','));
auto [pos, nums] = csv.parse("1,2,3");  // vector<int>{1, 2, 3}

// One or more items (fails on empty)
auto csv1 = sepBy1(word_parser(), char_parser(','));

// Optional trailing separator
auto items = sepEndBy(word_parser(), char_parser(';'));

Fuzzy Matching

Parse noisy, imperfect input with built-in fuzzy matching:

#include "parsers/fuzzy_parsers.hpp"
#include "parsers/similarity.hpp"

using namespace alga::fuzzy;
using namespace alga::similarity;

// Accept "hello" with up to 2 typos
auto greeting = fuzzy_match("hello", 2);
greeting.parse("helo");    // ✓ Matches (1 edit)
greeting.parse("heello");  // ✓ Matches (1 edit)
greeting.parse("world");   // ✗ Fails (too different)

// Sound-alike name matching
auto name_parser = phonetic_match("Smith");
name_parser.parse("Smyth");  // ✓ Matches (same Soundex)

// Combined fuzzy: case + phonetic + edit distance
auto flexible = combined_fuzzy("Python", 2);
flexible.parse("python");  // ✓ Case-insensitive
flexible.parse("Pyton");   // ✓ Fuzzy match (1 typo)

// String similarity metrics
auto dist = levenshtein_distance("kitten", "sitting");  // → 3
auto sim = jaro_winkler_similarity("Martha", "Marhta"); // → 0.96

Phonetic Algorithms

Sound-alike word matching:

#include "parsers/phonetic.hpp"

auto code1 = soundex("Smith");   // → "S530"
auto code2 = soundex("Smyth");   // → "S530" (same!)

bool alike = sounds_like_soundex("Robert", "Rupert");  // true

Unicode Support

Full UTF-8 with multi-script alphabetic parsing:

#include "parsers/utf8_alpha.hpp"

auto french = make_utf8_alpha("Café");       // → "café"
auto greek = make_utf8_alpha("Αλφα");        // Greek
auto japanese = make_utf8_alpha("こんにちは"); // Hiragana

auto combined = *french * *greek;  // Concatenation works
size_t chars = french->char_count();  // 4 characters, 5 bytes

Numeric Parsers

Type-safe numeric types with algebraic operations:

#include "parsers/numeric_parsers.hpp"

auto num = make_unsigned_int("42");
auto sci = make_scientific_notation("1.5e10");
auto sum = *num * *make_unsigned_int("10");  // Monoid addition

Error Reporting

Rich error diagnostics with position tracking:

#include "parsers/parse_error.hpp"

auto err = ParseError(position, "unexpected character")
    .expect("digit")
    .expect("letter")
    .but_found("!")
    .with_context(tracker.get_context(20, 20));

std::cout << err.format();
// error at line 5, column 12: unexpected character
//   expected: digit, letter
//   found: !

Streaming Support

Memory-efficient large file processing:

#include "parsers/streaming_parser.hpp"

auto parser = from_file("data.txt", word_parser());
parser.parse_by_line([](size_t line_num, const std::string& line, auto result) {
    if (result) {
        process(*result);
    }
});

Statistical Analysis

Text analysis and metrics:

#include "parsers/statistics.hpp"

FrequencyCounter<std::string> counter;
counter.add_all(words);

double entropy = shannon_entropy(counter);
double diversity = simpson_diversity(counter);
auto top10 = counter.top_n(10);

Text Normalization

Text cleaning and standardization:

#include "parsers/normalization.hpp"

auto clean = normalize_text("  HELLO   WORLD  ");  // → "hello world"
auto slug = to_slug("Hello World!");                // → "hello-world"
auto title = to_title_case("hello world");          // → "Hello World"

Mathematical Foundations

Alga implements verified algebraic laws:

// Monoid laws (automatically tested)
ASSERT_EQ((a * b) * c, a * (b * c));  // Associativity
ASSERT_EQ(empty * a, a);               // Left identity
ASSERT_EQ(a * empty, a);               // Right identity

// Functor laws (automatically tested)
ASSERT_EQ(fmap(id, parser), parser);   // Identity preservation

Installation

Header-only—just include the headers you need:

#include "parsers/lc_alpha.hpp"
#include "parsers/fuzzy_parsers.hpp"
// etc.

Or with CMake:

add_subdirectory(alga)
target_link_libraries(your_target PRIVATE alga::alga)

Requirements: C++20 compatible compiler (GCC 10+, Clang 10+, MSVC 2019+).

Resources


Alga: When parsing becomes algebra, complexity becomes composition.

Discussion