Alga is a C++20 header-only library that treats text manipulation as algebra instead of imperative string hacking. It is built on monoids, functors, and extended operators, and it gives you compositional parsing with built-in fuzzy matching.
The Core Idea
Instead of treating strings as mutable buffers, Alga treats text as elements of algebraic structures:
#include "parsers/lc_alpha.hpp"
#include "parsers/porter2stemmer.hpp"
#include "parsers/algebraic_operators.hpp"
using namespace alga;
auto word1 = make_lc_alpha("hello");
auto word2 = make_lc_alpha("world");
if (word1 && word2) {
// Monoid concatenation
auto combined = *word1 * *word2; // "helloworld"
// Repetition
auto emphasis = *word1 ^ 3; // "hellohellohello"
// Sequential composition (produces vector)
auto sequence = *word1 >> *word2; // vector["hello", "world"]
// Porter2 stemming with algebraic composition
auto stem = make_porter2_stem("running");
if (stem) {
auto repeated = *stem ^ 2; // "runrun"
}
}
The operators are not arbitrary overloads. They follow actual algebraic laws (associativity, identity, etc.), which means you can reason about compositions the same way you reason about mathematical expressions.
Algebraic Operators
| Operator | Meaning | Example |
|---|---|---|
* | Monoid concatenation | *word1 * *word2 |
| | Choice (first valid) | word1 | word2 |
^ | Repetition (n times) | *word ^ 3 |
>> | Sequential (to vector) | *word1 >> *word2 |
List Combinators
Parse separated lists and sequences:
#include "parsers/list_combinators.hpp"
// CSV parsing
auto csv = sepBy(int_parser(), char_parser(','));
auto [pos, nums] = csv.parse("1,2,3"); // vector<int>{1, 2, 3}
// One or more items (fails on empty)
auto csv1 = sepBy1(word_parser(), char_parser(','));
// Optional trailing separator
auto items = sepEndBy(word_parser(), char_parser(';'));
If you have used Haskell’s parsec or Megaparsec, the combinator style will feel familiar. The difference is that Alga’s combinators carry algebraic guarantees through the type system.
Fuzzy Matching
Parse noisy, imperfect input with built-in fuzzy matching:
#include "parsers/fuzzy_parsers.hpp"
#include "parsers/similarity.hpp"
using namespace alga::fuzzy;
using namespace alga::similarity;
// Accept "hello" with up to 2 typos
auto greeting = fuzzy_match("hello", 2);
greeting.parse("helo"); // Matches (1 edit)
greeting.parse("heello"); // Matches (1 edit)
greeting.parse("world"); // Fails (too different)
// Sound-alike name matching
auto name_parser = phonetic_match("Smith");
name_parser.parse("Smyth"); // Matches (same Soundex)
// Combined fuzzy: case + phonetic + edit distance
auto flexible = combined_fuzzy("Python", 2);
flexible.parse("python"); // Case-insensitive
flexible.parse("Pyton"); // Fuzzy match (1 typo)
// String similarity metrics
auto dist = levenshtein_distance("kitten", "sitting"); // 3
auto sim = jaro_winkler_similarity("Martha", "Marhta"); // 0.96
This is the part I find most useful in practice. Real-world text is messy, and having fuzzy matching baked into the parser combinator framework means you do not have to bolt it on as an afterthought.
Phonetic Algorithms
Sound-alike word matching:
#include "parsers/phonetic.hpp"
auto code1 = soundex("Smith"); // "S530"
auto code2 = soundex("Smyth"); // "S530" (same!)
bool alike = sounds_like_soundex("Robert", "Rupert"); // true
Unicode Support
Full UTF-8 with multi-script alphabetic parsing:
#include "parsers/utf8_alpha.hpp"
auto french = make_utf8_alpha("Cafe"); // "cafe"
auto greek = make_utf8_alpha("Αλφα"); // Greek
auto japanese = make_utf8_alpha("こんにちは"); // Hiragana
auto combined = *french * *greek; // Concatenation works
size_t chars = french->char_count(); // 4 characters, 5 bytes
Numeric Parsers
Type-safe numeric types with algebraic operations:
#include "parsers/numeric_parsers.hpp"
auto num = make_unsigned_int("42");
auto sci = make_scientific_notation("1.5e10");
auto sum = *num * *make_unsigned_int("10"); // Monoid addition
Error Reporting
Rich error diagnostics with position tracking:
#include "parsers/parse_error.hpp"
auto err = ParseError(position, "unexpected character")
.expect("digit")
.expect("letter")
.but_found("!")
.with_context(tracker.get_context(20, 20));
std::cout << err.format();
// error at line 5, column 12: unexpected character
// expected: digit, letter
// found: !
Streaming Support
Memory-efficient large file processing:
#include "parsers/streaming_parser.hpp"
auto parser = from_file("data.txt", word_parser());
parser.parse_by_line([](size_t line_num, const std::string& line, auto result) {
if (result) {
process(*result);
}
});
Statistical Analysis
Text analysis and metrics:
#include "parsers/statistics.hpp"
FrequencyCounter<std::string> counter;
counter.add_all(words);
double entropy = shannon_entropy(counter);
double diversity = simpson_diversity(counter);
auto top10 = counter.top_n(10);
Text Normalization
Text cleaning and standardization:
#include "parsers/normalization.hpp"
auto clean = normalize_text(" HELLO WORLD "); // "hello world"
auto slug = to_slug("Hello World!"); // "hello-world"
auto title = to_title_case("hello world"); // "Hello World"
Mathematical Foundations
Alga implements verified algebraic laws:
// Monoid laws (automatically tested)
ASSERT_EQ((a * b) * c, a * (b * c)); // Associativity
ASSERT_EQ(empty * a, a); // Left identity
ASSERT_EQ(a * empty, a); // Right identity
// Functor laws (automatically tested)
ASSERT_EQ(fmap(id, parser), parser); // Identity preservation
The test suite verifies these laws hold, so you can trust the algebraic reasoning.
Installation
Header-only. Just include the headers you need:
#include "parsers/lc_alpha.hpp"
#include "parsers/fuzzy_parsers.hpp"
// etc.
Or with CMake:
add_subdirectory(alga)
target_link_libraries(your_target PRIVATE alga::alga)
Requirements: C++20 compatible compiler (GCC 10+, Clang 10+, MSVC 2019+).
Resources
- GitHub: github.com/queelius/alga
- Tests: 425 tests, 100% passing
Discussion