active library

maph

Space-efficient approximate mappings using perfect hash functions. Supports arbitrary function approximation (X→Y) with configurable storage (8/16/32/64-bit) and accuracy trade-offs.

Started 2019 C++

Resources & Distribution

Package Registries

Publications

maph: Maps Based on Perfect Hashing for Sub-Microsecond Key-Value Storage

(2025)

maph - Memory-Mapped Approximate Perfect Hash

C++17 License: MIT

A high-performance, memory-mapped database and generalized framework for space-efficient approximate data structures with O(1) lookups. Built on perfect hash functions, maph provides sub-microsecond access to arbitrary mappings (f: X → Y) with configurable storage and custom decoders.

Key Features

  • O(1) Lookups: Guaranteed constant-time operations with perfect hash optimization
  • Memory-Mapped Storage: Zero-copy access via mmap for minimal memory overhead
  • Lock-Free Operations: Atomic operations for thread-safe concurrent access
  • Generalized Framework: Support for arbitrary mappings X → Y, not just membership
  • JSON Database: Built-in daemon for JSON key-value storage with REST API
  • Configurable Storage: 8/16/32/64-bit storage options for space/accuracy trade-offs
  • Custom Decoders: Extensible decoder system for specialized applications
  • SIMD Optimized: AVX2 acceleration for batch operations

Performance

OperationThroughputLatencyNotes
GET10M ops/sec<100nsO(1) with perfect hash
SET8M ops/sec<150nsLock-free atomic updates
Batch GET50M ops/sec-SIMD optimized
Batch SET40M ops/sec-Parallel processing
Scan100M items/sec-Sequential memory access

Quick Start

C++ Library

#include <maph.hpp>

// Create a new maph store
auto store = maph::Maph::create("mystore.maph", 1000000);

// Store JSON key-value pairs
store->set(R"({"user": "alice"})", R"({"score": 95})");

// Retrieve with O(1) lookup
auto value = store->get(R"({"user": "alice"})");
if (value) {
    std::cout << *value << std::endl;  // {"score": 95}
}

// Batch operations for high throughput
std::vector<std::pair<JsonView, JsonView>> batch = {
    {R"({"id": 1})", R"({"name": "item1"})"},
    {R"({"id": 2})", R"({"name": "item2"})"}
};
store->mset(batch);

JSON Database Server

# Start the maph daemon
./maph_daemon --port 8080 --threads 4

# CLI operations
./maph_cli set mystore '{"key": "user:1"}' '{"name": "Alice", "age": 30}'
./maph_cli get mystore '{"key": "user:1"}'
./maph_cli scan mystore --limit 100

# REST API
curl -X POST http://localhost:8080/stores -d '{"name": "products", "slots": 1000000}'
curl -X PUT 'http://localhost:8080/stores/products/keys/"sku:12345"' \
  -d '{"name": "Laptop", "price": 999}'
curl 'http://localhost:8080/stores/products/keys/"sku:12345"'

# Optimize for perfect hash (O(1) guaranteed)
curl -X POST http://localhost:8080/stores/products/optimize

Architecture

The system is built in layers:

  1. Core Library (include/maph.hpp): Memory-mapped storage with atomic operations
  2. Approximate Maps (include/maph/approximate_map.hpp): Generalized X→Y mappings
  3. Decoders (include/maph/decoders.hpp): Configurable output transformations
  4. Applications:
    • CLI tool (maph_cli): Command-line interface
    • Daemon (maph_daemon): Standalone JSON database server
    • REST API (integrations/rest_api): HTTP interface with web UI

Storage Layout

File Structure:
┌────────────────┐
│ Header (512B)  │  Magic, version, slot counts, generation
├────────────────┤
│ Slot 0 (512B) │  Hash (32b) + Version (32b) + Data (504B)
├────────────────┤
│ Slot 1 (512B) │  
├────────────────┤
│      ...       │
└────────────────┘

Building

# Clone the repository
git clone https://github.com/yourusername/maph.git
cd maph

# Build everything
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DBUILD_TESTS=ON \
         -DBUILD_REST_API=ON \
         -DBUILD_EXAMPLES=ON
make -j$(nproc)

# Run tests
ctest --verbose

# Install
sudo make install

CMake Options

OptionDefaultDescription
BUILD_TESTSOFFBuild test suite
BUILD_EXAMPLESOFFBuild example programs
BUILD_REST_APIOFFBuild REST API server
BUILD_DOCSOFFGenerate documentation

Advanced Usage

Custom Decoders

Create specialized mappings for your use case:

// Threshold filter - maps values to boolean based on threshold
template <typename StorageType>
struct ThresholdDecoder {
    using output_type = bool;
    float threshold;
    
    bool operator()(StorageType value, StorageType max_val) const {
        return (float(value) / max_val) > threshold;
    }
};

// Quantization decoder - maps to discrete levels
template <typename StorageType>
struct QuantizeDecoder {
    using output_type = int;
    int levels;
    
    int operator()(StorageType value, StorageType max_val) const {
        return (value * levels) / (max_val + 1);
    }
};

Parallel Operations

// Parallel batch processing with custom thread count
store->parallel_mset(large_batch, 8);  // Use 8 threads

// Parallel scan with visitor pattern
std::atomic<size_t> count{0};
store->parallel_scan([&count](uint64_t idx, uint32_t hash, JsonView value) {
    if (value.size() > 100) {
        count++;
    }
}, 4);  // Use 4 threads

Durability Management

// Configure background sync for durability
auto manager = store->start_durability_manager(
    std::chrono::seconds(5),  // Sync every 5 seconds
    1000                       // Or after 1000 operations
);

// Manual sync
store->sync();

Storage Trade-offs

Storage TypeBytes/ElementFalse Positive RateUse Case
uint8_t10.39%Large datasets, error-tolerant
uint16_t20.0015%Balanced accuracy/space
uint32_t4~0%High accuracy required
uint64_t80%Perfect accuracy, cryptographic

Comparison with Other Systems

SystemLookupSpacePersistenceConcurrentUse Case
maphO(1)OptimalmmapLock-freeGeneral K/V, high-perf
RedisO(1) avg10x overheadAOF/RDBSingle-threadCache, sessions
RocksDBO(log n)CompressedSST filesMulti-threadLarge persistent data
MemcachedO(1) avgIn-memoryNoneMulti-threadSimple cache
Bloom filterO(1)OptimalCustomRead-onlyMembership only

Use Cases

  • High-Performance Cache: Sub-microsecond lookups for hot data
  • Feature Stores: ML feature serving with guaranteed latency
  • Session Storage: Web session data with instant access
  • Approximate Filters: Bloom filter alternative with configurable accuracy
  • Rate Limiting: Token bucket implementation with O(1) checks
  • Deduplication: Fast duplicate detection in data streams
  • Graph Databases: Adjacency list storage with perfect hashing
  • Time-Series Index: Fast timestamp to data mapping

API Documentation

Core API

// Create/Open operations
static std::unique_ptr<Maph> create(path, total_slots, static_slots = 0);
static std::unique_ptr<Maph> open(path, readonly = false);

// Basic operations
std::optional<JsonView> get(JsonView key);
bool set(JsonView key, JsonView value);
bool remove(JsonView key);
bool exists(JsonView key);

// Batch operations
void mget(keys, callback);
size_t mset(key_value_pairs);

// Parallel operations
void parallel_mget(keys, callback, threads = 0);
size_t parallel_mset(pairs, threads = 0);
void parallel_scan(visitor, threads = 0);

// Utilities
Stats stats();
void sync();
void close();

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Key areas for contribution:

  • Perfect hash function implementations
  • Additional decoders for specialized use cases
  • Language bindings (Python, Rust, Go, Java)
  • Performance optimizations
  • Documentation and examples

Publications

The theoretical foundation for this work:

@inproceedings{rdphfilter2024,
  title={Rate-Distorted Perfect Hash Filters},
  author={...},
  booktitle={...},
  year={2024}
}

License

MIT License - see LICENSE for details.

Acknowledgments

Built on research in perfect hash functions, approximate data structures, and space-efficient algorithms. Special thanks to the authors of CHD, BBHash, and other minimal perfect hash algorithms that inspired this work.

Related Resources

Explore related blog posts, projects, and publications

Publications

maph: Maps Based on Perfect Hashing for Sub-Microsecond Key-Value Storage

(2025)

Discussion