maph (Map based on Perfect Hash) is a high-performance key-value storage system that achieves sub-microsecond latency through a combination of memory-mapped I/O, approximate perfect hashing, and lock-free atomic operations. Unlike traditional key-value stores, which suffer from kernel/user space transitions and locking overhead, maph leverages direct memory access via mmap(2) to eliminate system call overhead on the critical path.
The Performance Problem
Modern distributed systems depend on key-value stores for caching, session management, and metadata storage. However, existing solutions face fundamental limitations when microsecond-level latency becomes critical:
Three fundamental bottlenecks:
- Kernel overhead: System calls require expensive context switches (100-500ns per operation)
- Memory copying: Traditional stores copy data multiple times—from kernel buffers to user space, between internal structures, for serialization
- Synchronization overhead: Lock-based concurrency creates contention and unpredictable latency spikes
These limitations are particularly acute in:
- High-frequency trading: Every nanosecond translates to competitive advantage
- ML feature stores: Inference pipelines require thousands of features within tight latency budgets
- Real-time gaming and IoT: Predictable response times under concurrent access patterns
Core Innovation: Three Key Techniques
1. Zero-Copy Architecture via mmap
By memory-mapping the entire database file, maph eliminates kernel/user space transitions:
// Traditional approach: Multiple copies
std::vector<uint8_t> data = read_from_disk(key); // Kernel → user copy
Value v = deserialize(data); // Another copy
process(v); // Yet another copy
// maph approach: Direct memory access
auto store = maph::Maph::open("mystore.maph");
auto value = store->get(key); // Zero copies! Direct pointer into mmap region
Key insight: The CPU’s memory management unit (MMU) handles address translation transparently, providing the illusion of in-memory access with automatic persistence through the kernel page cache.
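For intuition, the underlying mechanism looks roughly like this (a minimal sketch of plain mmap(2) usage, not maph's actual open path; error handling is elided):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a store file once; afterwards, reads are plain memory loads with
// no syscalls on the hot path. Not maph's real implementation.
void* map_store(const char* path, size_t& size_out) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    size_out = st.st_size;
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);   // the mapping remains valid after close
    return base; // dereferences go straight to the page cache
}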
2. Hybrid Hash Architecture
maph uses a hybrid hasher that dynamically combines perfect hashing with standard hashing:
When optimized (via the /optimize REST API endpoint or optimize() C++ method), maph constructs a minimal perfect hash function for all keys present at optimization time:
For known keys: The perfect hasher provides O(1) worst-case lookup with exactly one memory access—no probing required.
For new keys (inserted after optimization): The hybrid hasher automatically falls back to standard FNV-1a hashing with linear probing:
\[ \text{slot}_i = (h_s(k) + i) \bmod n, \quad i \in [0, \text{MAX\_PROBES} - 1] \]
where \(h_s\) is the standard hash and \(n\) is the slot count. The maximum probe distance (default 10) ensures bounded worst-case latency.
Key insight: The same slot array serves both purposes. The hybrid hasher checks if a key was in the original optimized set (is_perfect_for(key)), and uses the appropriate hash function accordingly. No static partitioning is needed.
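A minimal sketch of that dispatch, built around the documented is_perfect_for() check; all other names (perfect_hash_, standard_hash_, num_slots_, slot_matches_or_empty()) are assumptions, not maph's actual internals:

// Sketch of hybrid lookup dispatch; only is_perfect_for() is documented
std::optional<uint64_t> slot_for(std::string_view key) const {
    if (is_perfect_for(key)) {
        return perfect_hash_(key);      // exactly one slot, no probing
    }
    uint64_t h = standard_hash_(key);   // FNV-1a fallback for new keys
    for (uint32_t i = 0; i < MAX_PROBES; ++i) {
        uint64_t slot = (h + i) % num_slots_;
        if (slot_matches_or_empty(slot, key)) return slot;
    }
    return std::nullopt;                // bounded probing exhausted
}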
3. Lock-Free Atomic Operations
All operations use compare-and-swap (CAS) primitives and atomic versioning:
Atomic Slot Versioning: Each slot maintains a 64-bit atomic value combining key hash (high 32 bits) and version number (low 32 bits):
// Lock-free read (seqlock-style version check)
uint64_t v1 = slot.hash_version.load(std::memory_order_acquire);
uint32_t stored_hash = static_cast<uint32_t>(v1 >> 32);
if (stored_hash != hash_of(key)) return std::nullopt; // Key not found
// Copy data, then re-read the version to detect a concurrent write
Value data = copy(slot.data);
uint64_t v2 = slot.hash_version.load(std::memory_order_acquire);
if (v1 == v2 && (v1 & 1) == 0) { // Unchanged and even = consistent snapshot
    return data;
}
// Version changed (or odd = writer active): retry
Writers use a seqlock-style protocol:
- Increment version to odd (marking slot as updating)
- Write new data
- Increment version to even (marking update complete)
Readers either see old value or new value completely—never partial updates.
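A sketch of that writer protocol, using the same assumed field names as the reader snippet above (not verbatim maph source; copy_into() is a hypothetical helper):

// Seqlock-style writer: odd version = update in progress
void write_slot(Slot& slot, uint32_t key_hash, const Value& new_data) {
    uint32_t ver = static_cast<uint32_t>(
        slot.hash_version.load(std::memory_order_relaxed));
    // Step 1: odd version tells readers an update is in progress
    slot.hash_version.store((uint64_t(key_hash) << 32) | (ver + 1),
                            std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release); // order data write
    // Step 2: write the new value while the version is odd
    copy_into(slot.data, new_data);
    // Step 3: even version publishes the completed update
    slot.hash_version.store((uint64_t(key_hash) << 32) | (ver + 2),
                            std::memory_order_release);
}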
Performance Results
Single-Threaded Latency
| Operation | p50 | p90 | p99 | p99.9 | p99.99 |
|---|---|---|---|---|---|
| Random GET | 351ns | 601ns | 972ns | 1,342ns | 7,844ns |
| Sequential GET | 221ns | 301ns | 752ns | 922ns | 8,736ns |
Sub-microsecond median latency with predictable tail latency under 10µs.
Multi-Threaded Scaling
| Threads | Throughput (M ops/sec) | Avg Latency (ns) | Speedup | Efficiency |
|---|---|---|---|---|
| 1 | 2.69 | 347 | 1.00× | 100% |
| 2 | 5.71 | 325 | 2.12× | 106% |
| 4 | 11.90 | 309 | 4.42× | 110% |
| 8 | 17.24 | 375 | 6.40× | 80% |
| 16 | 98.00 | - | 36.43× | - |
Scaling is superlinear up to 4 threads (110% efficiency, likely a cache-residency effect), demonstrating the effectiveness of the lock-free design.
Comparison with Other Systems
| System | Throughput | Relative to maph |
|---|---|---|
| maph | 10M ops/sec | 1.00× (baseline) |
| Redis | 0.8M ops/sec | 12× slower |
| RocksDB | 0.11M ops/sec | 87× slower |
| Memcached | 1.6M ops/sec | 6× slower |
Memory Layout and Slot Design
Fixed-Size Slots
Each slot is 512 bytes, aligned to 64-byte cache lines:
Slot Structure (512 bytes):
┌──────────────────────────────┐
│ Metadata (16 bytes)          │
│ - hash_version (8B atomic)   │  upper 32 bits: hash identifier
│                              │  lower 32 bits: version counter
│ - size (4B)                  │
│ - reserved (4B)              │
├──────────────────────────────┤
│ Data (496 bytes)             │  actual value storage
└──────────────────────────────┘
Why 512 bytes?
- Cache alignment: 8× 64-byte cache lines prevents false sharing
- Typical JSON docs: 100-400 bytes with acceptable overhead
- Direct computation: slot_addr = base_addr + slot_index × 512
- SIMD friendly: Enables vectorized operations
The fixed size is configurable at compile time:
// Small values
mmap_storage<128> counters;
// Default
mmap_storage<512> json_store;
// Large values
mmap_storage<4096> documents;
Architecture Components
File Structure:
┌────────────────┐
│ Header (512B) │ Magic, version, slot counts, generation
├────────────────┤
│ Slot 0 (512B)  │ Metadata (16B: hash, version, size) + Data (496B)
├────────────────┤
│ Slot 1 (512B) │
├────────────────┤
│ ... │
└────────────────┘
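A plausible C++ view of that header, with field names inferred from the diagram above (the real layout may differ):

#include <cstdint>

// Hypothetical header layout matching "Magic, version, slot counts,
// generation"; padded to one 512B block
struct alignas(512) Header {
    uint32_t magic;        // file-format identifier
    uint32_t version;      // format version
    uint64_t total_slots;  // capacity in slots
    uint64_t used_slots;   // occupied slots
    uint64_t generation;   // bumped when the store is re-optimized
};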
SIMD Batch Operations
AVX2 instructions enable parallel processing of 8 keys simultaneously:
void compute_batch_avx2(const char** keys, uint32_t* hashes, size_t count) {
    __m256i fnv_prime  = _mm256_set1_epi32(16777619u);   // FNV prime, 8 lanes
    __m256i fnv_offset = _mm256_set1_epi32(2166136261u); // FNV offset basis
    for (size_t i = 0; i + 8 <= count; i += 8) {
        __m256i h = fnv_offset;
        // Helpers elided here: find_min_length returns the shortest of the
        // 8 key lengths; gather_chars loads byte j of each key into a lane
        size_t min_len = find_min_length(keys + i, 8);
        for (size_t j = 0; j < min_len; ++j) {
            __m256i chars = gather_chars(keys + i, j);
            h = _mm256_xor_si256(h, chars);        // h ^= byte (per lane)
            h = _mm256_mullo_epi32(h, fnv_prime);  // h *= prime (per lane)
        }
        _mm256_storeu_si256((__m256i*)(hashes + i), h); // store 8 hashes
    }
    // Any tail of count % 8 keys falls back to the scalar FNV-1a path
}
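A hypothetical call site (key strings invented for illustration):

const char* keys[8] = {"user:1", "user:2", "user:3", "user:4",
                       "user:5", "user:6", "user:7", "user:8"};
uint32_t hashes[8];
compute_batch_avx2(keys, hashes, 8); // fills hashes[0..7] in one pass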
Result: ~5× throughput improvement for batch operations, achieving 50M keys/sec for parallel lookups.
Hybrid Deployment Architecture
The mmap-based design enables a powerful pattern:
Pattern 1: Single Writer, Multiple Readers (Recommended)
+----------------------------------+
| Shared mmap Files (data/*.maph) |
+----------+-------------+---------+
| |
(write)| |(read-only)
+--------v---+ +------v----------+
| REST API | | C++ App |
| Server | | (direct mmap) |
+------------+ +-----------------+
|
(HTTP)|
+-----v----------+
| Remote Clients |
+----------------+
Direct mmap read access (C++ app):
// Local C++ application: Direct read-only access (~300ns latency)
auto db = maph::Maph::open("/var/lib/maph/data/cache.maph",
true); // readonly
auto value = db->get("hot_key"); // Zero IPC overhead!
Benefits:
- Zero-IPC overhead: Local reads bypass HTTP entirely (300ns vs 1-2ms)
- 5,000× speedup: Direct mmap reads are ~5,000× faster than HTTP requests
- Safety: Single writer eliminates race conditions without coordination
- Flexibility: Remote clients use HTTP, local apps use direct mmap
Pattern 2: Multi-Writer with Coordination
For multiple C++ writers, explicit synchronization is necessary:
// Acquire exclusive lock before writing (<fcntl.h>, <sys/file.h>)
int lock_fd = open("cache.maph.lock", O_CREAT | O_RDWR, 0644);
flock(lock_fd, LOCK_EX); // Exclusive lock (~2µs overhead)
auto db = maph::Maph::open("cache.maph", false);
db->set("key", "value");
flock(lock_fd, LOCK_UN); // Release lock
close(lock_fd);
Alternative coordination mechanisms:
- Application-level tokens (Redis, etcd)
- Partitioning (different processes own different key ranges; see the sketch below)
- Message queue coordination
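As one illustration of the partitioning option, writers can route keys by hash so that each process owns a disjoint share of the key space (a hypothetical sketch using std::hash; maph itself does not provide this routing):

#include <functional>
#include <string_view>

// Writer i handles only keys whose hash falls into its share
bool owns_key(std::string_view key, size_t writer_id, size_t num_writers) {
    return std::hash<std::string_view>{}(key) % num_writers == writer_id;
}

// Each writer process skips keys it does not own:
// if (owns_key(key, my_id, 4)) db->set(key, value);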
REST API Server
The provided REST API server demonstrates practical deployment:
# Start daemon
./maph_daemon --port 8080 --threads 4
# CLI operations
./maph_cli set mystore '{"key": "user:1"}' '{"name": "Alice", "age": 30}'
./maph_cli get mystore '{"key": "user:1"}'
# REST API
curl -X POST http://localhost:8080/stores \
-d '{"name": "products", "slots": 1000000}'
curl -X PUT 'http://localhost:8080/stores/products/keys/"sku:12345"' \
-d '{"name": "Laptop", "price": 999}'
curl 'http://localhost:8080/stores/products/keys/"sku:12345"'
# Optimize for perfect hash (O(1) guaranteed)
curl -X POST http://localhost:8080/stores/products/optimize
Access Pattern Performance Comparison
| Access Pattern | Latency | Use Case |
|---|---|---|
| Direct mmap (C++) | 300ns | Local high-performance reads |
| REST API (local) | 1-2ms | Remote or cross-language access |
| REST API (network) | 5-10ms | Distributed systems |
Hash Function Selection
FNV-1a Hash (Short Keys)
For keys under 16 bytes, FNV-1a provides excellent distribution with minimal overhead:
uint32_t fnv1a(const char* key, size_t len) {
uint32_t h = 2166136261u; // offset basis
for (size_t i = 0; i < len; ++i) {
h ^= (uint8_t)key[i];
h *= 16777619u; // FNV prime
}
return h;
}
Performance: 3-4 cycles per byte on modern CPUs.
xxHash (Long Keys)
For longer keys, xxHash processes 32-bit chunks for superior throughput:
- Processes 4 bytes per iteration
- Exploits instruction-level parallelism
- Achieves 13 GB/s on a single core
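A simplified sketch of the 4-byte-chunk idea (illustrative only, not the actual xxHash algorithm, which runs four parallel accumulators and a different finalization step):

#include <cstdint>
#include <cstring>

// Chunked hashing in the spirit of xxHash: consume 32 bits per iteration
// so consecutive multiplies can overlap in the pipeline
uint32_t chunked_hash(const char* key, size_t len) {
    uint32_t h = 0x9E3779B1u;                 // seed constant
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {            // 4 bytes per iteration
        uint32_t chunk;
        std::memcpy(&chunk, key + i, 4);      // unaligned-safe load
        h += chunk * 0x85EBCA77u;
        h = (h << 13) | (h >> 19);            // rotate left 13
        h *= 0xC2B2AE3Du;
    }
    for (; i < len; ++i)                      // tail bytes
        h = (h ^ (uint8_t)key[i]) * 0x27D4EB2Fu;
    return h;
}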
Durability Management
Asynchronous Durability
While mmap provides automatic persistence, explicit control is available:
class DurabilityManager {
    void sync_thread() {
        while (running) {
            std::this_thread::sleep_for(sync_interval);
            msync(mapped_region, region_size, MS_ASYNC); // Async writeback
        }
    }
    // running, sync_interval, mapped_region, region_size are member state
};
// Configure background sync
auto manager = store->start_durability_manager(
std::chrono::seconds(5), // Sync every 5 seconds
1000 // Or after 1000 operations
);
// Manual sync for strong durability
store->sync(); // MS_SYNC forces synchronous write
Copy-on-Write Snapshots
The OS’s copy-on-write mechanism enables efficient point-in-time backups:
- Fork the process to create a snapshot
- Child process inherits the memory mapping
- Pages are copied only when modified
- Snapshot remains consistent during export
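A minimal sketch of the fork-based approach, assuming the store file is already memory-mapped in the parent (export_to() is a hypothetical serialization helper):

#include <sys/wait.h>
#include <unistd.h>

// The child inherits a copy-on-write view of the mapping, so its export
// stays consistent even while the parent keeps writing
void snapshot_to(const char* backup_path) {
    pid_t pid = fork();
    if (pid == 0) {              // child: frozen CoW view of all pages
        export_to(backup_path);  // serialize the mapped region
        _exit(0);
    }
    waitpid(pid, nullptr, 0);    // parent: wait for the export to finish
}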
Parallel Scan Implementation
Table scans partition the slot array into fixed-size chunks that threads claim from a shared atomic counter, balancing load dynamically:
void parallel_scan(std::function<void(Slot&)> visitor, size_t num_threads) {
    std::atomic<size_t> next_chunk{0};
    constexpr size_t CHUNK_SIZE = 1024;
    const size_t num_chunks = (total_slots + CHUNK_SIZE - 1) / CHUNK_SIZE; // round up
    std::vector<std::thread> threads;
    for (size_t t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            size_t chunk;
            // Claim the next unprocessed chunk atomically
            while ((chunk = next_chunk.fetch_add(1)) < num_chunks) {
                size_t start = chunk * CHUNK_SIZE;
                size_t end = std::min(start + CHUNK_SIZE, total_slots);
                for (size_t i = start; i < end; ++i) {
                    if (!slots[i].empty()) {
                        visitor(slots[i]);
                    }
                }
            }
        });
    }
    for (auto& t : threads) t.join();
}
Performance: 100M items/sec for sequential memory access.
Quick Start
# Clone and build
git clone https://github.com/queelius/maph.git
cd maph
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DBUILD_TESTS=ON \
-DBUILD_REST_API=ON
make -j$(nproc)
# Run tests
ctest --verbose
# Install
sudo make install
C++ Library Usage
#include <maph.hpp>
// Create a new maph store
auto store = maph::Maph::create("mystore.maph", 1000000);
// Store JSON key-value pairs
store->set(R"({"user": "alice"})", R"({"score": 95})");
// Retrieve with O(1) lookup
auto value = store->get(R"({"user": "alice"})");
if (value) {
std::cout << *value << std::endl; // {"score": 95}
}
// Batch operations for high throughput
std::vector<std::pair<JsonView, JsonView>> batch = {
{R"({"id": 1})", R"({"name": "item1"})"},
{R"({"id": 2})", R"({"name": "item2"})"}
};
store->mset(batch);
// Parallel batch processing
store->parallel_mset(large_batch, 8); // Use 8 threads
Use Cases
- High-Performance Cache: Sub-microsecond lookups for hot data
- Feature Stores: ML feature serving with guaranteed latency
- Session Storage: Web session data with instant access
- Rate Limiting: Token bucket implementation with O(1) checks
- Deduplication: Fast duplicate detection in data streams
- Graph Databases: Adjacency list storage with perfect hashing
- Time-Series Index: Fast timestamp to data mapping
Feature Comparison with Other Systems
| System | Lookup | Space | Persistence | Concurrent | Use Case |
|---|---|---|---|---|---|
| maph | O(1) | Optimal | mmap | Lock-free | General K/V, high-perf |
| Redis | O(1) avg | 10× overhead | AOF/RDB | Single-thread | Cache, sessions |
| RocksDB | O(log n) | Compressed | SST files | Multi-thread | Large persistent data |
| Memcached | O(1) avg | In-memory | None | Multi-thread | Simple cache |
| LMDB | O(log n) | B+tree | mmap | MVCC | Transactional K/V |
Design Principles
🚀 Eliminate Kernel Crossings: Every system call incurs 100-200ns overhead
0️⃣ Zero-Copy: Data remains in place from storage to application
⚡ Hardware Parallelism: Leverage ILP (pipelining), SIMD, and multi-core
🎯 Optimize Common Case: Read-heavy workloads with temporal locality
Limitations and Trade-offs
Fixed Slot Size
Each storage instantiation uses a fixed slot size chosen at compile time. Applications with heterogeneous value sizes may experience space overhead. Mitigate by:
- Choosing appropriate slot sizes for workload
- Using multiple stores with different slot sizes
Dynamic Key Sets
Perfect hash optimization provides O(1) lookups only for keys in the original dataset. New keys use standard hashing with linear probing. Workloads with highly dynamic key sets must:
- Provision sufficient slots
- Periodically rebuild the perfect hash function (see the sketch below)
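A hedged sketch of periodic rebuilding, using the optimize() method described earlier; inserts_since_optimize() is a hypothetical counter, not part of the documented API:

#include <chrono>
#include <thread>

// Re-run perfect hash construction once enough new keys accumulate,
// so recently inserted keys regain single-access O(1) lookups
void rebuild_loop(maph::Maph* store, size_t threshold) {
    while (true) {
        std::this_thread::sleep_for(std::chrono::minutes(10)); // cadence
        if (store->inserts_since_optimize() > threshold) {
            store->optimize(); // rebuild the minimal perfect hash
        }
    }
}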
Single-Machine Scalability
maph operates on a single machine, limited by available memory and CPU cores. Applications requiring horizontal scaling must implement sharding at the application level.
Memory Overhead
maph uses 2.85× more memory than std::unordered_map due to fixed 512-byte slots versus variable-size allocations. This overhead is the price paid for:
- Predictable memory layout
- Efficient memory-mapped I/O
- Multi-process sharing
Future Directions
Runtime-Adaptive Slot Sizing
A hybrid approach with multiple slot classes (64B, 512B, 4KB), dynamically selected based on value size, could reduce waste while maintaining predictability.
Learned Indexing
Machine learning models could replace perfect hash functions, learning the key distribution online for better collision rates with skewed distributions.
Persistent Memory Integration
Intel Optane DC Persistent Memory provides byte-addressable persistence with DRAM-like latency. Adapting maph for persistent memory could eliminate the page cache layer.
Distributed Consensus
Integrating consensus protocols like Raft could enable distributed operation while maintaining consistency, with carefully designed protocols minimizing latency impact.
Resources
- Repository: github.com/queelius/maph
- Technical Report: docs/maph_technical_report.pdf
- REST API: See integrations/rest_api/ directory
- Examples: See examples/ directory
- Tests: Comprehensive test suite in tests/
License
MIT
maph: Where memory-mapped I/O meets perfect hashing for sub-microsecond key-value storage. Zero-copy architecture through mmap, O(1) guarantees through perfect hashing, and linear scalability through lock-free algorithms.