
maph: Maps Based on Perfect Hashing for Sub-Microsecond Key-Value Storage

maph (Map based on Perfect Hash) is a high-performance key-value storage system that achieves sub-microsecond latency through a novel combination of memory-mapped I/O, approximate perfect hashing, and lock-free atomic operations. Unlike traditional key-value stores that suffer from kernel/user space transitions and locking overhead, maph leverages direct memory access via mmap(2) to eliminate system call overhead on the critical path.

The Performance Problem

Modern distributed systems depend on key-value stores for caching, session management, and metadata storage. However, existing solutions face fundamental limitations when microsecond-level latency becomes critical:

Three fundamental bottlenecks:

  1. Kernel overhead: System calls require expensive context switches (100-500ns per operation)
  2. Memory copying: Traditional stores copy data multiple times—from kernel buffers to user space, between internal structures, for serialization
  3. Synchronization overhead: Lock-based concurrency creates contention and unpredictable latency spikes

These limitations are particularly acute in:

  • High-frequency trading: Every nanosecond translates to competitive advantage
  • ML feature stores: Inference pipelines require thousands of features within tight latency budgets
  • Real-time gaming and IoT: Predictable response times under concurrent access patterns

Core Innovation: Three Key Techniques

1. Zero-Copy Architecture via mmap

By memory-mapping the entire database file, maph eliminates kernel/user space transitions:

// Traditional approach: Multiple copies
std::vector<uint8_t> data = read_from_disk(key);  // Kernel → user copy
Value v = deserialize(data);                      // Another copy
process(v);                                        // Yet another copy

// maph approach: Direct memory access
auto store = maph::Maph::open("mystore.maph");
auto value = store->get(key);  // Zero copies! Direct pointer into mmap region

Key insight: The CPU’s memory management unit (MMU) handles address translation transparently, providing the illusion of in-memory access with automatic persistence through the kernel page cache.
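
To make the zero-copy idea concrete, here is a minimal sketch of mapping a file and reading it through a plain pointer. This is not maph's internal code; error handling is simplified and the file is assumed to already exist.

// Minimal sketch: map a file read-only and access it through a raw pointer.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdint>

const uint8_t* map_readonly(const char* path, size_t& size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st{};
    fstat(fd, &st);
    size_out = static_cast<size_t>(st.st_size);

    // After mmap, reads go straight through the page cache: no read() system
    // calls and no kernel-to-user copies on the hot path.
    void* base = mmap(nullptr, size_out, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // The mapping stays valid after the descriptor is closed
    return base == MAP_FAILED ? nullptr : static_cast<const uint8_t*>(base);
}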

2. Hybrid Hash Architecture

maph uses a hybrid hasher that dynamically combines perfect hashing with standard hashing:

When optimized (via the /optimize REST API endpoint or optimize() C++ method), maph constructs a minimal perfect hash function for all keys present at optimization time:

\[ \forall k_i, k_j \in K_{\text{known}}, i \neq j : h_p(k_i) \neq h_p(k_j) \]

For known keys: The perfect hasher provides O(1) worst-case lookup with exactly one memory access—no probing required.

For new keys (inserted after optimization): The hybrid hasher automatically falls back to standard FNV-1a hashing with linear probing:

\[ \text{slot}_i = (h_s(k) + i) \bmod n, \quad i \in [0, \text{MAX\_PROBES} - 1] \]

The maximum probe distance (default 10) ensures bounded worst-case latency.

Key insight: The same slot array serves both purposes. The hybrid hasher checks if a key was in the original optimized set (is_perfect_for(key)), and uses the appropriate hash function accordingly. No static partitioning is needed.
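
The lookup path can be sketched as follows. The names hasher, is_perfect_for, perfect_index, read_slot, slot_empty, and num_slots are placeholders standing in for maph's internals; only the overall flow reflects the description above.

// Sketch of the hybrid lookup path (assumed helper names; simplified).
std::optional<std::string_view> lookup(std::string_view key) {
    if (hasher.is_perfect_for(key)) {
        // Key was present at optimization time: exactly one slot to check.
        return read_slot(hasher.perfect_index(key), key);
    }
    // Fallback for keys inserted after optimize(): FNV-1a + linear probing.
    uint64_t h = fnv1a(key.data(), key.size());
    for (size_t i = 0; i < MAX_PROBES; ++i) {        // default MAX_PROBES = 10
        size_t slot = (h + i) % num_slots;           // slot_i = (h_s(k) + i) mod n
        if (auto v = read_slot(slot, key)) return v; // hash and key match
        if (slot_empty(slot)) return std::nullopt;   // empty slot: key not present
    }
    return std::nullopt;                             // bounded worst case reached
}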

3. Lock-Free Atomic Operations

All operations use compare-and-swap (CAS) primitives and atomic versioning:

Atomic Slot Versioning: Each slot maintains a 64-bit atomic value combining key hash (high 32 bits) and version number (low 32 bits):

// Lock-free read (seqlock-style)
uint64_t v1 = slot.hash_version.load(std::memory_order_acquire);
uint32_t stored_hash = static_cast<uint32_t>(v1 >> 32);
if (stored_hash != hash_of(key)) return std::nullopt;  // Key not found

// Copy the data, then re-check the version
auto data = copy(slot.data);
uint64_t v2 = slot.hash_version.load(std::memory_order_acquire);

if (v1 == v2 && (v1 & 1) == 0) {  // Unchanged, even version = consistent snapshot
    return data;
}
// Version changed mid-copy: retry

Writers use a two-phase protocol:

  1. Increment version to odd (marking slot as updating)
  2. Write new data
  3. Increment version to even (marking update complete)

Readers either see old value or new value completely—never partial updates.
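
A sketch of the corresponding writer path is shown below. Field names follow the slot layout described in the Memory Layout section; the function signature and helper details are illustrative, not maph's actual API.

// Sketch of the two-phase writer (illustrative names; single slot, no collision handling).
bool write_slot(Slot& slot, uint32_t key_hash, const void* value, uint32_t len) {
    uint64_t cur = slot.hash_version.load(std::memory_order_relaxed);
    if (cur & 1) return false;                        // another writer is mid-update

    // Phase 1: CAS the version to odd (key hash kept in the high 32 bits),
    // claiming the slot; readers now treat it as "updating".
    uint64_t odd = (uint64_t(key_hash) << 32) | (uint32_t(cur) + 1u);
    if (!slot.hash_version.compare_exchange_strong(cur, odd,
                                                   std::memory_order_acq_rel))
        return false;                                 // lost the race; caller retries

    // Phase 2: write the new payload while the version is odd.
    std::memcpy(slot.data.data(), value, len);
    slot.size = len;

    // Phase 3: bump the version back to even, publishing the update to readers.
    slot.hash_version.store((uint64_t(key_hash) << 32) | (uint32_t(odd) + 1u),
                            std::memory_order_release);
    return true;
}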

Performance Results

Single-Threaded Latency

| Operation      | p50   | p90   | p99   | p99.9   | p99.99  |
|----------------|-------|-------|-------|---------|---------|
| Random GET     | 351ns | 601ns | 972ns | 1,342ns | 7,844ns |
| Sequential GET | 221ns | 301ns | 752ns | 922ns   | 8,736ns |

Sub-microsecond median latency with predictable tail latency under 10µs.

Multi-Threaded Scaling

| Threads | Throughput (M ops/sec) | Avg Latency (ns) | Speedup | Efficiency |
|---------|------------------------|------------------|---------|------------|
| 1       | 2.69                   | 347              | 1.00×   | 100%       |
| 2       | 5.71                   | 325              | 2.12×   | 106%       |
| 4       | 11.90                  | 309              | 4.42×   | 110%       |
| 8       | 17.24                  | 375              | 6.40×   | 80%        |
| 16      | 98.00                  | -                | 36.43×  | -          |

Near-linear scaling up to 4 threads (110% efficiency), demonstrating the effectiveness of lock-free design.

Throughput Comparison with Other Systems

| System    | Throughput    | Relative to maph |
|-----------|---------------|------------------|
| maph      | 10M ops/sec   | 1.00× (baseline) |
| Redis     | 0.8M ops/sec  | 12× slower       |
| RocksDB   | 0.11M ops/sec | 87× slower       |
| Memcached | 1.6M ops/sec  | 6× slower        |

Memory Layout and Slot Design

Fixed-Size Slots

Each slot is 512 bytes, aligned to 64-byte cache lines:

Slot Structure (512 bytes):
┌─────────────────────────────┐
│ Metadata (16 bytes)         │
│  - hash_version (8B atomic) │  Upper 32 bits: hash identifier
│  - size (4B)                │  Lower 32 bits: version counter
│  - reserved (4B)            │
├─────────────────────────────┤
│ Data (496 bytes)            │  Actual value storage
└─────────────────────────────┘
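
In C++ terms, the layout corresponds roughly to the following struct (a sketch; requires <atomic>, <array>, <cstdint>, and maph's actual field names may differ):

// Sketch of the slot layout: 16 bytes of metadata + 496 bytes of payload.
struct alignas(64) Slot {
    std::atomic<uint64_t> hash_version;  // upper 32 bits: hash, lower 32 bits: version
    uint32_t size;                       // bytes of payload actually in use
    uint32_t reserved;
    std::array<uint8_t, 496> data;       // value storage
};
static_assert(sizeof(Slot) == 512, "slot must span exactly 8 cache lines");

// Slot addresses are computed directly, with no indirection:
//   slot_addr = base_addr + slot_index * sizeof(Slot)
inline Slot* slot_at(uint8_t* base, size_t index) {
    return reinterpret_cast<Slot*>(base + index * sizeof(Slot));
}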

Why 512 bytes?

  • Cache alignment: 8× 64-byte cache lines prevents false sharing
  • Typical JSON docs: 100-400 bytes with acceptable overhead
  • Direct computation: slot_addr = base_addr + slot_index × 512
  • SIMD friendly: Enables vectorized operations

The fixed size is configurable at compile time:

// Small values
mmap_storage<128> counters;

// Default
mmap_storage<512> json_store;

// Large values
mmap_storage<4096> documents;

Architecture Components

File Structure:
┌────────────────┐
│ Header (512B)  │  Magic, version, slot counts, generation
├────────────────┤
│ Slot 0 (512B)  │  Hash (32b) + Version (32b) + Data (496B)
├────────────────┤
│ Slot 1 (512B)  │
├────────────────┤
│      ...       │
└────────────────┘
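
The header fields named above can be sketched as a struct. The field names and exact layout here are illustrative assumptions, not maph's on-disk definition:

// Sketch of the file header (padded to one 512-byte block).
struct Header {
    uint64_t magic;               // file format identifier
    uint32_t version;             // on-disk format version
    uint32_t reserved;
    uint64_t total_slots;         // number of slots following the header
    uint64_t used_slots;          // slots currently holding a value
    uint64_t generation;          // bumped on structural changes such as optimize()
    uint8_t  padding[512 - 40];   // pad the header to 512 bytes
};
static_assert(sizeof(Header) == 512, "header occupies one 512-byte block");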

SIMD Batch Operations

AVX2 instructions enable parallel processing of 8 keys simultaneously:

// Hash 8 keys per iteration with AVX2. gather_chars loads the j-th byte of
// each of the 8 keys into one vector, and find_min_length returns the length
// of the shortest key in the group (helper routines not shown here).
void compute_batch_avx2(const char** keys, uint32_t* hashes, size_t count) {
    __m256i fnv_prime  = _mm256_set1_epi32(16777619);                       // FNV prime
    __m256i fnv_offset = _mm256_set1_epi32(static_cast<int>(2166136261u));  // FNV offset basis

    for (size_t i = 0; i + 8 <= count; i += 8) {
        __m256i h = fnv_offset;
        size_t min_len = find_min_length(keys + i, 8);

        for (size_t j = 0; j < min_len; ++j) {
            __m256i chars = gather_chars(keys + i, j);
            h = _mm256_xor_si256(h, chars);
            h = _mm256_mullo_epi32(h, fnv_prime);
        }

        _mm256_storeu_si256((__m256i*)(hashes + i), h);
    }
}

Result: ~5× throughput improvement for batch operations, achieving 50M keys/sec for parallel lookups.

Hybrid Deployment Architecture

The mmap-based design enables flexible deployment patterns:

+----------------------------------+
| Shared mmap Files (data/*.maph)  |
+----------+-------------+---------+
           |             |
   (write) |             | (read-only)
   +-------v----+  +-----v-----------+
   | REST API   |  | C++ App         |
   | Server     |  | (direct mmap)   |
   +------------+  +-----------------+
         |
  (HTTP) |
   +-----v----------+
   | Remote Clients |
   +----------------+

Pattern 1: Single Writer, Multiple Direct Readers

Direct mmap read access (C++ app):

// Local C++ application: Direct read-only access (~300ns latency)
auto db = maph::Maph::open("/var/lib/maph/data/cache.maph",
                           true);  // readonly
auto value = db->get("hot_key");  // Zero IPC overhead!

Benefits:

  • Zero-IPC overhead: Local reads bypass HTTP entirely (300ns vs 1-2ms)
  • 5,000× speedup: Direct mmap reads are ~5,000× faster than HTTP requests
  • Safety: Single writer eliminates race conditions without coordination
  • Flexibility: Remote clients use HTTP, local apps use direct mmap

Pattern 2: Multi-Writer with Coordination

For multiple C++ writers, explicit synchronization is necessary:

// Acquire an exclusive advisory lock before writing
// (needs <fcntl.h>, <sys/file.h>, <unistd.h>)
int lock_fd = open("cache.maph.lock", O_CREAT | O_RDWR, 0644);
flock(lock_fd, LOCK_EX);  // Exclusive lock (~2µs overhead)

auto db = maph::Maph::open("cache.maph", false);  // false = read-write
db->set("key", "value");

flock(lock_fd, LOCK_UN);  // Release lock
close(lock_fd);

Alternative coordination mechanisms:

  • Application-level tokens (Redis, etcd)
  • Partitioning (different processes own different key ranges)
  • Message queue coordination

REST API Server

The provided REST API server demonstrates practical deployment:

# Start daemon
./maph_daemon --port 8080 --threads 4

# CLI operations
./maph_cli set mystore '{"key": "user:1"}' '{"name": "Alice", "age": 30}'
./maph_cli get mystore '{"key": "user:1"}'

# REST API
curl -X POST http://localhost:8080/stores \
  -d '{"name": "products", "slots": 1000000}'

curl -X PUT 'http://localhost:8080/stores/products/keys/"sku:12345"' \
  -d '{"name": "Laptop", "price": 999}'

curl 'http://localhost:8080/stores/products/keys/"sku:12345"'

# Optimize for perfect hash (O(1) guaranteed)
curl -X POST http://localhost:8080/stores/products/optimize

Access Pattern Performance Comparison

| Access Pattern     | Latency | Use Case                        |
|--------------------|---------|---------------------------------|
| Direct mmap (C++)  | 300ns   | Local high-performance reads    |
| REST API (local)   | 1-2ms   | Remote or cross-language access |
| REST API (network) | 5-10ms  | Distributed systems             |

Hash Function Selection

FNV-1a Hash (Short Keys)

For keys under 16 bytes, FNV-1a provides excellent distribution with minimal overhead:

uint32_t fnv1a(const char* key, size_t len) {
    uint32_t h = 2166136261u;  // offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= (uint8_t)key[i];
        h *= 16777619u;  // FNV prime
    }
    return h;
}

Performance: 3-4 cycles per byte on modern CPUs.

xxHash (Long Keys)

For longer keys, xxHash processes 32-bit chunks for superior throughput:

  • Processes 4 bytes per iteration
  • Exploits instruction-level parallelism
  • Achieves 13 GB/s on a single core
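
A sketch of the length-based selection, assuming the xxHash library's XXH32() for the long-key path; the 16-byte threshold follows the text above, and maph's actual cutoff and seed may differ:

#include <xxhash.h>  // provides XXH32()

// Dispatch on key length: FNV-1a for short keys, xxHash for longer ones.
uint32_t hash_key(const char* key, size_t len) {
    constexpr size_t kShortKeyThreshold = 16;
    return len < kShortKeyThreshold
        ? fnv1a(key, len)              // defined above
        : XXH32(key, len, /*seed=*/0);
}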

Durability Management

Asynchronous Durability

While mmap provides automatic persistence, explicit control is available:

class DurabilityManager {
    std::atomic<bool> running{true};
    std::chrono::milliseconds sync_interval;
    void* mapped_region;
    size_t region_size;

    void sync_thread() {
        while (running) {
            std::this_thread::sleep_for(sync_interval);
            msync(mapped_region, region_size, MS_ASYNC);  // Schedule async writeback of dirty pages
        }
    }
};

// Configure background sync
auto manager = store->start_durability_manager(
    std::chrono::seconds(5),  // Sync every 5 seconds
    1000                       // Or after 1000 operations
);

// Manual sync for strong durability
store->sync();  // MS_SYNC forces synchronous write

Copy-on-Write Snapshots

The OS’s copy-on-write mechanism enables efficient point-in-time backups:

  1. Fork the process to create a snapshot
  2. Child process inherits the memory mapping
  3. Pages are copied only when modified
  4. Snapshot remains consistent during export
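
A minimal sketch of this fork-based snapshot is shown below; export_snapshot() is a hypothetical routine standing in for whatever code walks the mapped slots and writes them to a backup file.

#include <sys/wait.h>
#include <unistd.h>

void export_snapshot(const char* path);  // hypothetical: serializes the mapped region

// Fork-based copy-on-write snapshot (sketch).
void snapshot(const char* backup_path) {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: inherits the mapping; its view stays consistent as of fork().
        export_snapshot(backup_path);
        _exit(0);
    }
    // Parent keeps serving traffic; the kernel copies pages lazily as they are
    // modified, so snapshot cost is proportional to the write rate during export.
    waitpid(pid, nullptr, 0);  // or reap asynchronously via SIGCHLD
}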

Parallel Scan Implementation

Table scans partition the slot array into fixed-size chunks that worker threads claim dynamically from a shared atomic counter:

void parallel_scan(std::function<void(Slot&)> visitor, size_t num_threads) {
    std::atomic<size_t> next_chunk{0};
    constexpr size_t CHUNK_SIZE = 1024;

    std::vector<std::thread> threads;
    for (size_t t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            // Round up so a final partial chunk is still scanned
            const size_t num_chunks = (total_slots + CHUNK_SIZE - 1) / CHUNK_SIZE;
            size_t chunk;
            while ((chunk = next_chunk.fetch_add(1)) < num_chunks) {
                size_t start = chunk * CHUNK_SIZE;
                size_t end = std::min(start + CHUNK_SIZE, total_slots);

                for (size_t i = start; i < end; ++i) {
                    if (!slots[i].empty()) {
                        visitor(slots[i]);
                    }
                }
            }
        });
    }

    for (auto& t : threads) t.join();
}

Performance: 100M items/sec for sequential memory access.

Quick Start

# Clone and build
git clone https://github.com/queelius/maph.git
cd maph
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DBUILD_TESTS=ON \
         -DBUILD_REST_API=ON
make -j$(nproc)

# Run tests
ctest --verbose

# Install
sudo make install

C++ Library Usage

#include <maph.hpp>

// Create a new maph store
auto store = maph::Maph::create("mystore.maph", 1000000);

// Store JSON key-value pairs
store->set(R"({"user": "alice"})", R"({"score": 95})");

// Retrieve with O(1) lookup
auto value = store->get(R"({"user": "alice"})");
if (value) {
    std::cout << *value << std::endl;  // {"score": 95}
}

// Batch operations for high throughput
std::vector<std::pair<JsonView, JsonView>> batch = {
    {R"({"id": 1})", R"({"name": "item1"})"},
    {R"({"id": 2})", R"({"name": "item2"})"}
};
store->mset(batch);

// Parallel batch processing
store->parallel_mset(large_batch, 8);  // Use 8 threads

Use Cases

  • High-Performance Cache: Sub-microsecond lookups for hot data
  • Feature Stores: ML feature serving with guaranteed latency
  • Session Storage: Web session data with instant access
  • Rate Limiting: Token bucket implementation with O(1) checks
  • Deduplication: Fast duplicate detection in data streams
  • Graph Databases: Adjacency list storage with perfect hashing
  • Time-Series Index: Fast timestamp to data mapping

Comparison with Other Systems

| System    | Lookup   | Space        | Persistence | Concurrency   | Use Case               |
|-----------|----------|--------------|-------------|---------------|------------------------|
| maph      | O(1)     | Optimal      | mmap        | Lock-free     | General K/V, high-perf |
| Redis     | O(1) avg | 10× overhead | AOF/RDB     | Single-thread | Cache, sessions        |
| RocksDB   | O(log n) | Compressed   | SST files   | Multi-thread  | Large persistent data  |
| Memcached | O(1) avg | In-memory    | None        | Multi-thread  | Simple cache           |
| LMDB      | O(log n) | B+tree       | mmap        | MVCC          | Transactional K/V      |

Design Principles

🚀 Eliminate Kernel Crossings: Every system call incurs 100-200ns overhead

0️⃣ Zero-Copy: Data remains in place from storage to application

⚡ Hardware Parallelism: Leverage ILP (pipelining), SIMD, and multi-core

🎯 Optimize Common Case: Read-heavy workloads with temporal locality

Limitations and Trade-offs

Fixed Slot Size

Each storage instantiation uses a fixed slot size chosen at compile time. Applications with heterogeneous value sizes may experience space overhead. Mitigate by:

  • Choosing appropriate slot sizes for workload
  • Using multiple stores with different slot sizes

Dynamic Key Sets

Perfect hash optimization provides O(1) lookups only for keys in the original dataset. New keys use standard hashing with linear probing. Workloads with highly dynamic key sets must:

  • Provision sufficient slots
  • Periodically rebuild perfect hash function
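
One way to do the periodic rebuild is a background thread that calls optimize() on a schedule. The sketch below assumes store is the handle from the C++ usage example and running is an application-owned std::atomic<bool>; the interval and shutdown handling are application-specific.

// Background re-optimization sketch: keys inserted since the last rebuild
// regain O(1) perfect-hash lookups after each optimize() call.
std::thread rebuilder([&store, &running] {
    while (running.load()) {
        std::this_thread::sleep_for(std::chrono::minutes(10));
        store->optimize();  // rebuild the perfect hash over the current key set
    }
});
// Join (or detach) the thread during application shutdown.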

Single-Machine Scalability

maph operates on a single machine, limited by available memory and CPU cores. Applications requiring horizontal scaling must implement sharding at the application level.

Memory Overhead

maph uses 2.85× more memory than std::unordered_map due to fixed 512-byte slots versus variable-size allocations. This overhead is the price paid for:

  • Predictable memory layout
  • Efficient memory-mapped I/O
  • Multi-process sharing

Future Directions

Runtime-Adaptive Slot Sizing

Hybrid approach with multiple slot classes (64B, 512B, 4KB) dynamically selected based on value size could reduce waste while maintaining predictability.

Learned Indexing

Machine learning models could replace perfect hash functions, learning the key distribution online for better collision rates with skewed distributions.

Persistent Memory Integration

Intel Optane DC Persistent Memory provides byte-addressable persistence with DRAM-like latency. Adapting maph for persistent memory could eliminate the page cache layer.

Distributed Consensus

Integrating consensus protocols like Raft could enable distributed operation while maintaining consistency, with carefully designed protocols minimizing latency impact.

License

MIT


maph: Where memory-mapped I/O meets perfect hashing for sub-microsecond key-value storage. Zero-copy architecture through mmap, O(1) guarantees through perfect hashing, and linear scalability through lock-free algorithms.
