maph is a key-value store with sub-microsecond median lookup latency. The basic idea: memory-map the entire database file, use perfect hashing to locate keys in a single probe, and do everything lock-free with atomic operations. No kernel transitions on the read path. No copying. No locking.
The problem
Key-value stores hit three walls when you need microsecond-level latency:
- Kernel overhead. System calls cost 100-500ns per operation just for the context switch.
- Memory copying. Traditional stores copy data from kernel buffers to user space, between internal structures, for serialization. Each copy costs time.
- Synchronization. Lock-based concurrency creates contention and unpredictable tail latency.
For most applications, these costs are noise. For things like feature stores in ML inference pipelines, or anything where you’re doing thousands of lookups per request within a tight latency budget, they dominate.
Three techniques
1. Zero-copy via mmap
Memory-map the database file and let the CPU’s MMU handle address translation:
// Traditional approach: Multiple copies
std::vector<uint8_t> data = read_from_disk(key); // Kernel -> user copy
Value v = deserialize(data); // Another copy
process(v); // Yet another copy
// maph approach: Direct memory access
auto store = maph::Maph::open("mystore.maph");
auto value = store->get(key); // Zero copies. Direct pointer into mmap region
The kernel page cache handles persistence automatically. You get the illusion of in-memory access with durability.
2. Hybrid hash architecture
maph uses a hybrid hasher that combines perfect hashing with standard hashing. When you optimize (via the /optimize REST endpoint or optimize() in C++), it constructs a minimal perfect hash function for all current keys:
Known keys get O(1) worst-case lookup with exactly one memory access. No probing. Keys inserted after optimization fall back to FNV-1a with linear probing:
\[ \text{slot}_i = (h_s(k) + i) \bmod n, \quad i \in [0, \text{MAX\_PROBES} - 1] \]

The maximum probe distance (default 10) bounds worst-case latency. Both hash paths use the same slot array, so there's no static partitioning: the hybrid hasher checks whether a key was in the original optimized set and dispatches accordingly.
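The fallback path is easy to sketch in isolation. This is a standalone illustration of FNV-1a plus the probe formula above, not maph's internals:

```cpp
#include <cstdint>
#include <string_view>

// 32-bit FNV-1a over the key bytes: xor each byte in, then multiply.
uint32_t fnv1a(std::string_view key) {
    uint32_t h = 2166136261u;          // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 16777619u;                // FNV prime
    }
    return h;
}

// The probe sequence from the formula: slot_i = (h(k) + i) mod n.
uint64_t probe_slot(uint32_t h, uint64_t i, uint64_t n) {
    return (static_cast<uint64_t>(h) + i) % n;
}
```

A lookup tries `probe_slot(fnv1a(key), 0, n)`, then `i = 1`, and so on, giving up after MAX_PROBES attempts.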
3. Lock-free atomic operations
Every operation uses compare-and-swap and atomic versioning. Each slot has a 64-bit atomic value: key hash in the upper 32 bits, version counter in the lower 32.
// Lock-free read (seqlock pattern)
uint64_t v1 = slot.hash_version.load(std::memory_order_acquire);
uint32_t stored_hash = static_cast<uint32_t>(v1 >> 32);
if (stored_hash != hash_of(key)) return std::nullopt; // Key not found
// Copy the data, then re-read the version
auto data = copy(slot.data);
uint64_t v2 = slot.hash_version.load(std::memory_order_acquire);
if (v1 == v2 && (v1 & 1) == 0) { // Unchanged and even = consistent snapshot
    return data;
}
// Version changed or odd: a writer raced with us; retry
Writers use a two-phase protocol: increment version to odd (updating), write data, increment to even (done). Readers always see either the old value or the new value completely, never a partial update.
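Here's a self-contained sketch of that two-phase protocol on a toy slot type. Names and layout are illustrative, not maph's actual Slot; a production seqlock also has to contend with the formal data race on the raw copy, which this sketch papers over with a fence:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Toy seqlock slot: one 64-bit atomic packs hash (upper 32) and
// version (lower 32). Single writer, any number of readers.
struct DemoSlot {
    std::atomic<uint64_t> hash_version{0};
    char data[64];

    void write(uint32_t hash, const char* src, size_t len) {
        uint32_t ver = static_cast<uint32_t>(
            hash_version.load(std::memory_order_relaxed));
        // Phase 1: odd version tells readers a write is in flight
        hash_version.store((uint64_t(hash) << 32) | (ver + 1),
                           std::memory_order_release);
        std::memcpy(data, src, len);
        // Phase 2: even version marks the update complete
        hash_version.store((uint64_t(hash) << 32) | (ver + 2),
                           std::memory_order_release);
    }

    bool read(uint32_t expect_hash, char* dst, size_t len) {
        for (;;) {
            uint64_t v1 = hash_version.load(std::memory_order_acquire);
            if ((v1 >> 32) != expect_hash) return false; // key not here
            if (v1 & 1) continue;                        // write in flight
            std::memcpy(dst, data, len);
            std::atomic_thread_fence(std::memory_order_acquire);
            uint64_t v2 = hash_version.load(std::memory_order_acquire);
            if (v1 == v2) return true;                   // consistent copy
        }
    }
};
```

Readers never block the writer and vice versa; a reader that loses the race simply copies again.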
Numbers
Single-threaded latency
| Operation | p50 | p90 | p99 | p99.9 | p99.99 |
|---|---|---|---|---|---|
| Random GET | 351ns | 601ns | 972ns | 1,342ns | 7,844ns |
| Sequential GET | 221ns | 301ns | 752ns | 922ns | 8,736ns |
Sub-microsecond medians with tail latency under 10 microseconds.
Multi-threaded scaling
| Threads | Throughput (M ops/sec) | Avg Latency (ns) | Speedup | Efficiency |
|---|---|---|---|---|
| 1 | 2.69 | 347 | 1.00x | 100% |
| 2 | 5.71 | 325 | 2.12x | 106% |
| 4 | 11.90 | 309 | 4.42x | 110% |
| 8 | 17.24 | 375 | 6.40x | 80% |
| 16 | 98.00 | - | 36.43x | - |
Near-linear scaling up to 4 threads. The superlinear efficiency at low thread counts is from cache effects (more L3 cache in play).
Throughput comparison
| System | Throughput | Relative |
|---|---|---|
| maph | 10M ops/sec | baseline |
| Redis | 0.8M ops/sec | 12x slower |
| RocksDB | 0.11M ops/sec | 87x slower |
| Memcached | 1.6M ops/sec | 6x slower |
These are single-machine numbers. maph is not a distributed system. That’s an important caveat.
Memory layout
Each slot is 512 bytes, aligned to 64-byte cache lines:
Slot Structure (512 bytes):
+------------------------------+
| Metadata (16 bytes)          |
|  - hash_version (8B atomic)  |  upper 32 bits: hash identifier,
|                              |  lower 32 bits: version counter
|  - size (4B)                 |  value size in bytes
|  - reserved (4B)             |
+------------------------------+
| Data (496 bytes)             |  actual value storage
+------------------------------+
Why 512? It’s 8 cache lines (prevents false sharing), fits typical JSON documents (100-400 bytes), makes address computation trivial (base + index * 512), and plays well with SIMD. The size is configurable at compile time:
mmap_storage<128> counters; // Small values
mmap_storage<512> json_store; // Default
mmap_storage<4096> documents; // Large values
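For reference, a slot matching the layout above can be expressed as a compile-time-sized struct. Field names here are assumptions mirroring the diagram, not maph's actual definitions:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Fixed-size slot: 16 bytes of metadata, the rest is value payload.
// alignas(64) keeps each slot on its own set of cache lines.
template <size_t SlotSize>
struct alignas(64) Slot {
    std::atomic<uint64_t> hash_version; // upper 32: hash, lower 32: version
    uint32_t size;                      // value size in bytes
    uint32_t reserved;
    uint8_t  data[SlotSize - 16];       // value payload

    static constexpr size_t capacity() { return SlotSize - 16; }
};

// Address computation stays trivial: base + index * SlotSize.
static_assert(sizeof(Slot<512>) == 512, "8 cache lines, no hidden padding");
static_assert(sizeof(Slot<128>) == 128, "small-value variant packs the same way");
```

The static_asserts are the point: if padding ever sneaks in, the `base + index * 512` arithmetic would silently address the wrong slot, so the layout is checked at compile time.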
SIMD batch operations
AVX2 lets you process 8 keys in parallel:
// Hash 8 keys in parallel. Two helpers are elided here: find_min_length
// returns the shortest key length in the group, and gather_chars loads
// byte j of each of the 8 keys into one 32-bit lane apiece.
void compute_batch_avx2(const char** keys, uint32_t* hashes, size_t count) {
    __m256i fnv_prime = _mm256_set1_epi32(16777619u);
    __m256i fnv_offset = _mm256_set1_epi32(2166136261u);
    for (size_t i = 0; i + 8 <= count; i += 8) {
        __m256i h = fnv_offset;
        // Vectorize over the common prefix; bytes beyond min_len need a
        // per-key scalar tail pass (not shown)
        size_t min_len = find_min_length(keys + i, 8);
        for (size_t j = 0; j < min_len; ++j) {
            __m256i chars = gather_chars(keys + i, j);
            h = _mm256_xor_si256(h, chars);      // FNV-1a: xor, then multiply
            h = _mm256_mullo_epi32(h, fnv_prime);
        }
        _mm256_storeu_si256((__m256i*)(hashes + i), h);
    }
}
This gets roughly 5x throughput improvement for batch operations, around 50M keys/sec for parallel lookups.
Deployment patterns
The mmap architecture enables a useful pattern: single writer, multiple readers.
+----------------------------------+
| Shared mmap Files (data/*.maph) |
+----------+-------------+---------+
| |
(write)| |(read-only)
+--------v---+ +------v----------+
| REST API | | C++ App |
| Server | | (direct mmap) |
+------------+ +-----------------+
|
(HTTP)|
+-----v----------+
| Remote Clients |
+----------------+
Local C++ applications can open the mmap file read-only and get 300ns latency. That’s roughly 5,000x faster than going through HTTP. Remote clients use the REST API. The single-writer constraint eliminates race conditions without coordination overhead.
// Local C++ application: Direct read-only access (~300ns latency)
auto db = maph::Maph::open("/var/lib/maph/data/cache.maph",
true); // readonly
auto value = db->get("hot_key"); // Zero IPC overhead
For multi-writer scenarios, you need explicit coordination (file locks, partitioning, or application-level tokens).
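One simple way to make the single-writer rule explicit (a hypothetical scheme, not something built into maph) is an advisory flock() on the store file:

```cpp
#include <string>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Try to become the writer for a store file. Returns the fd on success
// (hold it open for the writer's lifetime), or -1 if another process
// already holds the exclusive lock.
int acquire_writer_lock(const std::string& path) {
    int fd = ::open(path.c_str(), O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (::flock(fd, LOCK_EX | LOCK_NB) != 0) { // non-blocking exclusive lock
        ::close(fd);
        return -1;                             // someone else is the writer
    }
    return fd;
}
```

Because the lock is advisory, readers skip it entirely and keep their lock-free path; only would-be writers pay the syscall, once, at startup.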
REST API
The REST server provides a standard interface:
# Start daemon
./maph_daemon --port 8080 --threads 4
# CLI operations
./maph_cli set mystore '{"key": "user:1"}' '{"name": "Alice", "age": 30}'
./maph_cli get mystore '{"key": "user:1"}'
# REST API
curl -X POST http://localhost:8080/stores \
-d '{"name": "products", "slots": 1000000}'
curl -X PUT 'http://localhost:8080/stores/products/keys/"sku:12345"' \
-d '{"name": "Laptop", "price": 999}'
curl 'http://localhost:8080/stores/products/keys/"sku:12345"'
# Optimize for perfect hash (O(1) guaranteed)
curl -X POST http://localhost:8080/stores/products/optimize
| Access Pattern | Latency | Use Case |
|---|---|---|
| Direct mmap (C++) | 300ns | Local high-performance reads |
| REST API (local) | 1-2ms | Remote or cross-language access |
| REST API (network) | 5-10ms | Distributed systems |
Durability
mmap provides automatic persistence through the kernel page cache, but you can control sync behavior:
class DurabilityManager {
void sync_thread() {
while (running) {
sleep_for(sync_interval);
msync(mapped_region, region_size, MS_ASYNC); // Async writeback
}
}
};
// Configure background sync
auto manager = store->start_durability_manager(
std::chrono::seconds(5), // Sync every 5 seconds
1000 // Or after 1000 operations
);
// Manual sync for strong durability
store->sync(); // MS_SYNC forces synchronous write
Copy-on-write snapshots come cheap: fork the process and the child inherits the mapping, taking a consistent view; with a private mapping, pages are copied only when modified.
Parallel scans
Table scans partition the slot array across threads with work-stealing:
void parallel_scan(std::function<void(Slot&)> visitor, size_t num_threads) {
    std::atomic<size_t> next_chunk{0};
    constexpr size_t CHUNK_SIZE = 1024;
    // Round up so the final partial chunk is not dropped
    const size_t num_chunks = (total_slots + CHUNK_SIZE - 1) / CHUNK_SIZE;
    std::vector<std::thread> threads;
    for (size_t t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            size_t chunk;
            while ((chunk = next_chunk.fetch_add(1)) < num_chunks) {
                size_t start = chunk * CHUNK_SIZE;
                size_t end = std::min(start + CHUNK_SIZE, total_slots);
                for (size_t i = start; i < end; ++i) {
                    if (!slots[i].empty()) {
                        visitor(slots[i]);
                    }
                }
            }
        });
    }
    for (auto& t : threads) t.join();
}
Sequential memory access gives about 100M items/sec for scans.
Trade-offs and limitations
I should be honest about what maph is not.
Fixed slot size. Each store uses one slot size, chosen at compile time. If your values vary wildly in size, you’ll waste space. You can mitigate this with multiple stores at different slot sizes, but that’s added complexity.
Dynamic key sets. Perfect hashing only covers keys present at optimization time. New keys use standard hashing with probing. If your key set churns heavily, you need to re-optimize periodically.
Single machine. maph doesn’t do distribution. If you need horizontal scaling, you’re implementing sharding yourself.
Memory overhead. 2.85x more memory than std::unordered_map due to the fixed 512-byte slots. That’s the price for predictable layout, mmap compatibility, and multi-process sharing.
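The re-optimization point above can be automated with a simple counter. This is a hypothetical policy sketch, not a maph API: track what fraction of lookups fall back to the probing path, and trigger optimize() once it crosses a threshold:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical re-optimization trigger: count lookups served by the
// probing fallback vs. the perfect-hash path.
struct ReoptPolicy {
    std::atomic<uint64_t> total{0};
    std::atomic<uint64_t> fallback{0};

    void record(bool used_fallback) {
        total.fetch_add(1, std::memory_order_relaxed);
        if (used_fallback) fallback.fetch_add(1, std::memory_order_relaxed);
    }

    // True once more than `threshold` of traffic misses the perfect hash.
    bool should_reoptimize(double threshold = 0.10) const {
        uint64_t t = total.load(std::memory_order_relaxed);
        if (t == 0) return false;
        uint64_t f = fallback.load(std::memory_order_relaxed);
        return static_cast<double>(f) / static_cast<double>(t) > threshold;
    }
};
```

A background thread can poll `should_reoptimize()` and call the store's optimize() off the hot path, then reset the counters.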
Quick start
git clone https://github.com/queelius/maph.git
cd maph
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DBUILD_TESTS=ON \
-DBUILD_REST_API=ON
make -j$(nproc)
ctest --verbose
sudo make install
#include <maph.hpp>
auto store = maph::Maph::create("mystore.maph", 1000000);
store->set(R"({"user": "alice"})", R"({"score": 95})");
auto value = store->get(R"({"user": "alice"})");
if (value) {
std::cout << *value << std::endl;
}
// Batch operations
std::vector<std::pair<JsonView, JsonView>> batch = {
{R"({"id": 1})", R"({"name": "item1"})"},
{R"({"id": 2})", R"({"name": "item2"})"}
};
store->mset(batch);
// Parallel batch processing
store->parallel_mset(large_batch, 8); // Use 8 threads
Feature comparison
| System | Lookup | Space | Persistence | Concurrent | Notes |
|---|---|---|---|---|---|
| maph | O(1) | Optimal | mmap | Lock-free | General K/V |
| Redis | O(1) avg | 10x overhead | AOF/RDB | Single-thread | Cache, sessions |
| RocksDB | O(log n) | Compressed | SST files | Multi-thread | Large persistent data |
| Memcached | O(1) avg | In-memory | None | Multi-thread | Simple cache |
| LMDB | O(log n) | B+tree | mmap | MVCC | Transactional K/V |
Resources
- github.com/queelius/maph
- Technical report
- REST API: see integrations/rest_api/
- Examples: see examples/
MIT licensed.