Space-efficient approximate mappings using perfect hash functions
Maph (Memory-mapped Approximate Perfect Hash) is a high-performance key-value database that combines memory-mapped I/O with approximate perfect hashing to achieve sub-microsecond lookups. The system is designed for read-heavy workloads where lookup performance is critical.
┌─────────────────────────────────────────┐
│ Application Layer │
│ (REST API, CLI, Client Libraries) │
├─────────────────────────────────────────┤
│ Maph Core API │
│ (get, set, remove, batch operations) │
├─────────────────────────────────────────┤
│ Hash Function Layer │
│ (FNV-1a, SIMD optimization) │
├─────────────────────────────────────────┤
│ Storage Management │
│ (Slot allocation, version control) │
├─────────────────────────────────────────┤
│ Memory-Mapped I/O (mmap) │
│ (OS page cache, virtual memory) │
├─────────────────────────────────────────┤
│ File System │
│ (Persistent storage) │
└─────────────────────────────────────────┘
The database file consists of a header followed by a fixed array of slots:
┌────────────────┐
│ Header │ 512 bytes
│ (Metadata) │
├────────────────┤
│ Slot 0 │ 512 bytes
├────────────────┤
│ Slot 1 │ 512 bytes
├────────────────┤
│ ... │
├────────────────┤
│ Slot N-1 │ 512 bytes
└────────────────┘
| Field | Size | Description |
|---|---|---|
| magic | 4 bytes | File format identifier (0x4D415048 = “MAPH”) |
| version | 4 bytes | Database format version |
| total_slots | 8 bytes | Total number of slots |
| static_slots | 8 bytes | Number of perfect hash slots |
| generation | 8 bytes | Global modification counter |
| reserved | 480 bytes | Future extensions |
| Field | Size | Description |
|---|---|---|
| hash_version | 8 bytes | Atomic 64-bit: hash (32) + version (32) |
| size | 4 bytes | Value data size |
| reserved | 4 bytes | Future use |
| data | 496 bytes | Actual value storage |
The system uses FNV-1a (Fowler-Noll-Vo) hash function for its simplicity and good distribution:
uint32_t fnv1a(const char* data, size_t len) {
uint32_t hash = 2166136261u; // FNV offset basis
for (size_t i = 0; i < len; i++) {
hash ^= data[i];
hash *= 16777619u; // FNV prime
}
return hash;
}
slot_index = hash % total_slots
For static slots (perfect hashing region):
The database uses a two-tier approach:
for (size_t i = 0; i < MAX_PROBE_DISTANCE; i++) {
uint32_t idx = (base_index + i) % total_slots;
if (slots[idx].empty() || slots[idx].hash() == target_hash) {
// Found target slot
}
}
| Operation | Thread Safety | Notes |
|---|---|---|
| get() | ✅ Safe | Lock-free reads via atomic operations |
| set() | ⚠️ Unsafe | Requires external synchronization |
| remove() | ⚠️ Unsafe | Requires external synchronization |
| scan() | ✅ Safe | Read-only operation |
| parallel_* | ✅ Safe | Internal thread management |
The hash_version field uses atomic operations for consistency:
Readers can safely access slots without locks:
The OS transparently manages data movement between tiers via the page cache.
✅ Good fit for:
❌ Not ideal for: