maph
Space-efficient approximate mappings using perfect hash functions. Supports arbitrary function approximation (X→Y) with configurable storage (8/16/32/64-bit) and accuracy trade-offs.
Resources & Distribution
Source Code
Package Registries
Publications
maph - Memory-Mapped Approximate Perfect Hash
A high-performance, memory-mapped database and generalized framework for space-efficient approximate data structures with O(1) lookups. Built on perfect hash functions, maph provides sub-microsecond access to arbitrary mappings (f: X → Y) with configurable storage and custom decoders.
Key Features
- O(1) Lookups: Guaranteed constant-time operations with perfect hash optimization
- Memory-Mapped Storage: Zero-copy access via mmap for minimal memory overhead
- Lock-Free Operations: Atomic operations for thread-safe concurrent access
- Generalized Framework: Support for arbitrary mappings X → Y, not just membership
- JSON Database: Built-in daemon for JSON key-value storage with REST API
- Configurable Storage: 8/16/32/64-bit storage options for space/accuracy trade-offs
- Custom Decoders: Extensible decoder system for specialized applications
- SIMD Optimized: AVX2 acceleration for batch operations
Performance
| Operation | Throughput | Latency | Notes |
|---|---|---|---|
| GET | 10M ops/sec | <100ns | O(1) with perfect hash |
| SET | 8M ops/sec | <150ns | Lock-free atomic updates |
| Batch GET | 50M ops/sec | - | SIMD optimized |
| Batch SET | 40M ops/sec | - | Parallel processing |
| Scan | 100M items/sec | - | Sequential memory access |
Quick Start
C++ Library
#include <maph.hpp>
// Create a new maph store
auto store = maph::Maph::create("mystore.maph", 1000000);
// Store JSON key-value pairs
store->set(R"({"user": "alice"})", R"({"score": 95})");
// Retrieve with O(1) lookup
auto value = store->get(R"({"user": "alice"})");
if (value) {
std::cout << *value << std::endl; // {"score": 95}
}
// Batch operations for high throughput
std::vector<std::pair<JsonView, JsonView>> batch = {
{R"({"id": 1})", R"({"name": "item1"})"},
{R"({"id": 2})", R"({"name": "item2"})"}
};
store->mset(batch);
JSON Database Server
# Start the maph daemon
./maph_daemon --port 8080 --threads 4
# CLI operations
./maph_cli set mystore '{"key": "user:1"}' '{"name": "Alice", "age": 30}'
./maph_cli get mystore '{"key": "user:1"}'
./maph_cli scan mystore --limit 100
# REST API
curl -X POST http://localhost:8080/stores -d '{"name": "products", "slots": 1000000}'
curl -X PUT 'http://localhost:8080/stores/products/keys/"sku:12345"' \
-d '{"name": "Laptop", "price": 999}'
curl 'http://localhost:8080/stores/products/keys/"sku:12345"'
# Optimize for perfect hash (O(1) guaranteed)
curl -X POST http://localhost:8080/stores/products/optimize
Architecture
The system is built in layers:
- Core Library (
include/maph.hpp): Memory-mapped storage with atomic operations - Approximate Maps (
include/maph/approximate_map.hpp): Generalized X→Y mappings - Decoders (
include/maph/decoders.hpp): Configurable output transformations - Applications:
- CLI tool (
maph_cli): Command-line interface - Daemon (
maph_daemon): Standalone JSON database server - REST API (
integrations/rest_api): HTTP interface with web UI
- CLI tool (
Storage Layout
File Structure:
┌────────────────┐
│ Header (512B) │ Magic, version, slot counts, generation
├────────────────┤
│ Slot 0 (512B) │ Hash (32b) + Version (32b) + Data (504B)
├────────────────┤
│ Slot 1 (512B) │
├────────────────┤
│ ... │
└────────────────┘
Building
# Clone the repository
git clone https://github.com/yourusername/maph.git
cd maph
# Build everything
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DBUILD_TESTS=ON \
-DBUILD_REST_API=ON \
-DBUILD_EXAMPLES=ON
make -j$(nproc)
# Run tests
ctest --verbose
# Install
sudo make install
CMake Options
| Option | Default | Description |
|---|---|---|
| BUILD_TESTS | OFF | Build test suite |
| BUILD_EXAMPLES | OFF | Build example programs |
| BUILD_REST_API | OFF | Build REST API server |
| BUILD_DOCS | OFF | Generate documentation |
Advanced Usage
Custom Decoders
Create specialized mappings for your use case:
// Threshold filter - maps values to boolean based on threshold
template <typename StorageType>
struct ThresholdDecoder {
using output_type = bool;
float threshold;
bool operator()(StorageType value, StorageType max_val) const {
return (float(value) / max_val) > threshold;
}
};
// Quantization decoder - maps to discrete levels
template <typename StorageType>
struct QuantizeDecoder {
using output_type = int;
int levels;
int operator()(StorageType value, StorageType max_val) const {
return (value * levels) / (max_val + 1);
}
};
Parallel Operations
// Parallel batch processing with custom thread count
store->parallel_mset(large_batch, 8); // Use 8 threads
// Parallel scan with visitor pattern
std::atomic<size_t> count{0};
store->parallel_scan([&count](uint64_t idx, uint32_t hash, JsonView value) {
if (value.size() > 100) {
count++;
}
}, 4); // Use 4 threads
Durability Management
// Configure background sync for durability
auto manager = store->start_durability_manager(
std::chrono::seconds(5), // Sync every 5 seconds
1000 // Or after 1000 operations
);
// Manual sync
store->sync();
Storage Trade-offs
| Storage Type | Bytes/Element | False Positive Rate | Use Case |
|---|---|---|---|
| uint8_t | 1 | 0.39% | Large datasets, error-tolerant |
| uint16_t | 2 | 0.0015% | Balanced accuracy/space |
| uint32_t | 4 | ~0% | High accuracy required |
| uint64_t | 8 | 0% | Perfect accuracy, cryptographic |
Comparison with Other Systems
| System | Lookup | Space | Persistence | Concurrent | Use Case |
|---|---|---|---|---|---|
| maph | O(1) | Optimal | mmap | Lock-free | General K/V, high-perf |
| Redis | O(1) avg | 10x overhead | AOF/RDB | Single-thread | Cache, sessions |
| RocksDB | O(log n) | Compressed | SST files | Multi-thread | Large persistent data |
| Memcached | O(1) avg | In-memory | None | Multi-thread | Simple cache |
| Bloom filter | O(1) | Optimal | Custom | Read-only | Membership only |
Use Cases
- High-Performance Cache: Sub-microsecond lookups for hot data
- Feature Stores: ML feature serving with guaranteed latency
- Session Storage: Web session data with instant access
- Approximate Filters: Bloom filter alternative with configurable accuracy
- Rate Limiting: Token bucket implementation with O(1) checks
- Deduplication: Fast duplicate detection in data streams
- Graph Databases: Adjacency list storage with perfect hashing
- Time-Series Index: Fast timestamp to data mapping
API Documentation
Core API
// Create/Open operations
static std::unique_ptr<Maph> create(path, total_slots, static_slots = 0);
static std::unique_ptr<Maph> open(path, readonly = false);
// Basic operations
std::optional<JsonView> get(JsonView key);
bool set(JsonView key, JsonView value);
bool remove(JsonView key);
bool exists(JsonView key);
// Batch operations
void mget(keys, callback);
size_t mset(key_value_pairs);
// Parallel operations
void parallel_mget(keys, callback, threads = 0);
size_t parallel_mset(pairs, threads = 0);
void parallel_scan(visitor, threads = 0);
// Utilities
Stats stats();
void sync();
void close();
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Key areas for contribution:
- Perfect hash function implementations
- Additional decoders for specialized use cases
- Language bindings (Python, Rust, Go, Java)
- Performance optimizations
- Documentation and examples
Publications
The theoretical foundation for this work:
@inproceedings{rdphfilter2024,
title={Rate-Distorted Perfect Hash Filters},
author={...},
booktitle={...},
year={2024}
}
License
MIT License - see LICENSE for details.
Acknowledgments
Built on research in perfect hash functions, approximate data structures, and space-efficient algorithms. Special thanks to the authors of CHD, BBHash, and other minimal perfect hash algorithms that inspired this work.
Related Resources
Explore related blog posts, projects, and publications