Performance Guide¶
Memory Access Performance Analysis¶
The Zero-Overhead Claim¶
Claim: Shared memory reads are as fast as normal array reads.
Evidence: benchmarked on an Intel i7-12700K @ 5.0 GHz, 32 GB DDR5-5600
```
=== Read Performance Benchmark ===
Array size: 10000 integers
Iterations: 1000000

Sequential Read Performance:
-----------------------------
Heap array (sequential):                 2.324 ns/operation
Stack array (sequential):                2.654 ns/operation
Shared array operator[] (sequential):    2.318 ns/operation  ← Identical!
Shared array raw pointer (sequential):   2.316 ns/operation  ← Direct access

Random Access Performance:
-----------------------------
Heap array (random):                     2.327 ns/operation
Shared array operator[] (random):        2.326 ns/operation  ← Same penalty
Shared array raw pointer (random):       2.378 ns/operation
```
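A minimal sketch of how numbers like these can be reproduced. It is not the original benchmark: an anonymous `MAP_SHARED` mapping stands in for the library's shm-backed array, and the file/variable names are illustrative.

```cpp
// bench_reads.cpp -- compile with: g++ -O2 bench_reads.cpp
#include <chrono>
#include <cstdio>
#include <vector>
#include <sys/mman.h>

volatile long long sink;   // keeps the compiler from deleting the read loops

int main() {
    constexpr size_t N = 10'000;             // array size from the benchmark above
    constexpr size_t iterations = 1'000'000; // total reads per measurement

    std::vector<int> heap(N, 1);             // ordinary heap array
    // Anonymous shared mapping stands in for the shm-backed array.
    int* shared = static_cast<int*>(mmap(nullptr, N * sizeof(int),
                                         PROT_READ | PROT_WRITE,
                                         MAP_SHARED | MAP_ANONYMOUS, -1, 0));
    for (size_t i = 0; i < N; ++i) shared[i] = 1;

    auto time_reads = [&](const int* p, const char* label) {
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t rep = 0; rep < iterations / N; ++rep)
            for (size_t i = 0; i < N; ++i)
                sum += p[i];                 // sequential reads
        auto t1 = std::chrono::steady_clock::now();
        sink = sum;
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%-32s %.3f ns/operation\n", label, ns / iterations);
    };

    time_reads(heap.data(), "Heap array (sequential):");
    time_reads(shared,      "Shared mapping (sequential):");
    munmap(shared, N * sizeof(int));
}
```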
Why It's So Fast¶
1. Memory Hierarchy - Identical Path¶
```
CPU Register   (0 cycles)
      ↓
L1 Cache       (4 cycles,    ~0.8 ns)
      ↓
L2 Cache       (12 cycles,   ~2.4 ns)
      ↓
L3 Cache       (42 cycles,   ~8.4 ns)
      ↓
Main Memory    (200+ cycles, ~40 ns)
```
Both heap and shared memory follow the exact same path.
2. Assembly Analysis¶
```asm
; Normal array access
mov rax, QWORD PTR [rbp-24]      ; Load base pointer
mov edx, DWORD PTR [rbp-28]      ; Load index
mov eax, DWORD PTR [rax+rdx*4]   ; Read array[index]

; Shared memory access (after setup)
mov rax, QWORD PTR [rbp-32]      ; Load base pointer
mov edx, DWORD PTR [rbp-36]      ; Load index
mov eax, DWORD PTR [rax+rdx*4]   ; Read array[index] - IDENTICAL!
```
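The rbp-relative addressing above is what an unoptimized build produces. A pair of functions like the following, compiled with `g++ -S -O0`, yields essentially the same load sequence for both paths (function and parameter names are illustrative):

```cpp
// access.cpp -- inspect the output of: g++ -S -O0 access.cpp
int read_heap(const int* array, int index) {
    return array[index];         // ordinary base + index*4 load
}

int read_shared(const int* shm_base, int index) {
    return shm_base[index];      // same load through a pointer returned by mmap()
}
```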
3. Cache Line Behavior¶
- 64-byte cache lines loaded identically
- Hardware prefetching works the same
- Spatial locality preserved
- Temporal locality preserved
Lock-Free Performance¶
Atomic Operations Timing¶
| Operation | x86-64 Cycles | Time (5GHz) | Notes |
|---|---|---|---|
| Load (relaxed) | 1 | 0.2ns | Same as normal load |
| Store (relaxed) | 1 | 0.2ns | Same as normal store |
| CAS (uncontended) | 10-20 | 2-4ns | Lock cmpxchg |
| CAS (contended) | 100-300 | 20-60ns | Cache line ping-pong |
| Fetch-Add | 10-20 | 2-4ns | Lock xadd |
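For reference, the operations in the table map onto `std::atomic` as follows; this is a generic sketch, not library-specific code:

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

void atomic_examples() {
    uint64_t v = counter.load(std::memory_order_relaxed);    // plain mov on x86-64
    counter.store(v + 1, std::memory_order_relaxed);          // plain mov

    uint64_t expected = 5;
    // compare_exchange_strong compiles to lock cmpxchg
    counter.compare_exchange_strong(expected, 6,
                                    std::memory_order_acq_rel,
                                    std::memory_order_acquire);

    counter.fetch_add(1, std::memory_order_relaxed);           // lock xadd
}
```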
Queue Performance¶
```
// Measured enqueue/dequeue pairs
Single Producer/Consumer:    8-12ns per operation
Multiple Producers (4):      25-40ns per operation (contention)
Batch Operations (n=100):    2-3ns amortized per item
```
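The amortized batch figure comes from paying one atomic reservation per batch instead of one per item. A generic sketch of that idea follows; it is not the actual shm_queue API, which may differ:

```cpp
#include <atomic>
#include <cstddef>

// Reserve `count` contiguous slots with a single fetch_add, then fill them
// with plain stores; the cost of the atomic is spread across the whole batch.
// Real queues also need a commit/publish step so consumers only see finished
// items, and a check that the buffer is not full -- both omitted here.
template <typename T, std::size_t Capacity>
struct BatchWriter {
    std::atomic<std::size_t>* tail;   // shared write index (in shared memory)
    T* slots;                         // shared storage (Capacity entries)

    void push_batch(const T* items, std::size_t count) {
        std::size_t start = tail->fetch_add(count, std::memory_order_acq_rel);
        for (std::size_t i = 0; i < count; ++i)
            slots[(start + i) % Capacity] = items[i];
    }
};
```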
Object Pool Performance¶
```
// Allocation performance vs alternatives
Object Pool acquire():   10-15ns       ← Lock-free stack
malloc():                40-80ns       ← System allocator
new T():                 45-85ns       ← operator new + constructor
mmap():                  500-1000ns    ← System call
```
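The 10-15 ns acquire cost is what one expects from a Treiber-style lock-free free list: a single CAS on an index in the uncontended case. A simplified, index-based sketch (illustrative only):

```cpp
#include <atomic>
#include <cstdint>

// Free slots form a singly linked list threaded through next[]; head holds
// the index of the first free slot (-1 == pool exhausted). Production code
// also needs an ABA tag on head; omitted here to keep the sketch short.
struct FreeList {
    std::atomic<int32_t> head{-1};
    int32_t* next;                    // one entry per slot, lives in shared memory

    int32_t acquire() {               // pop: one CAS in the uncontended case
        int32_t h = head.load(std::memory_order_acquire);
        while (h != -1 &&
               !head.compare_exchange_weak(h, next[h],
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
            // compare_exchange_weak reloaded h; loop and retry
        }
        return h;
    }

    void release(int32_t slot) {      // push the slot back onto the list
        int32_t h = head.load(std::memory_order_relaxed);
        do {
            next[slot] = h;
        } while (!head.compare_exchange_weak(h, slot,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }
};
```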
Optimization Techniques¶
1. Cache Line Alignment¶
```cpp
#include <atomic>
#include <cstdint>

struct alignas(64) CacheAligned {
    std::atomic<uint64_t> counter;
    char padding[56];   // pad to a full 64-byte cache line (prevents false sharing)
};
```
Impact: 10-50x improvement under contention
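For example, giving each writer its own CacheAligned slot keeps every hot counter on its own line; the writer count and helper names below are illustrative:

```cpp
constexpr int kMaxWriters = 8;                   // illustrative upper bound
CacheAligned per_writer[kMaxWriters];            // one 64-byte slot per writer

void record_event(int writer_id) {
    // Each writer increments only its own counter, so no cache line
    // ping-pongs between cores even under heavy write load.
    per_writer[writer_id].counter.fetch_add(1, std::memory_order_relaxed);
}

uint64_t total_events() {                        // reader sums the slots lazily
    uint64_t sum = 0;
    for (const auto& slot : per_writer)
        sum += slot.counter.load(std::memory_order_relaxed);
    return sum;
}
```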
2. Huge Pages (2MB/1GB)¶
```bash
# Enable huge pages (reserve 1024 × 2 MB pages)
echo 1024 > /proc/sys/vm/nr_hugepages

# Mount hugetlbfs
mount -t hugetlbfs none /mnt/hugepages
```

```cpp
// Use the MAP_HUGETLB flag when mapping
void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_HUGETLB, fd, 0);
```
Impact:
- Each TLB entry covers 512× more memory (4 KB → 2 MB), sharply reducing TLB misses
- 5-15% performance improvement for large datasets
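`MAP_HUGETLB` fails outright when no huge pages are reserved, so a fallback path is common. A sketch, with an illustrative function name and error handling trimmed:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Try a hugetlb-backed mapping first; if no huge pages are reserved (or the
// fd is not on hugetlbfs), fall back to normal 4 KB pages and hint for THP.
void* map_shared(std::size_t size, int fd) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_HUGETLB, fd, 0);
    if (p == MAP_FAILED) {
        p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED)
            madvise(p, size, MADV_HUGEPAGE);   // hint only; effective for anon/tmpfs mappings
    }
    return p;
}
```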
3. NUMA Awareness¶
```cpp
// Pin allocations to the local NUMA node
numa_set_localalloc();
numa_tonode_memory(addr, size, numa_node_of_cpu(cpu));

// Measure inter-node distance (local: 10, remote: 20+)
int distance = numa_distance(node1, node2);
```
Impact:
- Local access: ~50ns
- Remote access: ~100-150ns
- 2-3x penalty for remote NUMA access
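A slightly fuller version of the libnuma calls above (link with -lnuma; the helper name is illustrative):

```cpp
#include <numa.h>
#include <sched.h>
#include <cstddef>

void bind_to_local_node(void* addr, std::size_t size) {
    if (numa_available() < 0)
        return;                             // kernel/libnuma without NUMA support

    int cpu  = sched_getcpu();              // CPU this thread is currently on
    int node = numa_node_of_cpu(cpu);       // its home NUMA node

    numa_set_localalloc();                  // future allocations go to the local node
    numa_tonode_memory(addr, size, node);   // move this range onto `node`
}
```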
4. Prefetching¶
```cpp
// Manual prefetching for random access patterns
for (int i = 0; i < n; i++) {
    if (i + 8 < n)                                           // stay in bounds near the end
        __builtin_prefetch(&array[indices[i + 8]], 0, 1);    // prefetch 8 iterations ahead
    process(array[indices[i]]);
}
```
Impact: 20-40% improvement for random patterns
Real-World Benchmarks¶
Particle Simulation (100K particles)¶
Traditional (message passing):
- Serialize particles: 850 µs
- Send via socket: 420 µs
- Deserialize: 780 µs
- Total: 2050 µs
Shared Memory:
- Write to shm_array: 12 µs
- Read from shm_array: 8 µs
- Total: 20 µs ← ~100x faster end to end!
Sensor Data Pipeline (1MHz sampling)¶
Traditional (pipes):
- Max throughput: 50K samples/sec
- Latency: 20-50 µs
- CPU usage: 45%
Shared Memory (ring buffer):
- Max throughput: 10M samples/sec ← 200x higher!
- Latency: 50-100 ns ← 400x lower!
- CPU usage: 8%
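Latencies in the 50-100 ns range are possible because a single-producer/single-consumer ring needs no locks at all, only one acquire/release pair per transfer. A minimal SPSC sketch follows; the library's shm_ring_buffer may differ in detail:

```cpp
#include <atomic>
#include <cstddef>

template <typename T, std::size_t Capacity>   // Capacity: power of two recommended
struct SpscRing {
    alignas(64) std::atomic<std::size_t> head{0};   // advanced by the consumer
    alignas(64) std::atomic<std::size_t> tail{0};   // advanced by the producer
    T slots[Capacity];

    bool try_push(const T& v) {                      // producer side
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == Capacity)
            return false;                            // full
        slots[t % Capacity] = v;
        tail.store(t + 1, std::memory_order_release); // publish the item
        return true;
    }

    bool try_pop(T& out) {                           // consumer side
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                            // empty
        out = slots[h % Capacity];
        head.store(h + 1, std::memory_order_release); // free the slot
        return true;
    }
};
```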
Memory Overhead¶
Table Size Configurations¶
| Configuration | Table Overhead | Use Case |
|---|---|---|
| shm_table16 (16,16) | 904 bytes | Embedded, minimal |
| shm_table (32,64) | 4,168 bytes | Default, balanced |
| shm_table256 (64,256) | 26,632 bytes | Complex simulations |
| shm_table1024 (256,1024) | 422,920 bytes | Maximum flexibility |
Per-Structure Overhead¶
```
shm_array<T>:         0 bytes  (just data)
shm_queue<T>:        16 bytes  (head + tail atomics)
shm_atomic<T>:        0 bytes  (just atomic)
shm_object_pool<T>:  12 bytes + N*4 bytes  (free list)
shm_ring_buffer<T>:  16 bytes  (read + write positions)
```
Scalability Analysis¶
Process Scaling¶
Near-linear scaling for read-heavy workloads!
Contention Characteristics¶
| Writers | Queue Throughput | Array Writes |
|---|---|---|
| 1 | 120M ops/sec | 450M ops/sec |
| 2 | 95M ops/sec | 380M ops/sec |
| 4 | 70M ops/sec | 290M ops/sec |
| 8 | 45M ops/sec | 180M ops/sec |
Platform-Specific Notes¶
Linux¶
- Best performance with `MAP_POPULATE` (pre-faults pages at mmap time)
- Use `madvise(MADV_HUGEPAGE)` for transparent huge pages (THP)
- Consider `memfd_create()` for anonymous shared memory (see the sketch below)
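A minimal `memfd_create()` example (Linux ≥ 3.17, glibc ≥ 2.27); the region name and helper function are illustrative:

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Anonymous shared memory without touching the filesystem: the fd can be
// inherited by forked children or passed over a Unix socket (SCM_RIGHTS).
int create_region(std::size_t size, void** out) {
    int fd = memfd_create("shm_region", 0);             // name is for debugging only
    if (fd < 0) return -1;
    if (ftruncate(fd, size) < 0) { close(fd); return -1; }
    *out = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return (*out == MAP_FAILED) ? -1 : fd;
}
```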
macOS¶
- Limited to 4GB shared memory by default
- Increase with the `kern.sysv.shmmax` sysctl
- No huge page support
FreeBSD¶
- Excellent performance with `minherit(INHERIT_SHARE)`
- Support for superpages via `mmap(MAP_ALIGNED_SUPER)`
Profiling & Tuning¶
Key Metrics to Monitor¶
- Cache Misses
- TLB Misses
- False Sharing
- Lock Contention
Best Practices Summary¶
✅ DO:
- Align structures to cache lines (64 bytes)
- Use huge pages for datasets > 10MB
- Batch operations when possible
- Profile with perf on Linux
- Consider NUMA topology

❌ DON'T:
- Share cache lines between writers
- Use atomic operations unnecessarily
- Assume uniform memory access on NUMA
- Forget to handle page faults gracefully
- Mix frequently and infrequently accessed data