# Memory Access Performance Analysis

## The Zero-Overhead Claim
**Claim:** Shared memory reads are as fast as normal heap or stack array reads.

**Evidence:** Benchmarked on an Intel i7-12700K @ 5.0 GHz with 32 GB DDR5-5600.
```text
=== Read Performance Benchmark ===
Array size:  10000 integers
Iterations:  1000000

Sequential Read Performance
---------------------------
Heap array (sequential):               2.324 ns/operation
Stack array (sequential):              2.654 ns/operation
Shared array operator[] (sequential):  2.318 ns/operation   ← Identical!
Shared array raw pointer (sequential): 2.316 ns/operation   ← Direct access

Random Access Performance
---------------------------
Heap array (random):                   2.327 ns/operation
Shared array operator[] (random):      2.326 ns/operation   ← Same penalty
Shared array raw pointer (random):     2.378 ns/operation
```
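
These figures come from a tight read loop over each backing store. A minimal sketch of how such a comparison can be reproduced is shown below; the `do_not_optimize` helper and the exact loop structure are illustrative, not the original harness.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Keep the compiler from deleting the read loop as dead code.
template <typename T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "g"(value) : "memory");
}

// Time sequential reads through any int* (heap, stack, or mmap'd shared memory).
double ns_per_read(const int* data, std::size_t size, std::size_t iterations) {
    std::uint64_t sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        sum += data[i % size];                       // wrapping sequential read
    auto stop = std::chrono::steady_clock::now();
    do_not_optimize(sum);
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}

int main() {
    std::vector<int> heap(10'000, 1);
    std::printf("heap: %.3f ns/read\n", ns_per_read(heap.data(), heap.size(), 1'000'000));
    // Point the same function at a shared-memory region's base pointer to compare.
}
```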
## Why It's So Fast

1. **Memory Hierarchy - Identical Path**
```text
CPU Register   (0 cycles)
      ↓
L1 Cache       (4 cycles,   ~0.8 ns)
      ↓
L2 Cache       (12 cycles,  ~2.4 ns)
      ↓
L3 Cache       (42 cycles,  ~8.4 ns)
      ↓
Main Memory    (200+ cycles, ~40 ns)
```

Both heap and shared-memory reads traverse exactly the same hierarchy; once the page is mapped into the process, the hardware cannot tell them apart.
2. **Assembly Analysis**

```asm
; Normal array access
mov  rax, QWORD PTR [rbp-24]     ; load base pointer
mov  edx, DWORD PTR [rbp-28]     ; load index
mov  eax, DWORD PTR [rax+rdx*4]  ; read array[index]

; Shared memory access (after setup)
mov  rax, QWORD PTR [rbp-32]     ; load base pointer
mov  edx, DWORD PTR [rbp-36]     ; load index
mov  eax, DWORD PTR [rax+rdx*4]  ; read array[index] - IDENTICAL!
```
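
The loads above are what an unoptimized build emits for either access; the C++ that produces them is sketched below for reference (function and variable names are illustrative). Whether the base pointer came from `new` or from `mmap` of a shared segment is invisible to the generated code.

```cpp
// Indexed read through a heap-allocated array.
int read_heap(const int* heap_array, int index) {
    return heap_array[index];   // mov eax, DWORD PTR [rax+rdx*4]
}

// Indexed read through a pointer into a mapped shared-memory region.
int read_shared(const int* shm_base, int index) {
    return shm_base[index];     // identical instruction sequence
}
```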
3. **Cache Line Behavior**
- 64-byte cache lines loaded identically
- Hardware prefetching works the same
- Spatial locality preserved
- Temporal locality preserved
## Lock-Free Performance

### Atomic Operations Timing
| Operation         | x86-64 Cycles | Time @ 5 GHz | Notes                 |
|-------------------|---------------|--------------|-----------------------|
| Load (relaxed)    | 1             | 0.2 ns       | Same as normal load   |
| Store (relaxed)   | 1             | 0.2 ns       | Same as normal store  |
| CAS (uncontended) | 10-20         | 2-4 ns       | `lock cmpxchg`        |
| CAS (contended)   | 100-300       | 20-60 ns     | Cache line ping-pong  |
| Fetch-add         | 10-20         | 2-4 ns       | `lock xadd`           |
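
These operations map directly onto `std::atomic`; the sketch below shows the C++ spellings that compile to the instructions in the table (memory orders chosen for illustration only).

```cpp
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> counter{0};

void atomic_examples(std::uint64_t value) {
    // Relaxed load/store: a plain mov on x86-64, same cost as a normal access.
    std::uint64_t v = counter.load(std::memory_order_relaxed);
    counter.store(value, std::memory_order_relaxed);

    // Fetch-add: compiles to lock xadd.
    counter.fetch_add(1, std::memory_order_relaxed);

    // CAS: compiles to lock cmpxchg. Cheap when uncontended, expensive when
    // the cache line ping-pongs between cores.
    std::uint64_t expected = v;
    counter.compare_exchange_strong(expected, v + 1, std::memory_order_acq_rel);
}
```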
### Queue Performance

```text
Single producer/consumer:    8-12 ns per operation
Multiple producers (4):      25-40 ns per operation (contention)
Batch operations (n = 100):  2-3 ns amortized per item
```
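
The single-producer/single-consumer figure reflects that an SPSC ring needs only one acquire/release pair per operation and no CAS at all. The sketch below shows the general shape of such a ring; it is not the library's actual queue implementation, whose layout and API may differ.

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring. Capacity must be a power of two.
template <typename T, std::size_t Capacity>
struct spsc_ring {
    alignas(64) std::atomic<std::size_t> head{0};  // advanced by the consumer
    alignas(64) std::atomic<std::size_t> tail{0};  // advanced by the producer
    T slots[Capacity];

    bool push(const T& item) {                     // producer thread only
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == Capacity)
            return false;                          // full
        slots[t % Capacity] = item;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {                             // consumer thread only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                          // empty
        out = slots[h % Capacity];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```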
### Object Pool Performance

```text
Object pool acquire():  10-15 ns      ← lock-free stack
malloc():               40-80 ns      ← system allocator
new T():                45-85 ns      ← C++ allocator
mmap():                 500-1000 ns   ← system call
```
## Optimization Techniques

1. **Cache Line Alignment**

```cpp
#include <atomic>
#include <cstdint>

struct alignas(64) CacheAligned {
    std::atomic<std::uint64_t> counter;
    char padding[56];   // fill the rest of the 64-byte cache line
};
```
**Impact:** 10-50x improvement under contention
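
Most of that 10-50x comes from eliminating false sharing: when each writer's counter owns a full cache line, concurrent writers stop invalidating each other's lines. A sketch of per-thread counters laid out both ways (the structure names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Bad: four "independent" counters share one 64-byte line, so four writers
// continuously steal the line from each other (false sharing).
struct packed_counters {
    std::atomic<std::uint64_t> c[4];
};

// Good: each counter owns a full cache line; writers never interfere.
struct alignas(64) padded_counter {
    std::atomic<std::uint64_t> value{0};
    char padding[56];
};

padded_counter per_thread[4];                       // one slot per writer thread

void hot_loop(int thread_id) {
    for (int i = 0; i < 1'000'000; ++i)
        per_thread[thread_id].value.fetch_add(1, std::memory_order_relaxed);
}
```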
2. **Huge Pages (2 MB / 1 GB)**

```bash
# Enable huge pages
echo 1024 > /proc/sys/vm/nr_hugepages

# Mount hugetlbfs
mount -t hugetlbfs none /mnt/hugepages
```

```c
/* Use the MAP_HUGETLB flag */
mmap(NULL, size, PROT_READ | PROT_WRITE,
     MAP_SHARED | MAP_HUGETLB, fd, 0);
```
**Impact:**
- Reduces TLB misses by up to 512x (one 2 MB page covers 512 x 4 KB pages)
- 5-15% performance improvement for large datasets
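
Assuming huge pages have been reserved as above, a self-contained way to back a shared region with explicit 2 MB pages on Linux (kernel ≥ 4.14, glibc ≥ 2.27) is `memfd_create` with `MFD_HUGETLB`; the helper name below is illustrative.

```cpp
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

// Map `size` bytes of shared memory backed by explicit 2 MB huge pages.
// `size` must be a multiple of the huge page size.
void* map_huge_shared(std::size_t size) {
    int fd = memfd_create("shm_region", MFD_CLOEXEC | MFD_HUGETLB);
    if (fd < 0) { std::perror("memfd_create"); return nullptr; }

    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        std::perror("ftruncate");
        close(fd);
        return nullptr;
    }

    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_HUGETLB, fd, 0);
    close(fd);                        // the mapping keeps the memory alive
    return addr == MAP_FAILED ? nullptr : addr;
}
```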
3. **NUMA Awareness**

```c
numa_set_localalloc();                                  /* allocate on the calling thread's node */
numa_tonode_memory(addr, size, numa_node_of_cpu(cpu));  /* bind a region's pages to a node */
int distance = numa_distance(node1, node2);             /* relative distance between nodes */
```
**Impact:**
- Local access: ~50ns
- Remote access: ~100-150ns
- 2-3x penalty for remote NUMA access
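
A slightly fuller sketch of node-local placement with libnuma (link with `-lnuma`; the `alloc_local` helper is illustrative and error handling is minimal):

```cpp
#include <numa.h>
#include <sched.h>
#include <cstddef>

// Allocate a buffer on the NUMA node the calling thread currently runs on,
// so readers and writers on that node pay the ~50 ns local latency.
void* alloc_local(std::size_t size) {
    if (numa_available() < 0)
        return nullptr;                        // no NUMA support on this kernel
    int node = numa_node_of_cpu(sched_getcpu());
    return numa_alloc_onnode(size, node);      // release later with numa_free(ptr, size)
}
```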
4. **Prefetching**

```c
for (int i = 0; i < n; i++) {
    if (i + 8 < n)                                         /* stay in bounds near the end */
        __builtin_prefetch(&array[indices[i + 8]], 0, 1);  /* prefetch 8 iterations ahead */
    process(array[indices[i]]);
}
```

**Impact:** 20-40% improvement for random access patterns
## Real-World Benchmarks

### Particle Simulation (100K particles)

```text
Traditional (message passing):
  Serialize particles:   850 µs
  Send via socket:       420 µs
  Deserialize:           780 µs
  Total:                2050 µs

Shared memory:
  Write to shm_array:     12 µs   ← 170x faster than the full traditional path!
  Read from shm_array:     8 µs
  Total:                  20 µs   ← ~100x faster end to end
```
### Sensor Data Pipeline (1 MHz sampling)

```text
Traditional (pipes):
  Max throughput:  50K samples/sec
  Latency:         20-50 µs
  CPU usage:       45%

Shared memory (ring buffer):
  Max throughput:  10M samples/sec   ← 200x higher!
  Latency:         50-100 ns         ← ~400x lower!
  CPU usage:       8%
```
## Memory Overhead

### Table Size Configurations

| Configuration                 | Table Overhead | Use Case             |
|-------------------------------|----------------|----------------------|
| `shm_table_small` (16, 16)    | 904 bytes      | Embedded, minimal    |
| `shm_table` (32, 64)          | 4,168 bytes    | Default, balanced    |
| `shm_table_large` (64, 256)   | 26,632 bytes   | Complex simulations  |
| `shm_table_huge` (256, 1024)  | 422,920 bytes  | Maximum flexibility  |
### Per-Structure Overhead

- Fixed-size array in shared memory with zero-overhead access
- Shared memory atomic value with auto-discovery
- High-performance object pool for shared memory
- Lock-free circular queue for shared memory IPC
- Lock-free ring buffer for high-throughput streaming data
## Scalability Analysis

### Process Scaling

| Readers | Throughput (ops/sec) | Scaling |
|---------|----------------------|---------|
| 1       | 450M                 | 1.00x   |
| 2       | 890M                 | 1.98x   |
| 4       | 1750M                | 3.89x   |
| 8       | 3400M                | 7.56x   |
| 16      | 6200M                | 13.8x   |

Near-linear scaling for read-heavy workloads!
### Contention Characteristics

| Writers | Queue Throughput | Array Writes  |
|---------|------------------|---------------|
| 1       | 120M ops/sec     | 450M ops/sec  |
| 2       | 95M ops/sec      | 380M ops/sec  |
| 4       | 70M ops/sec      | 290M ops/sec  |
| 8       | 45M ops/sec      | 180M ops/sec  |
## Platform-Specific Notes

### Linux

- Best performance with `MAP_POPULATE`
- Use `madvise(MADV_HUGEPAGE)` for THP
- Consider `memfd_create()` for anonymous shared memory
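
A combined sketch of these Linux-specific knobs for a plain anonymous shared mapping (the helper name is illustrative; `MADV_HUGEPAGE` is only a hint and depends on the kernel's THP settings):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Prefault the whole mapping up front (MAP_POPULATE) and ask for THP.
void* map_shared_linux(std::size_t size) {
    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (addr == MAP_FAILED)
        return nullptr;
    madvise(addr, size, MADV_HUGEPAGE);    // transparent huge page hint
    return addr;
}
```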
### macOS

- Limited to 4 GB shared memory by default
- Increase with the `kern.sysv.shmmax` sysctl
- No huge page support
### FreeBSD

- Excellent performance with `minherit(INHERIT_SHARE)`
- Superpage support via `mmap(MAP_ALIGNED_SUPER)`
## Profiling & Tuning

### Key Metrics to Monitor

- **Cache misses:** `perf stat -e cache-misses,cache-references ./app`
- **TLB misses:** `perf stat -e dTLB-load-misses ./app`
- **False sharing:** `perf c2c record ./app`, then `perf c2c report`
- **Lock contention:** `perf record -e lock:* ./app`, then `perf report`
## Best Practices Summary

✅ **DO:**

- Align structures to cache lines (64 bytes)
- Use huge pages for datasets larger than 10 MB
- Batch operations when possible
- Profile with `perf` on Linux
- Consider NUMA topology

❌ **DON'T:**

- Share cache lines between writers
- Use atomic operations unnecessarily
- Assume uniform memory access on NUMA systems
- Forget to handle page faults gracefully
- Mix frequently and infrequently accessed data