Performance Guide¶
Memory Access Performance Analysis¶
The Zero-Overhead Claim¶
Claim: Shared memory reads are as fast as normal array reads.
Evidence: benchmarked on an Intel i7-12700K @ 5.0 GHz, 32 GB DDR5-5600
```
=== Read Performance Benchmark ===
Array size: 10000 integers
Iterations: 1000000

Sequential Read Performance:
-----------------------------
Heap array (sequential):                 2.324 ns/operation
Stack array (sequential):                2.654 ns/operation
Shared array operator[] (sequential):    2.318 ns/operation  ← Identical!
Shared array raw pointer (sequential):   2.316 ns/operation  ← Direct access

Random Access Performance:
-----------------------------
Heap array (random):                     2.327 ns/operation
Shared array operator[] (random):        2.326 ns/operation  ← Same penalty
Shared array raw pointer (random):       2.378 ns/operation
```
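A minimal sketch of how numbers like these can be reproduced. It is not the original benchmark: an anonymous `MAP_SHARED` mapping stands in for the library's shm-backed array, and the file/variable names are illustrative.

```cpp
// bench_reads.cpp -- compile with: g++ -O2 bench_reads.cpp
#include <chrono>
#include <cstdio>
#include <vector>
#include <sys/mman.h>

volatile long long sink;   // keeps the compiler from deleting the read loops

int main() {
    constexpr size_t N = 10'000;             // array size from the benchmark above
    constexpr size_t iterations = 1'000'000; // total reads per measurement

    std::vector<int> heap(N, 1);             // ordinary heap array
    // Anonymous shared mapping stands in for the shm-backed array.
    int* shared = static_cast<int*>(mmap(nullptr, N * sizeof(int),
                                         PROT_READ | PROT_WRITE,
                                         MAP_SHARED | MAP_ANONYMOUS, -1, 0));
    for (size_t i = 0; i < N; ++i) shared[i] = 1;

    auto time_reads = [&](const int* p, const char* label) {
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t rep = 0; rep < iterations / N; ++rep)
            for (size_t i = 0; i < N; ++i)
                sum += p[i];                 // sequential reads
        auto t1 = std::chrono::steady_clock::now();
        sink = sum;
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%-32s %.3f ns/operation\n", label, ns / iterations);
    };

    time_reads(heap.data(), "Heap array (sequential):");
    time_reads(shared,      "Shared mapping (sequential):");
    munmap(shared, N * sizeof(int));
}
```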
Why It's So Fast¶
1. Memory Hierarchy - Identical Path¶
```
CPU Register   (0 cycles)
      ↓
L1 Cache       (4 cycles,    ~0.8 ns)
      ↓
L2 Cache       (12 cycles,   ~2.4 ns)
      ↓
L3 Cache       (42 cycles,   ~8.4 ns)
      ↓
Main Memory    (200+ cycles, ~40 ns)
```
Both heap and shared memory follow the exact same path.
2. Assembly Analysis¶
```asm
; Normal array access
mov rax, QWORD PTR [rbp-24]      ; Load base pointer
mov edx, DWORD PTR [rbp-28]      ; Load index
mov eax, DWORD PTR [rax+rdx*4]   ; Read array[index]

; Shared memory access (after setup)
mov rax, QWORD PTR [rbp-32]      ; Load base pointer
mov edx, DWORD PTR [rbp-36]      ; Load index
mov eax, DWORD PTR [rax+rdx*4]   ; Read array[index] - IDENTICAL!
```
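The rbp-relative addressing above is what an unoptimized build produces. A pair of functions like the following, compiled with `g++ -S -O0`, yields essentially the same load sequence for both paths (function and parameter names are illustrative):

```cpp
// access.cpp -- inspect the output of: g++ -S -O0 access.cpp
int read_heap(const int* array, int index) {
    return array[index];         // ordinary base + index*4 load
}

int read_shared(const int* shm_base, int index) {
    return shm_base[index];      // same load through a pointer returned by mmap()
}
```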
3. Cache Line Behavior¶
- 64-byte cache lines loaded identically
- Hardware prefetching works the same
- Spatial locality preserved
- Temporal locality preserved
Lock-Free Performance¶
Atomic Operations Timing¶
| Operation | x86-64 Cycles | Time (5GHz) | Notes |
|---|---|---|---|
| Load (relaxed) | 1 | 0.2ns | Same as normal load |
| Store (relaxed) | 1 | 0.2ns | Same as normal store |
| CAS (uncontended) | 10-20 | 2-4ns | Lock cmpxchg |
| CAS (contended) | 100-300 | 20-60ns | Cache line ping-pong |
| Fetch-Add | 10-20 | 2-4ns | Lock xadd |
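For reference, the operations in the table map onto `std::atomic` as follows; this is a generic sketch, not library-specific code:

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

void atomic_examples() {
    uint64_t v = counter.load(std::memory_order_relaxed);    // plain mov on x86-64
    counter.store(v + 1, std::memory_order_relaxed);          // plain mov

    uint64_t expected = 5;
    // compare_exchange_strong compiles to lock cmpxchg
    counter.compare_exchange_strong(expected, 6,
                                    std::memory_order_acq_rel,
                                    std::memory_order_acquire);

    counter.fetch_add(1, std::memory_order_relaxed);           // lock xadd
}
```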
Queue Performance¶
```
// Measured enqueue/dequeue pairs
Single Producer/Consumer:    8-12ns per operation
Multiple Producers (4):      25-40ns per operation (contention)
Batch Operations (n=100):    2-3ns amortized per item
```
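The amortized batch figure comes from paying one atomic reservation per batch instead of one per item. A generic sketch of that idea follows; it is not the actual shm_queue API, which may differ:

```cpp
#include <atomic>
#include <cstddef>

// Reserve `count` contiguous slots with a single fetch_add, then fill them
// with plain stores; the cost of the atomic is spread across the whole batch.
// Real queues also need a commit/publish step so consumers only see finished
// items, and a check that the buffer is not full -- both omitted here.
template <typename T, std::size_t Capacity>
struct BatchWriter {
    std::atomic<std::size_t>* tail;   // shared write index (in shared memory)
    T* slots;                         // shared storage (Capacity entries)

    void push_batch(const T* items, std::size_t count) {
        std::size_t start = tail->fetch_add(count, std::memory_order_acq_rel);
        for (std::size_t i = 0; i < count; ++i)
            slots[(start + i) % Capacity] = items[i];
    }
};
```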
Object Pool Performance¶
```
// Allocation performance vs alternatives
Object Pool acquire():   10-15ns       ← Lock-free stack
malloc():                40-80ns       ← System allocator
new T():                 45-85ns       ← operator new + constructor
mmap():                  500-1000ns    ← System call
```
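The 10-15 ns acquire cost is what one expects from a Treiber-style lock-free free list: a single CAS on an index in the uncontended case. A simplified, index-based sketch (illustrative only):

```cpp
#include <atomic>
#include <cstdint>

// Free slots form a singly linked list threaded through next[]; head holds
// the index of the first free slot (-1 == pool exhausted). Production code
// also needs an ABA tag on head; omitted here to keep the sketch short.
struct FreeList {
    std::atomic<int32_t> head{-1};
    int32_t* next;                    // one entry per slot, lives in shared memory

    int32_t acquire() {               // pop: one CAS in the uncontended case
        int32_t h = head.load(std::memory_order_acquire);
        while (h != -1 &&
               !head.compare_exchange_weak(h, next[h],
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
            // compare_exchange_weak reloaded h; loop and retry
        }
        return h;
    }

    void release(int32_t slot) {      // push the slot back onto the list
        int32_t h = head.load(std::memory_order_relaxed);
        do {
            next[slot] = h;
        } while (!head.compare_exchange_weak(h, slot,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }
};
```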
Optimization Techniques¶
1. Cache Line Alignment¶
```cpp
#include <atomic>
#include <cstdint>

struct alignas(64) CacheAligned {
    std::atomic<uint64_t> counter;
    char padding[56];   // pad to a full 64-byte cache line (prevents false sharing)
};
```
Impact: 10-50x improvement under contention
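For example, giving each writer its own CacheAligned slot keeps every hot counter on its own line; the writer count and helper names below are illustrative:

```cpp
constexpr int kMaxWriters = 8;                   // illustrative upper bound
CacheAligned per_writer[kMaxWriters];            // one 64-byte slot per writer

void record_event(int writer_id) {
    // Each writer increments only its own counter, so no cache line
    // ping-pongs between cores even under heavy write load.
    per_writer[writer_id].counter.fetch_add(1, std::memory_order_relaxed);
}

uint64_t total_events() {                        // reader sums the slots lazily
    uint64_t sum = 0;
    for (const auto& slot : per_writer)
        sum += slot.counter.load(std::memory_order_relaxed);
    return sum;
}
```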
2. Huge Pages (2MB/1GB)¶
```bash
# Enable huge pages (reserve 1024 × 2 MB pages)
echo 1024 > /proc/sys/vm/nr_hugepages

# Mount hugetlbfs
mount -t hugetlbfs none /mnt/hugepages
```

```cpp
// Use the MAP_HUGETLB flag when mapping
void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_HUGETLB, fd, 0);
```
Impact:
- Each TLB entry covers 512× more memory (4 KB → 2 MB), sharply reducing TLB misses
- 5-15% performance improvement for large datasets
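`MAP_HUGETLB` fails outright when no huge pages are reserved, so a fallback path is common. A sketch, with an illustrative function name and error handling trimmed:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Try a hugetlb-backed mapping first; if no huge pages are reserved (or the
// fd is not on hugetlbfs), fall back to normal 4 KB pages and hint for THP.
void* map_shared(std::size_t size, int fd) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_HUGETLB, fd, 0);
    if (p == MAP_FAILED) {
        p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED)
            madvise(p, size, MADV_HUGEPAGE);   // hint only; effective for anon/tmpfs mappings
    }
    return p;
}
```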
3. NUMA Awareness¶
```cpp
// Pin allocations to the local NUMA node
numa_set_localalloc();
numa_tonode_memory(addr, size, numa_node_of_cpu(cpu));

// Measure inter-node distance (local: 10, remote: 20+)
int distance = numa_distance(node1, node2);
```
Impact:
- Local access: ~50ns
- Remote access: ~100-150ns
- 2-3x penalty for remote NUMA access
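A slightly fuller version of the libnuma calls above (link with -lnuma; the helper name is illustrative):

```cpp
#include <numa.h>
#include <sched.h>
#include <cstddef>

void bind_to_local_node(void* addr, std::size_t size) {
    if (numa_available() < 0)
        return;                             // kernel/libnuma without NUMA support

    int cpu  = sched_getcpu();              // CPU this thread is currently on
    int node = numa_node_of_cpu(cpu);       // its home NUMA node

    numa_set_localalloc();                  // future allocations go to the local node
    numa_tonode_memory(addr, size, node);   // move this range onto `node`
}
```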
4. Prefetching¶
```cpp
// Manual prefetching for random access patterns
for (int i = 0; i < n; i++) {
    if (i + 8 < n)                                           // stay in bounds near the end
        __builtin_prefetch(&array[indices[i + 8]], 0, 1);    // prefetch 8 iterations ahead
    process(array[indices[i]]);
}
```
Impact: 20-40% improvement for random patterns
Real-World Benchmarks¶
Particle Simulation (100K particles)¶
Traditional (message passing):
- Serialize particles: 850 µs
- Send via socket: 420 µs
- Deserialize: 780 µs
- Total: 2050 µs
Shared Memory:
- Write to shm_array: 12 µs
- Read from shm_array: 8 µs
- Total: 20 µs ← ~100x faster end to end!
Sensor Data Pipeline (1MHz sampling)¶
Traditional (pipes):
- Max throughput: 50K samples/sec
- Latency: 20-50 µs
- CPU usage: 45%
Shared Memory (ring buffer):
- Max throughput: 10M samples/sec ← 200x higher!
- Latency: 50-100 ns ← 400x lower!
- CPU usage: 8%
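Latencies in the 50-100 ns range are possible because a single-producer/single-consumer ring needs no locks at all, only one acquire/release pair per transfer. A minimal SPSC sketch follows; the library's shm_ring_buffer may differ in detail:

```cpp
#include <atomic>
#include <cstddef>

template <typename T, std::size_t Capacity>   // Capacity: power of two recommended
struct SpscRing {
    alignas(64) std::atomic<std::size_t> head{0};   // advanced by the consumer
    alignas(64) std::atomic<std::size_t> tail{0};   // advanced by the producer
    T slots[Capacity];

    bool try_push(const T& v) {                      // producer side
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == Capacity)
            return false;                            // full
        slots[t % Capacity] = v;
        tail.store(t + 1, std::memory_order_release); // publish the item
        return true;
    }

    bool try_pop(T& out) {                           // consumer side
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                            // empty
        out = slots[h % Capacity];
        head.store(h + 1, std::memory_order_release); // free the slot
        return true;
    }
};
```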
Memory Overhead¶
Table Size Configurations¶
| Configuration | Table Overhead | Use Case |
|---|---|---|
| shm_table16 (16,16) | 904 bytes | Embedded, minimal |
| shm_table (32,64) | 4,168 bytes | Default, balanced |
| shm_table256 (64,256) | 26,632 bytes | Complex simulations |
| shm_table1024 (256,1024) | 422,920 bytes | Maximum flexibility |
Per-Structure Overhead¶
```
shm_array<T>:         0 bytes  (just data)
shm_queue<T>:        16 bytes  (head + tail atomics)
shm_atomic<T>:        0 bytes  (just atomic)
shm_object_pool<T>:  12 bytes + N*4 bytes  (free list)
shm_ring_buffer<T>:  16 bytes  (read + write positions)
```
Scalability Analysis¶
Process Scaling¶
Near-linear scaling for read-heavy workloads!
Contention Characteristics¶
| Writers | Queue Throughput | Array Writes |
|---|---|---|
| 1 | 120M ops/sec | 450M ops/sec |
| 2 | 95M ops/sec | 380M ops/sec |
| 4 | 70M ops/sec | 290M ops/sec |
| 8 | 45M ops/sec | 180M ops/sec |
Platform-Specific Notes¶
Linux¶
- Best performance with `MAP_POPULATE` (pre-faults pages at mmap time)
- Use `madvise(MADV_HUGEPAGE)` for transparent huge pages (THP)
- Consider `memfd_create()` for anonymous shared memory (see the sketch below)
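A minimal `memfd_create()` example (Linux ≥ 3.17, glibc ≥ 2.27); the region name and helper function are illustrative:

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Anonymous shared memory without touching the filesystem: the fd can be
// inherited by forked children or passed over a Unix socket (SCM_RIGHTS).
int create_region(std::size_t size, void** out) {
    int fd = memfd_create("shm_region", 0);             // name is for debugging only
    if (fd < 0) return -1;
    if (ftruncate(fd, size) < 0) { close(fd); return -1; }
    *out = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return (*out == MAP_FAILED) ? -1 : fd;
}
```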
macOS¶
- Limited to 4GB shared memory by default
- Increase with the `kern.sysv.shmmax` sysctl
- No huge page support
FreeBSD¶
- Excellent performance with `minherit(INHERIT_SHARE)`
- Support for superpages via `mmap(MAP_ALIGNED_SUPER)`
Profiling & Tuning¶
Key Metrics to Monitor¶
- Cache Misses
- TLB Misses
- False Sharing
- Lock Contention
Best Practices Summary¶
✅ DO:
- Align structures to cache lines (64 bytes)
- Use huge pages for datasets > 10MB
- Batch operations when possible
- Profile with perf on Linux
- Consider NUMA topology

❌ DON'T:
- Share cache lines between writers
- Use atomic operations unnecessarily
- Assume uniform memory access on NUMA
- Forget to handle page faults gracefully
- Mix frequently and infrequently accessed data