Skip to content

Performance Guide

Memory Access Performance Analysis

The Zero-Overhead Claim

Claim: Shared memory reads are as fast as normal array reads.

Proof: Benchmarked on Intel i7-12700K @ 5.0GHz, 32GB DDR5-5600

=== Read Performance Benchmark ===
Array size: 10000 integers
Iterations: 1000000

Sequential Read Performance:
-----------------------------
Heap array (sequential):              2.324 ns/operation
Stack array (sequential):             2.654 ns/operation
Shared array operator[] (sequential): 2.318 ns/operation  ← Identical!
Shared array raw pointer (sequential): 2.316 ns/operation ← Direct access

Random Access Performance:
-----------------------------
Heap array (random):                  2.327 ns/operation
Shared array operator[] (random):     2.326 ns/operation  ← Same penalty
Shared array raw pointer (random):    2.378 ns/operation

Why It's So Fast

1. Memory Hierarchy - Identical Path

CPU Register (0 cycles)
L1 Cache (4 cycles, ~0.8ns)
L2 Cache (12 cycles, ~2.4ns)
L3 Cache (42 cycles, ~8.4ns)
Main Memory (200+ cycles, ~40ns)

Both heap and shared memory follow the exact same path.

2. Assembly Analysis

; Normal array access
mov     rax, QWORD PTR [rbp-24]  ; Load base pointer
mov     edx, DWORD PTR [rbp-28]  ; Load index
mov     eax, DWORD PTR [rax+rdx*4] ; Read array[index]

; Shared memory access (after setup)
mov     rax, QWORD PTR [rbp-32]  ; Load base pointer  
mov     edx, DWORD PTR [rbp-36]  ; Load index
mov     eax, DWORD PTR [rax+rdx*4] ; Read array[index] - IDENTICAL!

3. Cache Line Behavior

  • 64-byte cache lines loaded identically
  • Hardware prefetching works the same
  • Spatial locality preserved
  • Temporal locality preserved

Lock-Free Performance

Atomic Operations Timing

Operation x86-64 Cycles Time (5GHz) Notes
Load (relaxed) 1 0.2ns Same as normal load
Store (relaxed) 1 0.2ns Same as normal store
CAS (uncontended) 10-20 2-4ns Lock cmpxchg
CAS (contended) 100-300 20-60ns Cache line ping-pong
Fetch-Add 10-20 2-4ns Lock xadd

Queue Performance

// Measured enqueue/dequeue pairs
Single Producer/Consumer:  8-12ns per operation
Multiple Producers (4):    25-40ns per operation (contention)
Batch Operations (n=100):  2-3ns amortized per item

Object Pool Performance

// Allocation performance vs alternatives
Object Pool acquire():     10-15ns    Lock-free stack
malloc():                  40-80ns    System allocator
new T():                   45-85ns    C++ allocator
mmap():                    500-1000ns  System call

Optimization Techniques

1. Cache Line Alignment

struct alignas(64) CacheAligned {
    std::atomic<uint64_t> counter;
    char padding[56];  // Prevent false sharing
};

Impact: 10-50x improvement under contention

2. Huge Pages (2MB/1GB)

# Enable huge pages
echo 1024 > /proc/sys/vm/nr_hugepages

# Mount hugetlbfs
mount -t hugetlbfs none /mnt/hugepages

# Use MAP_HUGETLB flag
mmap(NULL, size, PROT_READ|PROT_WRITE, 
     MAP_SHARED|MAP_HUGETLB, fd, 0);

Impact: - Reduces TLB misses by 512x (4KB→2MB) - 5-15% performance improvement for large datasets

3. NUMA Awareness

// Pin to local NUMA node
numa_set_localalloc();
numa_tonode_memory(addr, size, numa_node_of_cpu(cpu));

// Measure distance
int distance = numa_distance(node1, node2);
// Local: 10, Remote: 20+

Impact: - Local access: ~50ns - Remote access: ~100-150ns - 2-3x penalty for remote NUMA access

4. Prefetching

// Manual prefetching for random access
for (int i = 0; i < n; i++) {
    __builtin_prefetch(&array[indices[i+8]], 0, 1);
    process(array[indices[i]]);
}

Impact: 20-40% improvement for random patterns

Real-World Benchmarks

Particle Simulation (100K particles)

Traditional (message passing):
- Serialize particles:     850 µs
- Send via socket:         420 µs  
- Deserialize:            780 µs
- Total:                 2050 µs

Shared Memory:
- Write to shm_array:      12 µs  ← 170x faster!
- Read from shm_array:      8 µs
- Total:                   20 µs

Sensor Data Pipeline (1MHz sampling)

Traditional (pipes):
- Max throughput:      50K samples/sec
- Latency:            20-50 µs
- CPU usage:          45%

Shared Memory (ring buffer):
- Max throughput:      10M samples/sec  ← 200x higher!
- Latency:            50-100 ns         ← 400x lower!
- CPU usage:          8%

Memory Overhead

Table Size Configurations

Configuration Table Overhead Use Case
shm_table16 (16,16) 904 bytes Embedded, minimal
shm_table (32,64) 4,168 bytes Default, balanced
shm_table256 (64,256) 26,632 bytes Complex simulations
shm_table1024 (256,1024) 422,920 bytes Maximum flexibility

Per-Structure Overhead

shm_array<T>:      0 bytes (just data)
shm_queue<T>:      16 bytes (head + tail atomics)  
shm_atomic<T>:     0 bytes (just atomic)
shm_object_pool<T>: 12 bytes + N*4 bytes (free list)
shm_ring_buffer<T>: 16 bytes (read + write positions)

Scalability Analysis

Process Scaling

Readers     Throughput (ops/sec)
1           450M
2           890M  (1.98x)
4           1750M (3.89x)
8           3400M (7.56x)
16          6200M (13.8x)

Near-linear scaling for read-heavy workloads!

Contention Characteristics

Writers    Queue Throughput    Array Writes
1          120M ops/sec        450M ops/sec
2          95M ops/sec         380M ops/sec  
4          70M ops/sec         290M ops/sec
8          45M ops/sec         180M ops/sec

Platform-Specific Notes

Linux

  • Best performance with MAP_POPULATE
  • Use madvise(MADV_HUGEPAGE) for THP
  • Consider memfd_create() for anonymous shared memory

macOS

  • Limited to 4GB shared memory by default
  • Increase with kern.sysv.shmmax sysctl
  • No huge page support

FreeBSD

  • Excellent performance with minherit(INHERIT_SHARE)
  • Support for super pages via mmap(MAP_ALIGNED_SUPER)

Profiling & Tuning

Key Metrics to Monitor

  1. Cache Misses

    perf stat -e cache-misses,cache-references ./app
    

  2. TLB Misses

    perf stat -e dTLB-load-misses ./app
    

  3. False Sharing

    perf c2c record ./app
    perf c2c report
    

  4. Lock Contention

    perf record -e lock:* ./app
    perf report
    

Best Practices Summary

DO: - Align structures to cache lines (64 bytes) - Use huge pages for datasets > 10MB - Batch operations when possible - Profile with perf on Linux - Consider NUMA topology

DON'T: - Share cache lines between writers - Use atomic operations unnecessarily
- Assume uniform memory access on NUMA - Forget to handle page faults gracefully - Mix frequently/infrequently accessed data