The Single-Machine Thesis¶
Inter-process shared memory on a single machine is a fundamentally different computational model than distributed systems over a network.
Abstract¶
Distributed systems research has dominated concurrent programming discourse for decades, leading developers to apply network-oriented patterns—serialization, message queues, consensus protocols—even to single-machine inter-process communication. This paper argues that shared memory IPC on a single machine represents a categorically different computational model with distinct properties: guaranteed message delivery, bounded latency, causal ordering, and the ability to share mutable state directly. Recognizing this distinction enables dramatic simplifications and orders-of-magnitude performance improvements for single-node systems.
1. The Problem with Network Thinking¶
Modern software architecture assumes network-style communication even between processes on the same machine. Consider a typical "microservices" deployment on a single server:
Process A → HTTP → localhost → HTTP → Process B
↓
[JSON serialization]
[TCP/IP stack]
[Context switches]
[Deserialization]
This pattern, borrowed from distributed systems, introduces unnecessary overhead:
- Serialization: Converting structured data to bytes and back
- Protocol overhead: HTTP headers, TCP handshakes, checksums
- Kernel involvement: Socket buffers, network stack processing
- Latency: Hundreds of microseconds for a localhost round-trip
But these costs exist to solve distributed systems problems that don't exist on a single machine.
2. What Networks Must Handle¶
Distributed systems face fundamental challenges that shape their design:
| Challenge | Network Reality | Single Machine Reality |
|---|---|---|
| Delivery | Messages may be lost | Writes to RAM always succeed |
| Ordering | Packets arrive out of order | Memory operations have defined ordering |
| Latency | Unbounded, variable (ms to seconds) | Bounded, predictable (ns to μs) |
| Failure | Partial failure possible | Processes fail together or not at all |
| State | Must serialize everything | Can share pointers directly |
| Consensus | Requires complex protocols | Atomics provide it natively |
The CAP theorem, Paxos, Raft, vector clocks, and eventual consistency all address problems that simply don't exist when processes share RAM on a single machine.
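To make the last row of the table concrete: "leader election" among processes on one machine reduces to a single compare-and-swap on a shared atomic. Below is a minimal sketch, assuming the atomic lives in a segment every participant has already mapped; the Shared struct and claim_leadership function are invented for illustration, not part of any existing API.
#include <atomic>
#include <unistd.h>

// Placed at a known offset in a shared memory segment mapped by all
// participants. A pid_t-sized atomic is lock-free (and therefore usable
// across processes) on mainstream platforms.
struct Shared {
    std::atomic<pid_t> leader{0};   // 0 means "no leader yet"
};

// Each process calls this once; exactly one CAS can succeed.
// No ballots, no terms, no quorum: the cache-coherence protocol
// serializes the competing writes.
bool claim_leadership(Shared* shm) {
    pid_t expected = 0;
    return shm->leader.compare_exchange_strong(expected, getpid(),
                                               std::memory_order_acq_rel);
}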
3. Properties of Single-Machine Shared Memory¶
3.1 Reliable Delivery¶
When Process A writes to shared memory address 0x7f00..., Process B will see that write. There is no packet loss, no network partition, no Byzantine failure. The write either happens (and is visible) or the entire machine has failed.
This enables lock-free algorithms that would be impossible over a network:
// This CAS operation has exactly three outcomes:
// 1. Succeeds (value was exchanged)
// 2. Fails (value was different, retry)
// 3. Process crashes (entire machine state undefined)
//
// There is no "timeout", no "partial success", no "unknown state"
while (!state.compare_exchange_weak(expected, desired)) {
    // On failure, compare_exchange_weak has already reloaded the
    // current value into `expected`, so we simply retry.
}
3.2 Bounded Latency¶
Network latency is fundamentally unbounded—a packet might arrive in 1ms or 10 seconds. This uncertainty infects all distributed algorithms:
Network: "Wait for response (timeout after ???)"
"If timeout, was the message lost? Delivered? Acted upon?"
Shared memory latency, by contrast, is bounded by the hardware: an L1 cache hit costs on the order of a nanosecond, and even a cache-line transfer between cores completes in tens to a few hundred nanoseconds. There is no code path by which a read or write can silently take seconds.
This bounded latency enables real-time guarantees impossible in distributed systems.
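A rough way to observe the bound directly is to ping-pong on a single atomic between a parent and a forked child that share an anonymous mapping. This is a Linux/POSIX measurement sketch (error handling omitted); the exact figure depends on the CPU and core placement, but it stays in the nanosecond range.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    // One shared atomic, visible to parent and child via an anonymous
    // MAP_SHARED mapping inherited across fork().
    void* mem = mmap(nullptr, sizeof(std::atomic<int>), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    auto* turn = new (mem) std::atomic<int>(0);

    constexpr int kIters = 1'000'000;
    if (fork() == 0) {                                   // child: echo every ping
        for (int i = 0; i < kIters; ++i) {
            while (turn->load(std::memory_order_acquire) != 1) { /* spin */ }
            turn->store(0, std::memory_order_release);
        }
        _exit(0);
    }

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {                   // parent: ping, wait for pong
        turn->store(1, std::memory_order_release);
        while (turn->load(std::memory_order_acquire) != 0) { /* spin */ }
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    wait(nullptr);
    std::printf("cross-process round trip: %lld ns\n",
                static_cast<long long>(
                    std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed)
                        .count() / kIters));
}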
3.3 Causal Ordering¶
Modern CPUs provide memory ordering guarantees. With appropriate fences, if Process A writes X then Y, Process B observing Y will also observe X:
// Producer
data = 42; // Write data
std::atomic_thread_fence(memory_order_release);
flag.store(true, memory_order_relaxed); // Signal ready
// Consumer
while (!flag.load(memory_order_relaxed));
std::atomic_thread_fence(memory_order_acquire);
// data is guaranteed to be 42 here
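The same guarantee is more often written with the ordering attached to the flag operations themselves; for this pattern, the fence-based form above and the release-store/acquire-load form below are equivalent (declarations added here to make the fragment self-contained):
#include <atomic>

int data = 0;
std::atomic<bool> flag{false};

// Producer thread
void produce() {
    data = 42;
    flag.store(true, std::memory_order_release);   // publishes the write to data
}

// Consumer thread
void consume() {
    while (!flag.load(std::memory_order_acquire)) { /* spin */ }
    // data is guaranteed to read as 42 here
}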
No vector clocks needed. No Lamport timestamps. The hardware provides ordering.
3.4 Shared Mutable State¶
The most profound difference: processes can share pointers to the same memory. Not copies. Not serialized representations. The actual bits.
// Process A
SharedArray<double>& temps = *open_array<double>("/sensors");
temps[0] = 20.5; // Direct write to shared memory

// Process B (simultaneously)
SharedArray<double>& temps = *open_array<double>("/sensors");
double t = temps[0]; // Reads 20.5 (or a concurrently written value)
This is impossible over a network. You cannot dereference a pointer across TCP.
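The SharedArray/open_array calls above stand in for whatever library is in use. Underneath, the mechanism on POSIX systems is a named shared memory object mapped into both address spaces; a minimal sketch follows, with error handling omitted and the open_sensor_array helper invented for the example.
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Both processes run the same few calls; the creator sizes the object,
// later openers just map it. (Link with -lrt on older glibc.)
double* open_sensor_array(std::size_t count) {
    int fd = shm_open("/sensors", O_CREAT | O_RDWR, 0600); // named object, e.g. /dev/shm/sensors
    ftruncate(fd, count * sizeof(double));                 // set the size
    void* p = mmap(nullptr, count * sizeof(double),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                             // the mapping stays valid
    return static_cast<double*>(p);
}

// Process A:  open_sensor_array(64)[0] = 20.5;
// Process B:  double t = open_sensor_array(64)[0];        // same physical pages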
4. What This Enables¶
4.1 Zero-Copy Communication¶
No serialization. No copying. Process A's write is Process B's read:
Traditional IPC:
Process A memory → serialize → kernel buffer → deserialize → Process B memory
Latency: 50-500 μs
Shared Memory:
Process A memory == Process B memory (same physical pages)
Latency: 50-100 ns
A 1000x improvement isn't an optimization—it's a different computational model.
4.2 Lock-Free Everything¶
Network communication requires request-response patterns. Shared memory enables true lock-free algorithms:
// Lock-free single-producer push into a ring buffer in shared memory
bool push(T value) {
    uint32_t tail = tail_.load(memory_order_relaxed);
    uint32_t next = (tail + 1) % capacity;
    if (next == head_.load(memory_order_acquire)) return false; // full
    buffer_[tail] = value;
    tail_.store(next, memory_order_release); // publish the slot to the consumer
    return true;
}
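For completeness, a matching consumer-side pop under the same assumptions (one producer, one consumer, and the same head_/tail_/buffer_/capacity members as in the push above); a sketch, not necessarily how any particular library writes it:
// Lock-free single-consumer pop: mirrors push() with the acquire/release roles swapped
bool pop(T& out) {
    uint32_t head = head_.load(memory_order_relaxed);            // only the consumer writes head_
    if (head == tail_.load(memory_order_acquire)) return false;  // empty
    out = buffer_[head];            // safe: the slot was published by the release store of tail_
    head_.store((head + 1) % capacity, memory_order_release);    // hand the slot back to the producer
    return true;
}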
No locks. No blocking. No coordination protocol. Just atomic operations.
4.3 Direct Synchronization Primitives¶
Mutexes, semaphores, barriers, and condition variables can operate directly on shared memory without kernel involvement:
// Cross-process mutex using shared memory
void lock() {
    while (locked_.exchange(true, memory_order_acquire)) {
        // Spin or yield
    }
}

void unlock() {
    locked_.store(false, memory_order_release);
}
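As a usage sketch: the atomic must itself live in the shared segment, next to the data it protects, and a small wrapper keeps lock/unlock paired. The SpinLock and SharedState types below are illustrative, not a specific ZeroIPC API.
#include <atomic>

// Lives inside the shared memory segment, beside the data it protects.
struct SpinLock {
    std::atomic<bool> locked_{false};

    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) { /* spin or yield */ }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

struct SharedState {
    SpinLock lock;
    long counter = 0;   // plain, non-atomic data guarded by the lock
};

// Any process that has the segment mapped can do a protected update:
void increment(SharedState* s) {
    s->lock.lock();
    ++s->counter;
    s->lock.unlock();
}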
Compare this to distributed locking (Redlock, ZooKeeper, etcd), which requires:
- Network round-trips
- Lease timeouts
- Fencing tokens
- Failure detection
4.4 Reactive Patterns Without Message Brokers¶
Publish-subscribe, event streams, and reactive patterns work directly:
// Signal: reactive value with version tracking
Signal<double> temperature(mem, "temp", 20.0);

// Publisher
temperature.set(25.5); // Atomic update with version increment

// Subscriber (another process)
auto last_version = temperature.version();
while (true) {
    temperature.wait_for_change(last_version); // Efficient spin-wait
    process(temperature.get());
    last_version = temperature.version();
}
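One plausible way such a Signal can be built is a value paired with an atomic version counter that subscribers spin or yield on. The sketch below illustrates that idea under those assumptions; it is not necessarily how ZeroIPC implements Signal.
#include <atomic>
#include <cstdint>
#include <thread>

// Version-stamped value placed in shared memory.
struct VersionedDouble {
    std::atomic<uint64_t> version{0};
    std::atomic<double>   value{0.0};

    void set(double v) {
        value.store(v, std::memory_order_relaxed);
        version.fetch_add(1, std::memory_order_release);   // publish the new value
    }

    // Returns the new version once it moves past the one last seen.
    uint64_t wait_for_change(uint64_t last_seen) const {
        uint64_t v;
        while ((v = version.load(std::memory_order_acquire)) == last_seen)
            std::this_thread::yield();                      // spin politely
        return v;                                           // the matching value is now visible
    }

    double get() const { return value.load(std::memory_order_relaxed); }
};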
No Kafka. No RabbitMQ. No Redis Pub/Sub. Just memory.
5. Design Implications¶
Recognizing the single-machine model changes system design:
Don't Do This (Network Thinking)¶
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Service A  │───►│  Message Q  │───►│  Service B  │
│ (Producer)  │    │ (Kafka/etc) │    │ (Consumer)  │
└─────────────┘    └─────────────┘    └─────────────┘
       ↓                  ↓                  ↓
   Serialize        Store/Forward       Deserialize
    ~10 μs             ~100 μs             ~10 μs
Do This (Shared Memory Thinking)¶
┌─────────────┐        ┌─────────────┐
│  Process A  │◄──────►│  Process B  │
│  (Writer)   │   │    │  (Reader)   │
└─────────────┘   │    └─────────────┘
                  │
            ┌─────▼─────┐
            │  Shared   │
            │  Memory   │
            │  Queue    │
            └───────────┘
                  ↓
          Direct read/write
               ~100 ns
Architecture Principles¶
- Don't serialize when you can share: If processes are on the same machine, share memory instead of passing messages (see the sketch after this list).
- Don't coordinate when you can atomize: Replace distributed consensus with atomic operations.
- Don't buffer when you can directly access: Skip intermediate queues; let consumers read producer memory directly.
- Don't abstract the machine away: Network abstractions hide single-machine capabilities. Expose them.
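A small illustration of the first principle: when the payload is a trivially copyable struct, it can live in the mapped segment as-is, and "sending" it is just writing its fields. The SensorReading type is invented for the example; in practice publication would be guarded by a flag or version counter as in the earlier sections.
#include <cstdint>
#include <type_traits>

// The message *is* the memory layout: there is no encode/decode step.
struct SensorReading {
    uint64_t timestamp_ns;
    double   celsius;
};
static_assert(std::is_trivially_copyable_v<SensorReading>,
              "safe to place directly in shared memory");

// Writer: fill the struct in place inside the mapped segment.
void publish(SensorReading* slot, uint64_t ts, double c) {
    slot->timestamp_ns = ts;
    slot->celsius = c;          // the "send" is the write itself
}

// Reader in another process, with the same segment mapped:
//   SensorReading r = *slot;   // the "receive" is the read itself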
6. When This Applies (and When It Doesn't)¶
Single-Machine Model Applies When:¶
- All communicating processes run on one machine
- Microsecond latency matters
- Zero-copy performance is required
- Predictable, bounded latency is needed
- Processes trust each other (same security domain)
Network Model Still Needed When:¶
- Processes run on different machines (obviously)
- Processes may run on different machines in the future
- Security boundaries require isolation
- Fault tolerance across machine failures is required
The Hybrid Reality¶
Most real systems are hybrid: single-machine communication between local processes, network communication between machines. The key insight is to use the right model for each hop.
┌─── Machine A ─────────────────┐    ┌─── Machine B ─────────────────┐
│                               │    │                               │
│  ┌───────┐  SHM  ┌───────┐    │    │  ┌───────┐  SHM  ┌───────┐    │
│  │Proc 1 │◄─────►│Proc 2 │    │    │  │Proc 4 │◄─────►│Proc 5 │    │
│  └───────┘       └───┬───┘    │    │  └───────┘       └───────┘    │
│                      │        │    │      ▲                        │
│                      │ Network│    │      │                        │
│                      └────────┼────┼──────┘                        │
│                               │    │                               │
└───────────────────────────────┘    └───────────────────────────────┘
Within machine: Shared memory (ns latency, zero-copy)
Between machines: Network protocol (ms latency, serialization)
7. ZeroIPC: Embracing the Model¶
ZeroIPC explicitly embraces the single-machine model:
| Feature | Network-Style Would Require | ZeroIPC Provides |
|---|---|---|
| Data sharing | Serialize → Send → Deserialize | Direct pointer sharing |
| Synchronization | Distributed locks | Shared memory atomics |
| Queues | Message broker | Lock-free circular buffers |
| Events | Pub/Sub middleware | Atomic flags with spin-wait |
| Streams | Kafka/Pulsar | Direct ring buffers |
What We Don't Provide (Intentionally)¶
- Network transport: Not our problem domain
- Serialization: Unnecessary for trivially-copyable types
- Persistence: Shared memory is for IPC, not storage
- Distribution: Would undermine our performance guarantees
8. Quantifying the Difference¶
Real benchmarks comparing single-machine communication patterns:
| Method | Latency | Throughput | Notes |
|---|---|---|---|
| TCP localhost | 50-100 μs | 2-5 GB/s | Full network stack |
| Unix socket | 20-50 μs | 5-10 GB/s | Skips the TCP/IP stack, still kernel-mediated |
| Pipe | 10-30 μs | 3-6 GB/s | Simple but unidirectional |
| Shared memory | 50-200 ns | 40+ GB/s | Direct memory access |
| ZeroIPC queue | 100-150 ns | 8M+ ops/s | Lock-free with ordering |
The 100-1000x latency improvement isn't incremental optimization. It represents access to a different computational model.
9. Conclusion¶
For decades, distributed systems thinking has dominated how we architect multi-process applications. This made sense when processes often ran on different machines. But modern applications frequently colocate processes on single machines—for containerization, security isolation, or language interoperability.
When processes share a machine, they share more than locality. They share:
- A single failure domain
- Bounded communication latency
- Hardware-provided ordering guarantees
- The ability to reference the same memory
These properties enable algorithms and performance characteristics impossible in distributed systems. By recognizing shared memory on a single machine as a distinct computational model, we can build dramatically simpler and faster systems for single-node communication.
The single-machine thesis isn't about rejecting distributed systems. It's about applying the right model to the right problem. When processes share RAM, let them share it directly.
ZeroIPC implements this thesis: a library designed specifically for single-machine IPC that embraces shared memory's unique properties rather than abstracting them away behind network-style interfaces.