
The Single-Machine Thesis

Inter-process shared memory on a single machine is a fundamentally different computational model from distributed systems over a network.

Abstract

Distributed systems research has dominated concurrent programming discourse for decades, leading developers to apply network-oriented patterns—serialization, message queues, consensus protocols—even to single-machine inter-process communication. This paper argues that shared memory IPC on a single machine represents a categorically different computational model with distinct properties: guaranteed message delivery, bounded latency, causal ordering, and the ability to share mutable state directly. Recognizing this distinction enables dramatic simplifications and orders-of-magnitude performance improvements for single-node systems.

1. The Problem with Network Thinking

Modern software architecture assumes network-style communication even between processes on the same machine. Consider a typical "microservices" deployment on a single server:

Process A → HTTP → localhost → HTTP → Process B
    [JSON serialization]
    [TCP/IP stack]
    [Context switches]
    [Deserialization]

This pattern, borrowed from distributed systems, introduces unnecessary overhead:

  • Serialization: Converting structured data to bytes and back
  • Protocol overhead: HTTP headers, TCP handshakes, checksums
  • Kernel involvement: Socket buffers, network stack processing
  • Latency: Hundreds of microseconds for a localhost round-trip

But these costs pay for solutions to distributed systems problems that simply don't arise on a single machine.

2. What Networks Must Handle

Distributed systems face fundamental challenges that shape their design:

| Challenge | Network Reality                     | Single Machine Reality                   |
|-----------|-------------------------------------|------------------------------------------|
| Delivery  | Messages may be lost                | Writes to RAM always succeed             |
| Ordering  | Packets arrive out of order         | Memory operations have defined ordering  |
| Latency   | Unbounded, variable (ms to seconds) | Bounded, predictable (ns to μs)          |
| Failure   | Partial failure possible            | Processes fail together or not at all    |
| State     | Must serialize everything           | Can share pointers directly              |
| Consensus | Requires complex protocols          | Atomics provide it natively              |

The CAP theorem, Paxos, Raft, vector clocks, and eventual consistency all address problems that simply don't exist when processes share RAM on a single machine.
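
As a concrete example of that last contrast, "consensus" between processes on one machine can be a single compare-and-swap on a shared atomic: whichever process wins the CAS becomes the leader, with no protocol rounds, quorums, or timeouts. The sketch below is illustrative; it assumes the atomic slot has already been placed in a segment mapped by every participant (e.g., via shm_open and mmap) and zero-initialized.

// One-shot leader election through a shared atomic slot.
// `slot` lives in shared memory mapped by all participating processes
// and starts at 0, meaning "no leader yet".
#include <atomic>
#include <unistd.h>   // getpid

bool try_become_leader(std::atomic<int>& slot) {
    int expected = 0;
    // Exactly one process's CAS can succeed; every other process observes
    // the winner's pid. No rounds, no quorum, no timeouts.
    return slot.compare_exchange_strong(expected, static_cast<int>(getpid()));
}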

3. Properties of Single-Machine Shared Memory

3.1 Reliable Delivery

When Process A writes to shared memory address 0x7f00..., Process B will see that write. There is no packet loss, no network partition, no Byzantine failure. The write either happens (and is visible) or the entire machine has failed.

This enables lock-free algorithms that would be impossible over a network:

// `state` is a std::atomic<T> in shared memory; `expected` and `desired`
// are local values. This CAS operation has exactly three outcomes:
// 1. Succeeds (value was exchanged)
// 2. Fails (value was different; retry)
// 3. Process crashes (entire machine state undefined)
//
// There is no "timeout", no "partial success", no "unknown state".
while (!state.compare_exchange_weak(expected, desired)) {
    // On failure, compare_exchange_weak has already reloaded the current
    // value into `expected`, so we can simply retry.
}

3.2 Bounded Latency

Network latency is fundamentally unbounded—a packet might arrive in 1ms or 10 seconds. This uncertainty infects all distributed algorithms:

Network: "Wait for response (timeout after ???)"
         "If timeout, was the message lost? Delivered? Acted upon?"

Shared memory latency is bounded by physics:

L3 cache hit:    ~10-20 ns
RAM access:      ~60-100 ns
Context switch:  ~1-10 μs

This bounded latency enables real-time guarantees impossible in distributed systems.

3.3 Causal Ordering

Modern CPUs provide memory ordering guarantees. With appropriate fences, if Process A writes X then Y, Process B observing Y will also observe X:

// Shared between the two processes (e.g., in a mapped segment):
//   int data = 0;  std::atomic<bool> flag{false};

// Producer
data = 42;                                            // Write data
std::atomic_thread_fence(std::memory_order_release);
flag.store(true, std::memory_order_relaxed);          // Signal ready

// Consumer
while (!flag.load(std::memory_order_relaxed));        // Wait for the signal
std::atomic_thread_fence(std::memory_order_acquire);
// data is guaranteed to be 42 here

No vector clocks needed. No Lamport timestamps. The hardware provides ordering.

3.4 Shared Mutable State

The most profound difference: processes can share pointers to the same memory. Not copies. Not serialized representations. The actual bits.

// Process A
SharedArray<double>* temps = open_array<double>("/sensors");
(*temps)[0] = 20.5;   // Direct write to shared memory

// Process B (simultaneously)
SharedArray<double>* temps = open_array<double>("/sensors");
double t = (*temps)[0];   // Reads 20.5 (or a racing value)

This is impossible over a network. You cannot dereference a pointer across TCP.
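
For concreteness, here is a minimal POSIX sketch of the mechanism an open_array-style call relies on: two processes map the same named segment and read and write the same physical bytes. The segment name and size are illustrative, error handling is omitted, and the plain double accesses are left unsynchronized, exactly as in the racing example above.

// Map a named shared memory segment as an array of doubles (POSIX/Linux).
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

double* open_shared_doubles(std::size_t count) {
    int fd = shm_open("/demo_sensors", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, count * sizeof(double));            // Size the segment
    void* p = mmap(nullptr, count * sizeof(double),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                        // Mapping stays valid
    return static_cast<double*>(p);
}

// Process A:  open_shared_doubles(64)[0] = 20.5;      // writes the shared page
// Process B:  double t = open_shared_doubles(64)[0];  // reads the same bytes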

4. What This Enables

4.1 Zero-Copy Communication

No serialization. No copying. Process A's write is Process B's read:

Traditional IPC:
  Process A memory → serialize → kernel buffer → deserialize → Process B memory
  Latency: 50-500 μs

Shared Memory:
  Process A memory == Process B memory (same physical pages)
  Latency: 50-100 ns

A 1000x improvement isn't an optimization; it's a different computational model.

4.2 Lock-Free Everything

Network communication requires request-response patterns. Shared memory enables true lock-free algorithms:

// Lock-free single-producer/single-consumer queue in shared memory.
// Members: std::atomic<uint32_t> head_, tail_;  T buffer_[capacity];
bool push(T value) {
    uint32_t tail = tail_.load(std::memory_order_relaxed);
    uint32_t next = (tail + 1) % capacity;
    if (next == head_.load(std::memory_order_acquire)) return false;  // Full
    buffer_[tail] = value;
    tail_.store(next, std::memory_order_release);     // Publish the filled slot
    return true;
}

No locks. No blocking. No coordination protocol. Just atomic operations.
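
For completeness, here is a matching pop() in the same style, a sketch assuming the same single-producer/single-consumer layout and member names as the push() above rather than ZeroIPC's exact implementation:

// Consumer side of the same SPSC ring buffer.
bool pop(T& out) {
    uint32_t head = head_.load(std::memory_order_relaxed);
    if (head == tail_.load(std::memory_order_acquire)) return false;  // Empty
    out = buffer_[head];
    head_.store((head + 1) % capacity, std::memory_order_release);    // Free the slot
    return true;
}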

4.3 Direct Synchronization Primitives

Mutexes, semaphores, barriers, and condition variables can operate directly on shared memory without kernel involvement:

// Cross-process mutex: a single atomic flag placed in shared memory.
// Member: std::atomic<bool> locked_{false};
void lock() {
    while (locked_.exchange(true, std::memory_order_acquire)) {
        // Spin or yield until the current holder releases the lock
    }
}

void unlock() {
    locked_.store(false, std::memory_order_release);
}

Compare to distributed locking (Redlock, ZooKeeper, etcd) which requires:

  • Network round-trips
  • Lease timeouts
  • Fencing tokens
  • Failure detection
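
How do two processes end up sharing the locked_ flag in the first place? One way is sketched below, assuming std::atomic<bool> is lock-free (which it must be for cross-process use): map a named segment and construct the atomic in it with placement new. The segment name, the creator flag, and the omitted error handling are all simplifications.

// Place the spinlock's flag in a named shared memory segment (POSIX/Linux).
#include <atomic>
#include <fcntl.h>
#include <new>
#include <sys/mman.h>
#include <unistd.h>

std::atomic<bool>* map_shared_lock(bool creator) {
    int fd = shm_open("/demo_lock", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(std::atomic<bool>));
    void* p = mmap(nullptr, sizeof(std::atomic<bool>),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    // Only the process that creates the segment constructs the flag;
    // later processes just reuse the already-initialized object.
    return creator ? new (p) std::atomic<bool>(false)
                   : static_cast<std::atomic<bool>*>(p);
}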

4.4 Reactive Patterns Without Message Brokers

Publish-subscribe, event streams, and reactive patterns work directly:

// Signal: reactive value with version tracking
Signal<double> temperature(mem, "temp", 20.0);

// Publisher
temperature.set(25.5);  // Atomic update with version increment

// Subscriber (another process, attached to the same signal)
Signal<double> temperature(mem, "temp", 20.0);
uint64_t last_version = 0;
while (true) {
    temperature.wait_for_change(last_version);  // Efficient spin-wait
    process(temperature.get());
    last_version = temperature.version();
}

No Kafka. No RabbitMQ. No Redis Pub/Sub. Just memory.
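
Underneath, such a signal can be as little as a value paired with an atomic version counter. The sketch below is a minimal single-writer version for illustration, not ZeroIPC's actual Signal implementation; it assumes std::atomic<double> is lock-free, which it is on mainstream 64-bit platforms.

// Minimal single-writer "signal": an atomic value plus an atomic version.
#include <atomic>
#include <cstdint>

struct VersionedDouble {
    std::atomic<double>   value{0.0};
    std::atomic<uint64_t> version{0};

    void set(double v) {                               // Publisher
        value.store(v, std::memory_order_relaxed);
        version.fetch_add(1, std::memory_order_release);   // Publish the update
    }

    // Subscriber: spin until the version moves past `last_seen`, then read.
    double wait_for_change(uint64_t& last_seen) const {
        uint64_t v;
        while ((v = version.load(std::memory_order_acquire)) == last_seen) {
            // Spin (a real implementation would back off or sleep)
        }
        last_seen = v;
        return value.load(std::memory_order_relaxed); // Latest (or newer) value
    }
};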

5. Design Implications

Recognizing the single-machine model changes system design:

Don't Do This (Network Thinking)

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Service A  │───►│ Message Q   │───►│  Service B  │
│  (Producer) │    │ (Kafka/etc) │    │  (Consumer) │
└─────────────┘    └─────────────┘    └─────────────┘
     ↓                   ↓                   ↓
  Serialize          Store/Forward      Deserialize
  ~10 μs              ~100 μs            ~10 μs

Do This (Shared Memory Thinking)

┌─────────────┐         ┌─────────────┐
│  Process A  │◄──┬────►│  Process B  │
│  (Writer)   │   │     │  (Reader)   │
└─────────────┘   │     └─────────────┘
            ┌─────▼─────┐
            │  Shared   │
            │  Memory   │
            │  Queue    │
            └───────────┘
             Direct read/write
             ~100 ns

Architecture Principles

  1. Don't serialize when you can share: If processes are on the same machine, share memory instead of passing messages.

  2. Don't coordinate when you can atomize: Replace distributed consensus with atomic operations.

  3. Don't buffer when you can directly access: Skip intermediate queues; let consumers read producer memory directly.

  4. Don't abstract the machine away: Network abstractions hide single-machine capabilities. Expose them.

6. When This Applies (and When It Doesn't)

Single-Machine Model Applies When:

  • All communicating processes run on one machine
  • Microsecond latency matters
  • Zero-copy performance is required
  • Predictable, bounded latency is needed
  • Processes trust each other (same security domain)

Network Model Still Needed When:

  • Processes run on different machines (obviously)
  • Processes may run on different machines in future
  • Security boundaries require isolation
  • Fault tolerance across machine failures is required

The Hybrid Reality

Most real systems are hybrid: single-machine communication between local processes, network communication between machines. The key insight is to use the right model for each hop.

┌─── Machine A ──────────────────┐      ┌─── Machine B ──────────────────┐
│                                │      │                                │
│  ┌───────┐  SHM  ┌───────┐     │      │     ┌───────┐  SHM  ┌───────┐  │
│  │Proc 1 │◄─────►│Proc 2 │     │      │     │Proc 4 │◄─────►│Proc 5 │  │
│  └───────┘       └───┬───┘     │      │     └───┬───┘       └───────┘  │
│                      │ Network │      │         ▲                      │
│                      └─────────┼──────┼─────────┘                      │
│                                │      │                                │
└────────────────────────────────┘      └────────────────────────────────┘

Within machine: Shared memory (ns latency, zero-copy)
Between machines: Network protocol (ms latency, serialization)

7. ZeroIPC: Embracing the Model

ZeroIPC explicitly embraces the single-machine model:

| Feature         | Network-Style Would Require    | ZeroIPC Provides            |
|-----------------|--------------------------------|-----------------------------|
| Data sharing    | Serialize → Send → Deserialize | Direct pointer sharing      |
| Synchronization | Distributed locks              | Shared memory atomics       |
| Queues          | Message broker                 | Lock-free circular buffers  |
| Events          | Pub/Sub middleware             | Atomic flags with spin-wait |
| Streams         | Kafka/Pulsar                   | Direct ring buffers         |

What We Don't Provide (Intentionally)

  • Network transport: Not our problem domain
  • Serialization: Unnecessary for trivially-copyable types
  • Persistence: Shared memory is for IPC, not storage
  • Distribution: Would undermine our performance guarantees

8. Quantifying the Difference

Real benchmarks comparing single-machine communication patterns:

| Method        | Latency    | Throughput | Notes                     |
|---------------|------------|------------|---------------------------|
| TCP localhost | 50-100 μs  | 2-5 GB/s   | Full network stack        |
| Unix socket   | 20-50 μs   | 5-10 GB/s  | Partial kernel bypass     |
| Pipe          | 10-30 μs   | 3-6 GB/s   | Simple but unidirectional |
| Shared memory | 50-200 ns  | 40+ GB/s   | Direct memory access      |
| ZeroIPC queue | 100-150 ns | 8M+ ops/s  | Lock-free with ordering   |

The 100-1000x latency improvement isn't incremental optimization. It represents access to a different computational model.
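
A rough way to reproduce the shared-memory row yourself: bounce a flag between a parent and a forked child through one shared cache line and divide total time by iterations. This is a sketch (Linux/POSIX, no error handling), not the harness behind the table above, and absolute numbers will vary by CPU.

// Cross-process round-trip latency through an anonymous shared mapping.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    void* p = mmap(nullptr, sizeof(std::atomic<int>), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    auto* turn = new (p) std::atomic<int>(0);          // 0: parent's turn

    const int iters = 1'000'000;
    if (fork() == 0) {                                 // Child: echo each ping
        for (int i = 0; i < iters; ++i) {
            while (turn->load(std::memory_order_acquire) != 1) {}
            turn->store(0, std::memory_order_release);
        }
        _exit(0);
    }

    auto start = std::chrono::steady_clock::now();     // Parent: ping, await echo
    for (int i = 0; i < iters; ++i) {
        turn->store(1, std::memory_order_release);
        while (turn->load(std::memory_order_acquire) != 0) {}
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    wait(nullptr);
    std::printf("round trip: %.1f ns\n", static_cast<double>(ns) / iters);
}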

9. Conclusion

For decades, distributed systems thinking has dominated how we architect multi-process applications. This made sense when processes often ran on different machines. But modern applications frequently colocate processes on single machines—for containerization, security isolation, or language interoperability.

When processes share a machine, they share more than locality. They share:

  • A single failure domain
  • Bounded communication latency
  • Hardware-provided ordering guarantees
  • The ability to reference the same memory

These properties enable algorithms and performance characteristics impossible in distributed systems. By recognizing shared memory on a single machine as a distinct computational model, we can build dramatically simpler and faster systems for single-node communication.

The single-machine thesis isn't about rejecting distributed systems. It's about applying the right model to the right problem. When processes share RAM, let them share it directly.


ZeroIPC implements this thesis: a library designed specifically for single-machine IPC that embraces shared memory's unique properties rather than abstracting them away behind network-style interfaces.
