Distributed Systems Engineering Is About Making Tradeoffs Explicit

Building distributed systems is part science, part art. The science lies in understanding fundamental principles: the math of why coordination does not scale, the inevitability of failures, the importance of invariants and feedback loops. The art lies in applying those principles in balance. The single most important discipline is to respect the tradeoffs and make them explicit. Know what you are trading for what: consistency for latency, complexity for capability, optimism for the risk of retry.

The core lessons fall into three categories.

Design for failures, tails, and uncertainty. It is not enough that most requests finish quickly; a few slow ones can ruin the user experience. In fan-out architectures, what was a 1-in-1,000 outlier becomes commonplace when you call 100 services in parallel. Engineer for the 99.9th percentile, not the median. Metastable failures, where a system enters a bad state and stays there even after the trigger is gone, have caused major outages at every large company. They often stem from features meant to improve reliability: retries, caching, aggressive concurrency. The cure is breaking the feedback loop: backpressure, load shedding, and exponential backoff with jitter. But backoff does not solve sustained overload; with many independent clients, deferring work merely shifts the surge. For prolonged heavy load, you must shed load or add capacity.
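A minimal sketch of capped exponential backoff with full jitter (the function name and parameters here are illustrative, not from the original text). The jitter is the point: without it, clients that failed together retry together, and the synchronized retry wave can re-trigger the very overload that caused the failures.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0):
    """Retry `operation` with capped exponential backoff and full jitter.

    Each retry sleeps a uniformly random duration in
    [0, min(cap, base * 2**attempt)], spreading retries out in time
    so independent clients do not surge back in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

Note that this still only defers work: if the downstream service is in sustained overload, every client eventually exhausts its attempts, which is why backoff must be paired with load shedding or added capacity.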

Scale out via coordination avoidance. The fastest distributed operation is the one that does not require consensus. People assume consensus protocols like Paxos (Lamport, 1998) and Raft (Ongaro & Ousterhout, 2014), which let a cluster of machines agree on values even when some fail, are what make systems "web-scale," but consensus gives reliability, not performance. (Raft was designed explicitly to be more understandable than Paxos, which is notoriously difficult to implement correctly.) The fundamental mechanism of scaling is partitioning work so tasks happen independently on different nodes. Minimize the portion of your system that must be synchronized; everything else can grow. Beware of accidental coordination through shared resources: a hot key or hot shard becomes a serial bottleneck regardless of how parallel your design looks.
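A sketch of hash partitioning, the coordination-free placement scheme the paragraph describes (the function names and shard counts are illustrative assumptions). Every node computes the same key-to-shard mapping locally, so no consensus is needed to decide where work goes; but the second function shows how a skewed workload quietly re-serializes a "parallel" design.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash. Any node can compute
    this placement independently, with no coordination round."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def shard_load(requests, num_shards):
    """Count requests landing on each shard, to expose hot keys."""
    load = [0] * num_shards
    for key in requests:
        load[shard_for(key, num_shards)] += 1
    return load

# If 90% of requests touch one key, that key's shard serializes 90%
# of the work no matter how many shards exist.
skewed = ["hot-key"] * 90 + [f"key-{i}" for i in range(10)]
```

Mitigations for hot keys (request-level caching, splitting a hot key into sub-keys) all amount to the same move: restoring independence between units of work.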

Keep systems understandable. Simple does not mean minimal features; it means avoiding needless complexity. Identify key invariants and assert them continuously. Formal methods like TLA+, a specification language created by Leslie Lamport for modeling concurrent and distributed systems, catch bugs your tests miss; Amazon has publicly credited TLA+ with finding critical bugs in DynamoDB, S3, and other core services that conventional testing missed. But formal methods solve at most half the problem: correctness, not performance. Combine formal specs for correctness-critical logic with empirical testing for performance. Invest in observability: when something unexpected happens, treat it as a lesson. A mature distributed system is one that has assimilated countless hard lessons from production.
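A toy illustration of asserting an invariant continuously (the ledger and its invariant are hypothetical examples, not from the original text). The invariant, that the sum of balances equals net deposits, is checked after every state change, so a bug surfaces at the mutation that introduced it rather than much later.

```python
class AccountLedger:
    """Toy ledger whose invariant is checked after every mutation:
    the sum of all balances must equal the net amount deposited."""

    def __init__(self):
        self.balances = {}
        self.net_deposited = 0

    def _check_invariant(self):
        assert sum(self.balances.values()) == self.net_deposited, \
            "invariant violated: balances drifted from net deposits"

    def deposit(self, account, amount):
        self.balances[account] = self.balances.get(account, 0) + amount
        self.net_deposited += amount
        self._check_invariant()  # assert after every state change

    def transfer(self, src, dst, amount):
        # A transfer moves money between accounts; it must neither
        # create nor destroy it, so the invariant still holds.
        self.balances[src] -= amount
        self.balances[dst] = self.balances.get(dst, 0) + amount
        self._check_invariant()
```

In production the same idea shows up as background consistency checkers and per-request sanity assertions; a TLA+ spec states the invariant once, while runtime checks like these confirm the running code actually preserves it.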

Takeaway: Optimize for the common case, plan for the worst case, and always ask "what happens if this assumption fails?" before deciding whether that failure mode is acceptable.


See also: Tail Latency Dominates User Experience | Metastable Failures Are the Hardest to Prevent | The Fundamental Mechanism of Scaling Is Partitioning | Cache Is a Lie You Agree to Believe | Correlated Failures Are the Real Threat | Goodput Matters More Than Throughput | Queues Do Not Smooth Load They Defer Pain