Static Stability Over Dynamic Failover

A statically stable system keeps working when a dependency becomes impaired, rather than relying on control-plane reactions to reconfigure itself after failure.

"A more effective, statically stable service would overprovision its infrastructure to the point where it would continue operating correctly without having to launch any new EC2 instances, even if an Availability Zone were to become impaired." (Amazon Builders' Library)

The distinction between control plane and data plane is fundamental. The control plane makes changes: launching instances, updating routes, modifying configurations. The data plane handles the daily business of serving requests. Because the control plane is inherently more complex, it targets a lower availability bar than the data plane. A system that depends on the control plane to react to failures in order to keep the data plane running has coupled a high-availability requirement to a lower-availability component. This is a structural mistake.

Static stability means the data plane has everything it needs to keep working without any help from the control plane during an impairment. The EC2 data plane, for example, ensures that each physical machine has local access to all routing information for its VPC. If the control plane goes down, no new instances can be launched and no new routes can be added, but all existing traffic continues to flow. Contrast this with a design that relies on auto-scaling to react to an Availability Zone failure: that approach depends on the control plane working at the exact moment it is most likely to be stressed.
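
The overprovisioning arithmetic behind this can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names, the assumption that load spreads evenly across zones, and the example figures are all hypothetical and do not come from the sources above.

```python
import math


def instances_per_zone(peak_instances: int, zones: int, zones_lost: int = 1) -> int:
    """Instances to keep running in each zone so the surviving zones
    can carry peak load without launching anything new.

    Assumes load is spread evenly across zones (a simplification).
    """
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    # Round up: a fractional instance cannot be provisioned.
    return math.ceil(peak_instances / surviving)


def overprovision_ratio(peak_instances: int, zones: int, zones_lost: int = 1) -> float:
    """Total provisioned capacity divided by the capacity peak load needs."""
    total = instances_per_zone(peak_instances, zones, zones_lost) * zones
    return total / peak_instances


# Hypothetical example: peak load needs 12 instances.
print(instances_per_zone(12, zones=3))   # 6 per zone, 18 in total
print(overprovision_ratio(12, zones=3))  # 1.5
print(overprovision_ratio(12, zones=2))  # 2.0
```

Under these assumptions, spreading across more zones lowers the cost of static stability: tolerating the loss of one zone out of three takes 50% extra capacity, while one out of two takes 100%.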

Dropbox learned this lesson through years of disaster readiness work. Their ultimate test was physically unplugging a data center from the network. The systems that survived were those designed to continue operating without the San Jose (SJC) data center, not those that depended on detecting the failure and reconfiguring. The practical principle: provision enough capacity across failure domains before the failure, not in response to it.

Takeaway: Design your data plane to keep serving with the capacity it already has, so it never depends on the control plane working during a crisis.


See also: Correlated Failures Are the Real Threat | Efficiency Is The Enemy of Resilience | Metastable Failures Are the Hardest to Prevent