Metastable Failures Are the Hardest to Prevent

January 24, 2022

Metastable failures are a class of distributed systems failures where the system enters a degraded state that persists even after the original trigger is removed, requiring drastic intervention to recover.

"Metastable failures manifest themselves as black swan events; they are outliers because nothing in the past points to their possibility, have a severe impact, and are much easier to explain in hindsight than to predict." Bronson et al., Metastable Failures in Distributed Systems

The key insight about metastable failures is the distinction between trigger and root cause. The trigger (a load spike, a network blip, a deployment) merely initiates the failure. The real root cause is a sustaining feedback loop, typically involving work amplification, that keeps the system in its degraded state. A cache goes cold, so all requests hit the database; the database slows down, so requests time out; timeouts trigger retries, which add more load; and the cycle continues. The trigger is long gone, but the system stays broken.
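The loop can be made concrete with a toy simulation. The sketch below is an illustration, not a model taken from the paper. It assumes that an overloaded server wastes effort on requests that time out anyway, so goodput falls as capacity squared divided by offered load, and that every failed request is retried on the next tick. The function name `simulate` and all numbers are hypothetical.

```python
def simulate(capacity, arrivals, spike, spike_tick, ticks):
    """Return goodput per tick for a server whose clients retry every failure.

    Assumed overload model (illustrative only): once offered load exceeds
    capacity, work is wasted on requests that time out, so the number of
    successful requests is capacity**2 / offered.
    """
    retries = 0.0
    goodput_history = []
    for tick in range(ticks):
        # The trigger is a one-tick burst of extra requests.
        extra = spike if tick == spike_tick else 0.0
        offered = arrivals + retries + extra
        if offered <= capacity:
            goodput = float(offered)
        else:
            goodput = capacity * capacity / offered
        # Every request that failed this tick comes back next tick.
        retries = offered - goodput
        goodput_history.append(goodput)
    return goodput_history


small_spike = simulate(capacity=100, arrivals=80, spike=30, spike_tick=5, ticks=50)
large_spike = simulate(capacity=100, arrivals=80, spike=100, spike_tick=5, ticks=50)
print("final goodput after small spike:", small_spike[-1])
print("final goodput after large spike:", large_spike[-1])
```

With these numbers the small spike is absorbed and goodput returns to the arrival rate. The large spike lasts a single tick, yet goodput stays below the arrival rate for the rest of the run because the retries alone keep the server overloaded.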

What makes these failures especially insidious is that the sustaining feedback loops are created by features designed to improve reliability or efficiency in the steady state. Retries, failover, caching, speculative execution: all sensible optimizations that become self-reinforcing failure amplifiers under the wrong conditions. Many production systems deliberately operate in the "vulnerable state" because the efficiency gains are enormous; a warm cache can improve throughput by 10x or more. This means the system is always one trigger away from a metastable failure.
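A back-of-the-envelope calculation shows where a figure like 10x can come from. The sketch below assumes a simplified model in which only cache misses reach the database and the database is the sole bottleneck. The function names and the numbers (a database that handles 1,000 requests per second, a 90% hit rate) are hypothetical.

```python
def max_throughput(db_capacity, hit_rate):
    """Requests/s the system sustains if only cache misses reach the database.

    Requires hit_rate < 1.0.
    """
    return db_capacity / (1.0 - hit_rate)


def db_overload_factor(load, db_capacity, hit_rate):
    """Database load as a multiple of its capacity (above 1.0 means overloaded)."""
    return load * (1.0 - hit_rate) / db_capacity


db_capacity = 1000  # requests/s the database can serve on its own (assumed)
load = 5000         # requests/s arriving at the system (assumed)

print("warm-cache ceiling:", max_throughput(db_capacity, hit_rate=0.9))
print("overload factor, warm cache:", db_overload_factor(load, db_capacity, hit_rate=0.9))
print("overload factor, cold cache:", db_overload_factor(load, db_capacity, hit_rate=0.0))
```

Under these assumptions a 90% hit rate raises the ceiling to roughly ten times the database's own capacity. Running at half of that ceiling looks comfortable while the cache is warm, but the same load is about five times what the database can serve once the cache is cold.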

The most productive response is not to chase triggers (that is whack-a-mole) but to weaken the strongest feedback loops. Change routing or queueing policies during overload. Make failure paths cheap: isolate expensive error handling into bounded, lock-free queues. Give retried queries lower priority than fresh user requests, so that fresh requests can succeed and break the cycle. And practice outlier hygiene: metastable failures often announce themselves as latency outliers long before they become full outages, so investigate outliers rather than dismissing them.
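One way to picture the retry-priority idea is a two-level queue. The sketch below is a minimal, single-threaded illustration with hypothetical names (`TwoLevelQueue`, `submit`, `next_request`). It makes no attempt at lock-freedom and leaves the fresh queue unbounded for brevity. Fresh requests are always served first, and retries are shed once their bounded queue is full.

```python
from collections import deque


class TwoLevelQueue:
    """Serve fresh requests before retries; bound how many retries may wait."""

    def __init__(self, max_retries_queued):
        self._fresh = deque()
        self._retries = deque()
        self._max_retries_queued = max_retries_queued

    def submit(self, request, is_retry=False):
        """Enqueue a request. Returns False if a retry was shed."""
        if not is_retry:
            self._fresh.append(request)
            return True
        if len(self._retries) >= self._max_retries_queued:
            # Shedding here keeps retries from amplifying the overload.
            return False
        self._retries.append(request)
        return True

    def next_request(self):
        """Return the next request to serve, or None if both queues are empty."""
        if self._fresh:
            return self._fresh.popleft()
        if self._retries:
            return self._retries.popleft()
        return None


queue = TwoLevelQueue(max_retries_queued=2)
queue.submit("retry-1", is_retry=True)
queue.submit("fresh-1")
print(queue.next_request())  # the fresh request is served first
```

Bounding the retry queue caps how much extra work the failure path can generate, which is the property that weakens the loop. The strict priority order is a design choice that can starve retries under sustained load, so a real system might prefer a weighted scheme.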

Address the sustaining feedback loop, not the trigger. There are infinitely many possible triggers but only a few feedback loops strong enough to trap your system.
