Monitor What Matters Not What Is Easy

July 21, 2023

Monitoring tools trick people into thinking observability is an easy problem. It is not. Most organizations drown in metrics they never look at while missing the signals that actually predict failure.

"Monitoring and observability tools commit the cardinal sin of tricking people into thinking this is an easy problem. It is very simple to monitor a small application or service. Almost none of those approaches scale." - Mathew Duggan

The fundamental problem with monitoring is scope creep. Logs start as debugging aids: print statements saved to disk. Then they become the customer service tool, the audit trail, the business intelligence source of truth, and the deployment verification system. Suddenly the simple syslog is mission-critical infrastructure, and the cost of monitoring an application can easily exceed the cost of hosting it. Each expansion of purpose adds requirements (retention, searchability, reliability) without anyone stepping back to ask whether this accretion of responsibilities makes sense.

Metrics follow the same pattern. Prometheus works beautifully on one server. Then you need federation, long-term storage, and cross-service queries, while stakeholders outside engineering want to track customer behavior and marketing campaigns through the same pipeline. The complexity jump from "store everything and let god figure it out" to hierarchical federation or cross-service aggregation is enormous and rarely anticipated.

The discipline that matters is distinguishing between operational metrics and analytical data. Operational metrics need to be real-time, reliable, and actionable; they tell you whether to wake someone up at 3 AM. Analytical data can tolerate delay, sampling, and lower availability. Conflating "how we do things" with "how we decide which things to do" is a fatal mistake. Sample aggressively for low-priority signals. Store compliance-critical records in a dedicated system, not the logging pipeline. Set and enforce SLAs for your monitoring infrastructure itself. And above all, invest your monitoring budget in the characteristic metrics that actually predict system state transitions, not in capturing every 200-OK response for eternity.
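To make the sampling and routing advice concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Event` type, the `compliance` flag, the tier names, the 1% keep-rate, and the destination labels are hypothetical, not taken from any particular tool. It keeps every server error, sends compliance records to a separate store, and keeps only a deterministic fraction of routine successes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    request_id: str
    status: int               # HTTP status code
    compliance: bool = False  # hypothetical flag: record is needed for audit


# Hypothetical keep-rates per priority tier; tune these to your own budget.
SAMPLE_RATES = {"high": 1.0, "low": 0.01}


def priority(event: Event) -> str:
    """Treat server errors as operational signals, everything else as low priority."""
    return "high" if event.status >= 500 else "low"


def keep(event: Event, rates=SAMPLE_RATES) -> bool:
    """Deterministic sampling: hash the request id to a 64-bit integer and
    keep the event if it falls below rate * 2**64. The same request id
    always gets the same decision, so related log lines stay together."""
    digest = hashlib.sha256(event.request_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return value < rates[priority(event)] * 2**64


def route(event: Event, rates=SAMPLE_RATES) -> str:
    """Return a destination label for the event."""
    if event.compliance:
        # Compliance records bypass sampling and the logging pipeline entirely.
        return "audit_store"
    return "log_pipeline" if keep(event, rates) else "drop"
```

A call such as `route(Event("req-1", 503))` always yields `"log_pipeline"`, while a routine 200 response is dropped about 99% of the time under the assumed rates. Hashing the request id rather than calling a random number generator is a design choice: it makes the decision reproducible, so every service that sees the same request can reach the same verdict.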

The value of monitoring is in the decisions it enables, not the data it collects. Measure what changes your behavior, and sample or discard the rest.
