Monitoring That Actually Tells You Something

November 1, 2025 · 10 min read
Monitoring, SRE, Alerting, DevOps, Production, Observability
I once inherited a Grafana instance with 47 dashboard panels. CPU utilization, memory usage, disk I/O, network bytes, JVM heap — every metric you could imagine. Everything was green. All the time.

Two days later, the API went down for 4 hours. Not a single alert fired.

Why? Because CPU was at 22%, memory at 45%, and disk at 30%. All "healthy." The actual problem was connection pool exhaustion, and nobody was watching pool usage.

The Four Golden Signals (and Nothing Else)

Google's SRE book nailed this. You need exactly four signals:

1. Latency — How long do requests take? Not average latency — that hides problems. Track P50, P95, and P99:

  • P50 = 200ms means half your requests complete in 200ms or less (good)
  • P95 = 800ms means 1 in 20 requests takes 800ms or longer (acceptable)
  • P99 = 5000ms means 1 in 100 requests takes 5 seconds or longer (problem)

Your P99 is your real performance. The average lies.
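
To make the "average lies" point concrete, here is a minimal sketch that compares the mean against P50, P95, and P99. It assumes a simple nearest-rank percentile and a made-up batch of latencies; the `percentile` helper and the sample numbers are illustrative, not taken from any real system.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [200] * 98 + [5000] * 2

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

With this sample the mean comes out under 300ms while P99 is 5000ms, which is the gap a dashboard built on averages would hide. A real monitoring system typically estimates percentiles from histograms rather than raw samples, so treat this only as an illustration of the idea.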

2. Traffic — How many requests are you handling? This is your baseline. If traffic drops 80% at 2pm on a Tuesday, something is wrong even if all other metrics are green.
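
One way to act on a traffic baseline is to compare the current request rate with what you normally see at that time of day. The sketch below assumes you already have both numbers; the function names and the 50% default threshold are hypothetical choices for illustration.

```python
def traffic_drop_ratio(current_rps, baseline_rps):
    """Fraction by which current traffic sits below the baseline.
    Returns 0.0 when traffic is at or above the baseline."""
    if baseline_rps <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline_rps - current_rps) / baseline_rps)


def traffic_is_anomalous(current_rps, baseline_rps, max_drop=0.5):
    """True when traffic has dropped by at least max_drop (assumed threshold)."""
    return traffic_drop_ratio(current_rps, baseline_rps) >= max_drop


# Hypothetical numbers: a usual Tuesday 2pm rate of 500 req/s, now 100 req/s.
print(traffic_drop_ratio(100, 500))
print(traffic_is_anomalous(100, 500))
```

The baseline itself is the hard part. A same-hour, same-weekday average over recent weeks is a common starting point, but that choice is an assumption here, not something this setup prescribes.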

3. Errors — What percentage of requests fail? Track error rate, not error count. 100 errors out of 1 million requests (0.01%) is fine. 100 errors out of 200 requests (50%) is an outage.
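
The rate-versus-count distinction is easy to sketch. The example below uses the two cases from the paragraph above; the helper names and the 1% alert threshold are assumptions chosen for illustration.

```python
def error_rate(errors, total):
    """Failed requests as a fraction of all requests (0.0 when idle)."""
    if total == 0:
        return 0.0
    return errors / total


def exceeds_error_budget(errors, total, threshold=0.01):
    """True when the error rate is at or above the assumed 1% threshold."""
    return error_rate(errors, total) >= threshold


# Same error count, very different situations.
print(exceeds_error_budget(100, 1_000_000))  # 0.01% of requests failing
print(exceeds_error_budget(100, 200))        # 50% of requests failing
```

An alert on raw error count would treat both cases identically, which is why the rate is the number to watch.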

4. Saturation — How full is your system? Database connections, memory, queue depth, thread pools. When any resource hits 80% utilization, you need to act — not because it's broken, but because you've lost your headroom.
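
A saturation check can be as simple as comparing used capacity against total capacity for each resource. This sketch reuses the percentages from the outage story at the top, plus a hypothetical 20-connection pool that is fully in use; the resource names, the pool size, and the function itself are illustrative assumptions.

```python
def saturated_resources(usage, threshold=0.80):
    """Return the names of resources at or above the utilization threshold.

    usage maps a resource name to a (used, capacity) pair.
    """
    return sorted(
        name
        for name, (used, capacity) in usage.items()
        if capacity > 0 and used / capacity >= threshold
    )


# CPU, memory, and disk all look healthy; the (assumed) connection pool does not.
usage = {
    "cpu_percent": (22, 100),
    "memory_percent": (45, 100),
    "disk_percent": (30, 100),
    "db_connections": (20, 20),
}

print(saturated_resources(usage))
```

Only the connection pool is flagged here, which is the kind of signal the 47-panel dashboard never surfaced.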

My Actual Monitoring Setup

For the Nexural platform:


Written by Jason Teixeira, Founder, Sage Ideas Studio