On-call · Hard · oc-g269

Subject: Canary failure · Level: Senior–Staff · ~40 min · Common in: Reliability & on-call interviews · Industries: Technology, Software development

Question

A new release passes its 1-hour canary cleanly (error rate, latency, saturation all green; canary served 2% of traffic) and auto-promotes to 100% at 11:00. By 11:25 the metrics/monitoring backend itself starts degrading: the Prometheus remote-write queue backs up, ingestion latency climbs, and several dashboards go blank. The app's own error rate is fine. Recent context: the release added a new histogram metric labeled by `customer_id` and `endpoint`. In the 2% canary the unique-series count was modest; at 100% the active series count jumped from ~400k to ~9M and is still climbing. How do you triage and mitigate?
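A rough cardinality estimate helps show why a 2% canary could look modest while full traffic did not. The sketch below assumes a Prometheus-style classic histogram, where each label combination emits one `_bucket` series per bucket (including `le="+Inf"`) plus `_sum` and `_count`. The function name, the bucket count, and the customer and endpoint counts are hypothetical values chosen for illustration; they are not taken from the incident. The result is an upper bound, since only label combinations that are actually observed create series.

```python
from math import prod


def histogram_series_upper_bound(label_cardinalities: dict[str, int],
                                 finite_buckets: int) -> int:
    """Upper bound on active series for one classic histogram metric."""
    # Per label combination: one series per finite bucket, one for
    # le="+Inf", plus the _sum and _count series.
    series_per_combination = finite_buckets + 1 + 2
    # Worst case: every combination of label values is observed.
    return prod(label_cardinalities.values()) * series_per_combination


if __name__ == "__main__":
    # Hypothetical: the canary's 2% slice only saw a small set of customers.
    canary = histogram_series_upper_bound(
        {"customer_id": 500, "endpoint": 40}, finite_buckets=11)
    # Hypothetical: at 100% every active customer eventually shows up.
    full = histogram_series_upper_bound(
        {"customer_id": 15_000, "endpoint": 40}, finite_buckets=11)
    print(f"canary upper bound: {canary:,}")
    print(f"full-traffic upper bound: {full:,}")
```

The point of the estimate is that series count scales with the number of distinct customers seen, not with request volume, so a canary gated on error rate, latency, and saturation has no signal for it unless series growth is itself a promotion check.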

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
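One way to ground the hypothesis in a real signal is to rank metric names by series count. The sketch below assumes the caller has already fetched and parsed the JSON from Prometheus's TSDB status endpoint (`/api/v1/status/tsdb`), whose `seriesCountByMetricName` field is a list of name/value pairs. The function name and the sample payload, including the metric names in it, are hypothetical.

```python
from typing import Any


def top_series_offenders(tsdb_status: dict[str, Any],
                         n: int = 3) -> list[tuple[str, int]]:
    """Return the n metric names with the most series, largest first."""
    entries = tsdb_status.get("data", {}).get("seriesCountByMetricName", [])
    # Sort by series count, descending, and keep the top n.
    ranked = sorted(entries, key=lambda e: int(e["value"]), reverse=True)
    return [(e["name"], int(e["value"])) for e in ranked[:n]]


if __name__ == "__main__":
    # Hypothetical payload shaped like a parsed TSDB status response.
    sample = {
        "status": "success",
        "data": {
            "seriesCountByMetricName": [
                {"name": "http_requests_total", "value": 120000},
                {"name": "request_duration_by_customer_bucket", "value": 7900000},
                {"name": "process_cpu_seconds_total", "value": 900},
            ]
        },
    }
    for name, count in top_series_offenders(sample, n=2):
        print(f"{name}: {count:,}")
```

If the new histogram dominates the ranking, that supports mitigating at the metric level (dropping or relabeling the offending series, or rolling back the release) rather than scaling the monitoring backend.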


