Question
A new release passes its 1-hour canary cleanly (error rate, latency, saturation all green; canary served 2% of traffic) and auto-promotes to 100% at 11:00. By 11:25 the metrics/monitoring backend itself starts degrading: the Prometheus remote-write queue backs up, ingestion latency climbs, and several dashboards go blank. The app's own error rate is fine. Recent context: the release added a new histogram metric labeled by `customer_id` and `endpoint`. In the 2% canary the unique-series count was modest; at 100% the active series count jumped from ~400k to ~9M and is still climbing. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.