Question
A release of the `search` service passes a clean 45-minute canary at 3% and auto-promotes to 100% at 13:00. The canary's own SLOs (latency, error rate, CPU) stayed green the whole time and remain green after promotion. But at 13:12, a shared downstream, the query-suggestion service used by search and four other teams, starts shedding load: its cache hit rate collapses, its DB read load triples, and those other teams get paged. Context: the new search release added a `suggest()` call keyed by the full raw query string (high cardinality) instead of a normalized key. At 3% canary that added only a trickle of unique keys; at 100% it blew out the suggestion cache's working set. Triage and explain why the canary missed it, then mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
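
To make the key-cardinality mechanism in the scenario concrete, here is a minimal toy sketch of a shared LRU cache receiving a mix of normalized and raw keys. Everything in it is an assumption for illustration, not taken from the real services: the names `normalize_key`, `raw_variant`, `LRUCache`, and `simulate_hit_rate`, the 100 hot queries, the 150-entry cache, and the guess that search accounts for 50% of suggestion traffic. Raw queries are modeled as case variants of the same underlying query, which is only one of many ways raw strings can differ.

```python
from collections import OrderedDict


def normalize_key(raw_query: str) -> str:
    """Hypothetical normalization: lowercase, trim, collapse whitespace."""
    return " ".join(raw_query.lower().split())


def raw_variant(base: str, i: int) -> str:
    """Stand-in for a high-cardinality raw query: flip letter case using the bits of i."""
    out, bit = [], 0
    for ch in base:
        if ch.isalpha():
            out.append(ch.upper() if (i >> bit) & 1 else ch)
            bit += 1
        else:
            out.append(ch)
    return "".join(out)


class LRUCache:
    """Tiny LRU that only tracks key presence and hit/miss counts."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: OrderedDict[str, bool] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, key: str) -> None:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1
        self._items[key] = True
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def simulate_hit_rate(rollout_pct: int, normalize: bool, n_requests: int = 20_000,
                      n_intents: int = 100, capacity: int = 150,
                      search_share_pct: int = 50) -> float:
    """Overall hit rate of the shared cache at a given rollout percentage.

    All parameters are made-up; search_share_pct is the assumed share of
    suggestion traffic that comes from search.
    """
    new_per_10k = rollout_pct * search_share_pct  # e.g. 3% of 50% = 150 per 10,000
    cache = LRUCache(capacity)
    steady = 0  # counter for requests that already use normalized keys
    for i in range(n_requests):
        # Spread new-release requests evenly through the stream.
        from_new_release = (i * new_per_10k) // 10_000 != ((i + 1) * new_per_10k) // 10_000
        if from_new_release:
            raw = raw_variant(f"cheap flights to city {i % n_intents}", i)
            key = normalize_key(raw) if normalize else raw
        else:
            key = f"cheap flights to city {steady % n_intents}"
            steady += 1
        cache.lookup(key)
    return cache.hit_rate()


if __name__ == "__main__":
    print("3% canary, raw keys:        ", simulate_hit_rate(3, normalize=False))
    print("100% rollout, raw keys:     ", simulate_hit_rate(100, normalize=False))
    print("100% rollout, normalized:   ", simulate_hit_rate(100, normalize=True))
```

Under these made-up parameters, the 3% run keeps the hot set resident because only a few one-off keys arrive between reuses of any hot key, while the 100% raw-key run pushes more distinct keys through the cache than it can hold and evicts the hot set before it is reused. None of this shows up in the caller's own latency, error rate, or CPU, which is consistent with the canary staying green. The normalized run is there to show the effect of collapsing variants back onto the existing key space; a real fix would depend on how the suggestion service actually defines its keys.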