Code Room
On-call | Hard | oc-g467
Subject: Canary failure | Level: Senior–Staff | ~40 min | Common in: Distributed systems interviews | Industries: Technology

Question

A release of the `search` service passes a clean 45-minute canary at 3% and auto-promotes to 100% at 13:00. The canary's own SLOs (latency, error rate, CPU) stayed green the whole time and stay green after promotion.

But at 13:12, a SHARED downstream (the query-suggestion service used by search and four other teams) starts shedding load: its cache hit-rate collapses and its DB read load triples, and THOSE teams get paged.

Context: the new search release added a `suggest()` call keyed by the full raw query string (high cardinality) instead of a normalized key. At 3% canary that added a trickle of unique keys; at 100% it blew out the suggestion cache's working set.

Triage and explain why the canary missed it, then mitigate.
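To make the scaling effect concrete, here is a minimal simulation sketch of a shared LRU cache under the two rollout fractions. Everything numeric in it is an assumption chosen for illustration, not taken from the incident: the cache capacity, the number of hot normalized keys, uniform key popularity, and search accounting for half of the suggestion service's traffic. The `LRUCache` class and `simulate` function are hypothetical names.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that only tracks hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # a miss would become a DB read in production
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def simulate(rollout_fraction, search_share=0.5, hot_keys=4_000,
             capacity=5_000, requests=200_000, seed=7):
    """Return the cache hit rate when `rollout_fraction` of search traffic
    uses raw query strings as keys. All defaults are assumed values."""
    rng = random.Random(seed)
    cache = LRUCache(capacity)
    for i in range(requests):
        base = rng.randrange(hot_keys)
        from_new_search = (rng.random() < search_share
                           and rng.random() < rollout_fraction)
        if from_new_search:
            key = f"raw:{base}:{i}"  # raw query: effectively unique
        else:
            key = f"norm:{base}"     # normalized key: reused across callers
        cache.lookup(key)
    return cache.hits / requests


if __name__ == "__main__":
    print("hit rate at 3% canary:", round(simulate(0.03), 3))
    print("hit rate at 100%:     ", round(simulate(1.0), 3))
```

Under these assumptions the 3% run costs the cache only a small share of its hits, which is easy to lose in normal variance, while the 100% run evicts hot keys faster than they are reused. The drop is nonlinear in rollout fraction, and it shows up in a metric the search canary never watched.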

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
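For the "what prevents a repeat" part, one possible direction is sketched below: normalize the cache key on the caller side, and add a promotion gate that extrapolates the canary's new-key rate to full traffic. This is a sketch under stated assumptions, not the service's actual API: `normalize_key`, `projected_new_keys`, and `safe_to_promote` are hypothetical helpers, the normalization rules are illustrative, and the linear extrapolation assumes raw-query keys are effectively unique so their count scales with traffic.

```python
import unicodedata


def normalize_key(raw_query: str) -> str:
    """Collapse trivially different queries onto one cache key.
    The rules here (NFKC, casefold, whitespace collapse) are illustrative."""
    text = unicodedata.normalize("NFKC", raw_query).casefold()
    return " ".join(text.split())


def projected_new_keys(canary_new_keys_per_min: float,
                       canary_fraction: float,
                       window_min: float) -> float:
    """Linearly extrapolate distinct new keys from canary share to 100%
    over a window. Assumes keys are near-unique per request."""
    if not 0 < canary_fraction <= 1:
        raise ValueError("canary_fraction must be in (0, 1]")
    return canary_new_keys_per_min / canary_fraction * window_min


def safe_to_promote(canary_new_keys_per_min: float,
                    canary_fraction: float,
                    cache_headroom_keys: int,
                    window_min: float = 10.0) -> bool:
    """Block promotion if the projected key growth exceeds the downstream
    cache's spare capacity (headroom is an assumed, externally supplied value)."""
    projected = projected_new_keys(
        canary_new_keys_per_min, canary_fraction, window_min)
    return projected <= cache_headroom_keys


if __name__ == "__main__":
    print(normalize_key("  Red   SHOES "))
    # 300 new keys/min at a 3% canary, against 50,000 keys of headroom
    print(safe_to_promote(300, 0.03, cache_headroom_keys=50_000))
```

The gate only works if the downstream's per-caller key cardinality is measured during the canary in the first place; the caller's own latency, error-rate, and CPU SLOs carry no signal about it.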
