On-callHardoc-g330

Subject Thundering herdLevel Senior–Staff~28 minCommon in Reliability & on-call interviewsIndustries Technology

Question

A deploy this afternoon included a routine cache-key-format change, and the deploy script flushed the entire Redis cache to avoid serving stale-format entries. The instant the flush completed, origin DB CPU went to 100%, p99 spiked 25x, and a wave of timeouts hit — for about 2 minutes, until the cache refilled, then everything was fine. Hit rate went from 97% to 0% and climbed back. Traffic was completely normal throughout; nothing else changed. The team is now nervous about every deploy that touches the cache. How do you triage, and how do you make cache-flushing deploys safe?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.