On-callHardoc-g514

Subject Thundering herdLevel Senior–Staff~30 minCommon in Reliability & on-call interviewsIndustries Technology

Question

Your read path is fronted by a large Redis cluster that normally absorbs 98% of traffic. After a planned cluster-wide Redis restart for a version upgrade at 02:00, the cache comes back completely empty. The instant the app fleet starts serving again, origin DB CPU pins at 100%, p99 jumps 40x, and the DB shows thousands of identical queries per popular key — the warm-up never converges and the DB stays saturated for 20+ minutes. CPU on the app tier is fine. How do you triage and mitigate the cold-cache stampede?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.