On-callHardoc-g314

Subject Thundering herdLevel Senior–Staff~28 minCommon in Reliability & on-call interviewsIndustries Technology

Question

Every hour, almost exactly on the hour, your homepage feed service shows a 10–15 second cliff: cache hit rate drops from 98% to near 0, origin DB CPU spikes to saturation, p99 jumps 30x, and a burst of 500s appears — then it recovers until the next hour. The feed is cached in Redis with a 3600s TTL, and a nightly batch job repopulates many keys at the same wall-clock minute, so a large set of keys share the same expiry boundary. When they expire together, every request for those keys misses and hits the DB simultaneously. Recent context: a refactor last week consolidated cache writes into a single hourly warm job. Diagnose and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.