On-callMediumoc-g623

Subject Query cache stampedeLevel Mid–Senior~35 minCommon in Databases & SQL · Distributed systems interviewsIndustries Technology

Question

At the top of every hour your primary database briefly melts: CPU spikes to 100%, query latency balloons, and the app sheds load for ~30 seconds, then recovers — like clockwork. Between spikes everything is healthy. Dashboards show the spike correlates with the Redis cache for an expensive aggregate query (a homepage 'trending items' rollup that scans millions of rows). The cache entry has a 1-hour TTL set at deploy time, so it expires for everyone simultaneously; on a miss, every app instance independently recomputes the expensive query against the DB at once. Traffic is otherwise flat. Triage and fix.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.