Code Room
On-call · Medium · oc-g242
Subject: Database incidents
Level: Mid–Senior · ~30 min
Common in: Databases & SQL · Reliability & on-call interviews
Industries: Technology, Software development

Question

A product-listing page is cached in Redis with a 5-minute TTL; the underlying query is an expensive aggregation against Postgres. Normally the DB sits at 15% CPU. On each 5-minute boundary (12:00, 12:05, ...) the DB briefly pegs at 100% CPU, query queue depth explodes, and the page's p99 latency spikes to several seconds before recovering. Traffic to the page is high and steady. A recent change unified the TTL across all listing keys. Triage, mitigate the spikes, and prevent recurrence.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
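One plausible reading of the signals in the question is a cache stampede: with a unified TTL, many listing keys expire together, and steady traffic sends a burst of identical aggregations to Postgres at once. The sketch below illustrates two common mitigations under that assumption: adding jitter to the TTL so expirations spread out, and collapsing concurrent rebuilds of the same key into a single query. The names `jittered_ttl`, `SingleFlight`, and the 20% spread are hypothetical choices for illustration, not part of the original question.

```python
import random
import threading

BASE_TTL_SECONDS = 300  # the 5-minute TTL from the question


def jittered_ttl(base=BASE_TTL_SECONDS, spread=0.2):
    """Return the base TTL scaled by a random factor in [1 - spread, 1 + spread].

    The 20% default spread is an assumed value; tune it to the workload.
    """
    return int(base * random.uniform(1 - spread, 1 + spread))


class SingleFlight:
    """Collapse concurrent loads of one key into a single loader call.

    In-process only: with several app servers, each process may still
    issue one query per key, which is usually far fewer than one per request.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> shared entry for the load in progress

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = entry

        if leader:
            # Only the first caller runs the expensive query.
            try:
                entry["value"] = loader()
            except Exception as exc:
                entry["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                entry["done"].set()
        else:
            # Followers wait for the leader's result instead of hitting the DB.
            entry["done"].wait()

        if entry["error"] is not None:
            raise entry["error"]
        return entry["value"]
```

In a Redis-backed service, the value from `jittered_ttl()` would be passed as the expiry when the key is written (for example, the `ex` argument of redis-py's `set`), and the cache-miss path would call `SingleFlight.do(key, run_aggregation)`, where `run_aggregation` stands in for the real Postgres query. Jitter addresses the synchronized expiry; single-flight limits the damage when a hot key does expire.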
