On-call | Hard | oc-g645

Subject: Cache eviction storm storage origin
Level: Mid–Senior
Time: ~30 min
Common in Storage & CDN interviews
Industries: Technology

Question

Your read path is a large Redis cache fronting a Postgres durable store for a product-catalog API. At 14:02 the catalog DB's CPU and disk-read IOPS saturate and API p99 jumps from 20 ms to 2.4 s. The dashboards show that the Redis hit rate cratered from 94% to 31% over about two minutes, Redis `evicted_keys` spiked hard during that window, Redis memory hit `maxmemory` right before the drop, and the origin (Postgres) query rate went from 4k/s to 70k/s. A deploy at 13:55 added a new denormalized field to cached catalog objects, increasing each cached entry's size by roughly 3x. How do you triage, mitigate, and prevent recurrence?
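A quick sanity check on the numbers in the question can be sketched as follows. It assumes, purely for illustration, that every cache miss produces exactly one origin query and that API traffic stayed flat across the incident; neither assumption is stated in the question, and the function names are hypothetical.

```python
def implied_request_rate(origin_qps: float, hit_rate: float) -> float:
    """Total cache lookups per second, assuming one origin query per miss."""
    return origin_qps / (1.0 - hit_rate)


def expected_origin_qps(request_qps: float, hit_rate: float) -> float:
    """Origin queries per second at a given hit rate, same assumption."""
    return request_qps * (1.0 - hit_rate)


# Figures taken from the question.
baseline_origin_qps = 4_000
observed_origin_qps = 70_000

requests = implied_request_rate(baseline_origin_qps, hit_rate=0.94)
expected = expected_origin_qps(requests, hit_rate=0.31)
amplification = observed_origin_qps / expected

print(f"implied request rate: {requests:,.0f}/s")
print(f"expected origin rate at 31% hits: {expected:,.0f}/s")
print(f"observed / expected: {amplification:.2f}x")
```

Under those assumptions the implied request rate is about 67k/s and the expected origin rate at a 31% hit rate is about 46k/s, so the observed 70k/s is roughly 1.5x higher than the hit-rate drop alone would predict. One possible reading is that something is amplifying misses (client retries on timeout, several queries per rebuilt entry, or concurrent rebuilds of the same key), which is worth confirming from real signals before acting on it.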

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
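As one way to "form hypotheses from real signals", the sketch below derives an interval hit rate, eviction rate, and memory fill from two snapshots of Redis `INFO` counters. The field names (`keyspace_hits`, `keyspace_misses`, `evicted_keys`, `used_memory`, `maxmemory`) are the ones Redis reports; the helper `interval_signals` and the sample values are hypothetical, and the snapshots are passed in as plain dicts so the example runs without a live server.

```python
from typing import Optional


def interval_signals(before: dict, after: dict, seconds: float) -> dict:
    """Compare two INFO-style counter snapshots taken `seconds` apart."""
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    lookups = hits + misses
    hit_rate: Optional[float] = hits / lookups if lookups else None

    evicted = after["evicted_keys"] - before["evicted_keys"]

    # maxmemory == 0 means no limit is configured, so fill is undefined.
    limit = after["maxmemory"]
    fill: Optional[float] = after["used_memory"] / limit if limit else None

    return {
        "hit_rate": hit_rate,
        "evictions_per_s": evicted / seconds,
        "memory_fill": fill,
    }


# Illustrative snapshots only; these are not values from the incident.
t0 = {"keyspace_hits": 1_000_000, "keyspace_misses": 60_000,
      "evicted_keys": 500, "used_memory": 7 * 2**30, "maxmemory": 8 * 2**30}
t1 = {"keyspace_hits": 1_310_000, "keyspace_misses": 750_000,
      "evicted_keys": 600_500, "used_memory": 8 * 2**30, "maxmemory": 8 * 2**30}

signals = interval_signals(t0, t1, seconds=60)
print(signals)
```

A low interval hit rate together with a high eviction rate and memory at the limit would be consistent with the cache no longer holding the working set, which fits the hypothesis that the 3x larger entries from the 13:55 deploy are the root cause and the origin saturation is the symptom.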


Diagram & narrate the incident


Run or narrate your approach, then ask the coach.