Question
Every hour, almost exactly on the hour, your homepage feed service shows a 10–15 second cliff: cache hit rate drops from 98% to near 0, origin DB CPU spikes to saturation, p99 jumps 30x, and a burst of 500s appears — then it recovers until the next hour. The feed is cached in Redis with a 3600s TTL, and a nightly batch job repopulates many keys at the same wall-clock minute, so a large set of keys share the same expiry boundary. When they expire together, every request for those keys misses and hits the DB simultaneously. Recent context: a refactor last week consolidated cache writes into a single hourly warm job. Diagnose and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.