Question
A deploy this afternoon included a routine cache-key-format change, and the deploy script flushed the entire Redis cache to avoid serving stale-format entries. The instant the flush completed, origin DB CPU went to 100%, p99 latency spiked 25x, and a wave of timeouts hit. This lasted about 2 minutes, until the cache refilled, and then everything was fine. Hit rate went from 97% to 0% and climbed back. Traffic was completely normal throughout; nothing else changed. The team is now nervous about every deploy that touches the cache. How do you triage, and how do you make cache-flushing deploys safe?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
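One possible piece of the "what prevents a repeat" step is request coalescing (often called single-flight): when the cache is cold, concurrent misses on the same key share one origin query instead of each hitting the DB. The sketch below is illustrative only. It uses an in-process dict as a stand-in for Redis, and the names SingleFlightCache, loader, and demo are hypothetical, not taken from the team's code.

```python
import threading
import time


class SingleFlightCache:
    """Cache where concurrent misses on one key share a single origin load.

    Hypothetical sketch: a plain dict stands in for the real cache (Redis).
    """

    def __init__(self, loader):
        self._loader = loader      # function that queries the origin (e.g. the DB)
        self._values = {}          # key -> cached value
        self._inflight = {}        # key -> Event set when the in-progress load ends
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    # First caller to miss becomes the leader for this key.
                    event = threading.Event()
                    self._inflight[key] = event
            if leader:
                try:
                    value = self._loader(key)
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    # Wake followers whether the load succeeded or failed.
                    with self._lock:
                        del self._inflight[key]
                    event.set()
            # Follower: wait for the leader, then re-check the cache.
            # If the leader failed, one follower becomes the next leader.
            event.wait()


def demo(n_threads=20):
    """Simulate a cold cache hit by many concurrent requests for one key."""
    calls = []

    def slow_origin(key):
        calls.append(key)          # record each origin query
        time.sleep(0.05)           # pretend the DB query is slow
        return f"value-for-{key}"

    cache = SingleFlightCache(slow_origin)
    results = [None] * n_threads

    def worker(i):
        results[i] = cache.get("user:1")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results
```

Calling demo() issues 20 concurrent requests for one key against an empty cache; the origin function runs once and every caller receives the same value. Coalescing only limits duplicate loads per key, so it is a complement to, not a substitute for, avoiding a full flush of a cache that is absorbing 97% of reads.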