Question
After a sudden power loss took down a Redis host at 03:00, the on-call promoted a replica and service resumed. By morning, users report lost data: session carts, recently posted draft comments, and reset counters. The dashboards show that the Redis cluster is healthy now; the promoted replica was persisting with RDB snapshots every 5 minutes (AOF disabled); the promoted replica's `master_repl_offset` corresponds to a point ~4 minutes before the last write the app logs show as acked; and the app treated Redis as the system of record for these features. How do you triage what was lost, mitigate ongoing impact, and recover what you can?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
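One way to make "triage what was lost" concrete is to replay the app's own ack log against the ~4-minute gap. The sketch below is only an illustration under stated assumptions: the `AckedWrite` record, the `lost_writes_by_feature` helper, and the `feature:id` key naming are all hypothetical, and it assumes the app logs carry a timestamp, key, and command for each acknowledged write. The date used in the sample data is arbitrary.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class AckedWrite(NamedTuple):
    """One acknowledged write parsed from the app logs (hypothetical shape)."""
    ts: datetime   # when the app logged the write as acked
    key: str       # Redis key, assumed to look like "cart:42"
    command: str   # e.g. "HSET", "INCR", "SET"


def lost_writes_by_feature(writes, last_acked_ts, gap):
    """Group writes acked inside the loss window by key prefix.

    The window is (last_acked_ts - gap, last_acked_ts]; anything in it
    was acked to the app but is assumed missing from the promoted replica.
    """
    cutoff = last_acked_ts - gap
    lost = defaultdict(list)
    for w in writes:
        if cutoff < w.ts <= last_acked_ts:
            feature = w.key.split(":", 1)[0]  # "cart:42" -> "cart"
            lost[feature].append(w)
    return dict(lost)


if __name__ == "__main__":
    # Illustrative data only; real input would come from parsed app logs.
    outage = datetime(2024, 1, 1, 3, 0, 0)
    sample = [
        AckedWrite(outage - timedelta(minutes=6), "cart:1", "HSET"),
        AckedWrite(outage - timedelta(minutes=3), "cart:2", "HSET"),
        AckedWrite(outage - timedelta(minutes=1), "counter:views", "INCR"),
    ]
    lost = lost_writes_by_feature(sample, outage, timedelta(minutes=4))
    for feature, items in sorted(lost.items()):
        print(feature, len(items))
```

Grouping by feature gives a per-feature count of affected keys for status updates, and the per-key command list is a starting point for deciding what can be replayed from logs (for example, idempotent `SET`-style writes) versus what can only be estimated (for example, counters).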