On-call · Hard · oc-g126

Subject: Data loss
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

After a sudden power loss took down a Redis host at 03:00, the on-call engineer promoted a replica and service resumed. By morning, users report lost data: session carts are gone, recently posted draft comments are missing, and counters have reset. The dashboards show that the Redis cluster is healthy now. The promoted replica was using RDB snapshots every 5 minutes (AOF disabled). Its `master_repl_offset` after promotion corresponds to a point ~4 minutes before the last write that the app logs show as acknowledged. The app treated Redis as the system of record for these features. How do you triage what was lost, mitigate ongoing impact, and recover what you can?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
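
One way to make "form hypotheses from real signals" concrete is to turn the app's write logs into a list of keys that were acknowledged but may not have survived the failover. The sketch below is illustrative only: the `AckedWrite` record, the `feature:id` key naming, the timestamps, and the `lost_write_candidates` helper are all assumptions, not part of the scenario. In practice the cutoff has to be established from evidence (what the promoted replica actually contains), not assumed.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, NamedTuple


class AckedWrite(NamedTuple):
    """One acknowledged write parsed from app logs (hypothetical format)."""
    ts: datetime
    key: str
    op: str


def lost_write_candidates(
    writes: Iterable[AckedWrite],
    cutoff: datetime,
    failover: datetime,
) -> Dict[str, List[str]]:
    """Group keys with acked writes in (cutoff, failover] by feature prefix.

    cutoff   -- last point in time the promoted replica's data covers
    failover -- when the original primary went down
    Assumes keys look like "feature:id"; the prefix names the feature.
    """
    lost = defaultdict(set)
    for w in writes:
        if cutoff < w.ts <= failover:
            feature = w.key.split(":", 1)[0]
            lost[feature].add(w.key)
    return {feature: sorted(keys) for feature, keys in lost.items()}


def _t(hour: int, minute: int, second: int = 0) -> datetime:
    # Arbitrary date; only the time of day matters for this example.
    return datetime(2024, 1, 1, hour, minute, second, tzinfo=timezone.utc)


# Made-up sample data: a ~4 minute gap before the 03:00 power loss.
CUTOFF = _t(2, 56)
FAILOVER = _t(3, 0)
SAMPLE_WRITES = [
    AckedWrite(_t(2, 55), "cart:u1", "HSET"),              # before cutoff
    AckedWrite(_t(2, 57), "cart:u2", "HSET"),              # in the gap
    AckedWrite(_t(2, 58), "draft:c9", "SET"),              # in the gap
    AckedWrite(_t(2, 59, 30), "counter:likes:42", "INCR"), # in the gap
    AckedWrite(_t(3, 5), "cart:u2", "HSET"),               # after promotion
]

if __name__ == "__main__":
    for feature, keys in lost_write_candidates(SAMPLE_WRITES, CUTOFF, FAILOVER).items():
        print(feature, keys)
```

A per-feature list like this gives the status update something specific to say (which features, roughly how many keys) and points to where recovery is plausible: writes whose values were logged can be replayed, while counters may need to be rebuilt from another source.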
