Code Room
On-callMediumoc-g582
Subject Backup silent failureLevel Mid–Senior~30 minCommon in Storage & CDN · Reliability & on-call interviewsIndustries Technology

Question

During a routine restore drill you discover the latest usable database backup is 6 days old, even though the backup job is scheduled nightly and its dashboard tile is green. Digging in: the cron ran every night and exited 0, but the uploaded artifacts in the backup bucket stop 6 days ago, and the job logs show 'partial upload, retrying' warnings that never escalated. Six days ago the team rotated the object-storage credentials and bumped the backup tool's version. There is no active outage, but you are now well outside your stated RPO. How do you triage, restore your backup coverage immediately, and make sure this can't silently recur?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Diagram & narrate the incident
Loading whiteboard…
Run or narrate your approach, then ask the coach.