On-callHardoc-g119

Subject Data lossLevel Senior–Staff~45 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

At 14:50 an ops engineer ran a cleanup script intended to delete records for a single test tenant, but a missing `tenant_id` filter (a typo in a templated query) caused a `DELETE` to run unscoped against the production `documents` collection in MongoDB. Dashboards: document count for `documents` dropped from 220M to 9M in ~4 minutes; app 404 rate is climbing; the delete is still in flight (the script is paginating). PITR/continuous backups are enabled with a 7-day window; the cluster also has a 30-minute-delayed hidden replica. How do you stop the loss, recover the data, and minimize downtime?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.