Question
At 14:50 an ops engineer ran a cleanup script intended to delete records for a single test tenant, but a missing `tenant_id` filter (a typo in a templated query) caused a `DELETE` to run unscoped against the production `documents` collection in MongoDB. Dashboards: document count for `documents` dropped from 220M to 9M in ~4 minutes; app 404 rate is climbing; the delete is still in flight (the script is paginating). PITR/continuous backups are enabled with a 7-day window; the cluster also has a 30-minute-delayed hidden replica. How do you stop the loss, recover the data, and minimize downtime?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.