Question
Your multi-tenant document platform enforces a per-tenant storage quota. On-call is paged: one large enterprise tenant reports that ~4% of their document uploads over the last hour are 'succeeding' in the UI but then showing as corrupt or truncated when reopened. Dashboards: that tenant just crossed its provisioned quota ceiling around the time the issue started; the storage layer's `quota_exceeded` rejections climbed from 0 to a few hundred/min; upload-success metric is near-100% but document-integrity check failures spiked; multipart uploads show some parts written and a final part rejected. No node is out of physical disk — this is a logical quota, not disk-full. How do you triage and fix mid-write quota failures producing partial/corrupt objects?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.