Question
Your file-storage product keeps object metadata (path, size, ACLs, versions) in a partitioned NoSQL store (DynamoDB-style, partition key = parent_directory_id). On-call is paged: a subset of API calls return throttling/`ProvisionedThroughputExceeded` errors and p99 metadata-read latency spiked, while overall provisioned capacity utilization across the table reads only ~35%. Dashboards: one partition's consumed throughput is pegged at 100% while the rest are nearly idle; the errors all involve operations on one enormous directory that a customer is using as a dumping ground (millions of objects under a single parent_directory_id); a new 'sync everything' client started listing that directory in a tight loop this morning. How do you triage a hot-partition saturation on the metadata store?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.