Question
Your sharded write path (a DynamoDB table or, with the same failure mode, a hash-sharded Postgres cluster such as Citus) starts throttling: one shard/partition is at 100% of its provisioned throughput and returning throttling errors, while the other 31 shards sit near 10%. p99 write latency on the affected key range spikes, and a subset of users can't save. Dashboards show a single partition key absorbing ~70% of traffic. Recent context: a celebrity or brand account with millions of followers has just gone viral, and all writes (likes, comments) are keyed by the target account ID. Total cluster capacity is far from exhausted; the load is concentrated on one shard. Triage and mitigate.
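To make the skew concrete, here is a minimal Python simulation that assumes plain hash-mod-N routing (real DynamoDB and Citus placement differ in detail). The 32 shards and the ~70% hot-key share come from the scenario. The per-shard capacity of 1,000 writes, the traffic volumes, and the names `shard_for` and `simulate` are illustrative assumptions. Every write for one account hashes to the same shard, so that shard's demand far exceeds its capacity while the cluster as a whole serves only around 12% of its total capacity.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 32  # from the scenario: 32 shards in total


def shard_for(key: str) -> int:
    """Stable hash-mod-N routing (an assumption): the same key always lands on the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def simulate(hot_key: str, hot_writes: int, background_keys: int, per_shard_capacity: int):
    """Route one viral key plus many ordinary keys (one write each) and report per-shard load."""
    demand = Counter()
    demand[shard_for(hot_key)] += hot_writes          # every write for the viral account
    for i in range(background_keys):
        demand[shard_for(f"user-{i}")] += 1           # hypothetical ordinary accounts
    # Writes beyond a shard's capacity are what clients see as throttling errors.
    throttled = {s: max(0, n - per_shard_capacity) for s, n in demand.items()}
    served = sum(min(n, per_shard_capacity) for n in demand.values())
    cluster_utilization = served / (per_shard_capacity * NUM_SHARDS)
    return demand, throttled, cluster_utilization


if __name__ == "__main__":
    demand, throttled, util = simulate("celebrity-account", 7000, 3000, 1000)
    hot = shard_for("celebrity-account")
    print(f"hot shard {hot}: demand={demand[hot]}, throttled={throttled[hot]}")
    print(f"other shards max demand: {max(n for s, n in demand.items() if s != hot)}")
    print(f"cluster utilization: {util:.1%}")
```

With these inputs only the hot account's shard should report throttled writes, while every other shard's demand sits near 10% of its capacity. This matches the dashboard pattern of one saturated shard among otherwise idle ones.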
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
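One widely used mitigation for a single hot key is write sharding. A suffix is appended to the partition key so that one logical account's writes spread across several physical keys and therefore several shards. AWS's DynamoDB best-practices guide describes this pattern under the name write sharding. The minimal sketch below assumes a like counter, an in-memory dict standing in for the sharded table, and hash-mod-32 routing. The class name, the `account_id#suffix` key format, and the fan-out of 16 are hypothetical choices. The cost lands on reads: a total now requires reading all `fanout` keys and summing them.

```python
import hashlib
import random
from collections import Counter, defaultdict
from typing import Optional

NUM_SHARDS = 32  # from the scenario; hash-mod-N routing is an assumption


def shard_for(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class SuffixShardedCounter:
    """Hypothetical like counter that spreads one logical key over `fanout` physical keys."""

    def __init__(self, fanout: int, rng: Optional[random.Random] = None) -> None:
        self.fanout = fanout
        self.rng = rng or random.Random()
        self.store = defaultdict(int)    # stands in for the sharded table
        self.shard_writes = Counter()    # writes landing on each shard

    def physical_key(self, account_id: str, suffix: int) -> str:
        return f"{account_id}#{suffix}"

    def record_like(self, account_id: str) -> None:
        # A random suffix sends consecutive writes for one account to different physical keys.
        key = self.physical_key(account_id, self.rng.randrange(self.fanout))
        self.store[key] += 1
        self.shard_writes[shard_for(key)] += 1

    def like_count(self, account_id: str) -> int:
        # Reads pay for the spread: fan out to every suffix and sum the partial counts.
        return sum(self.store.get(self.physical_key(account_id, s), 0) for s in range(self.fanout))


if __name__ == "__main__":
    counter = SuffixShardedCounter(fanout=16, rng=random.Random(42))
    for _ in range(7000):
        counter.record_like("celebrity-account")
    print("total likes:", counter.like_count("celebrity-account"))
    print("shards receiving writes:", len(counter.shard_writes))
    print("busiest shard writes:", max(counter.shard_writes.values()))
```

In practice you would likely enable suffixing only for keys flagged as hot, since fan-out reads are wasted on ordinary accounts. For pure counters, coalescing increments in a buffer before writing is another option worth weighing.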