On-callHardoc-g619

Subject Database hot partition shardLevel Senior–Staff~45 minCommon in Databases & SQL · Distributed systems interviewsIndustries Technology

Question

Your sharded write path (a DynamoDB table, or equivalently a Postgres-Citus / hash-sharded cluster) starts throttling: one shard/partition is at 100% of its provisioned throughput and returning throttling errors while the other 31 shards sit near 10%. p99 write latency on the affected key range spikes and a subset of users can't save. Dashboards show one partition key absorbing ~70% of traffic. Recent context: a celebrity / brand account with millions of followers just went viral, and all writes (likes, comments) are keyed by the target account id. Total cluster capacity is far from exhausted — it's purely skewed to one shard. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.