On-callHardoc-g425

Subject Hot partitionLevel Senior–Staff~35 minCommon in Distributed systems interviewsIndustries Technology

Question

Your write path uses a horizontally sharded datastore (data routed by `tenant_id` hash to 64 shards). During a product launch, one tenant goes viral. Shard 23's CPU and write latency spike to 10x the fleet median, its queue depth climbs, and a slice of traffic — all for that one big tenant — sees timeouts, while the other 63 shards sit nearly idle. Adding more shards in the past didn't help similar events. No deploy; pure traffic. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.