On-callHardoc-g317

Subject Scaling limitsLevel Senior–Staff~30 minCommon in Distributed systems interviewsIndustries Technology

Question

Your time-series ingestion service writes to a sharded datastore keyed by `metric_name`. After onboarding a huge customer, write latency and throttling errors spiked, even though you added shards and aggregate cluster utilization is only ~40%. Dashboards show one shard at 100% while the rest idle: that customer emits one extremely high-volume metric, so nearly all its writes hash to a single partition. Adding shards didn't help — the hot key still lands on one. How do you triage and mitigate, and what's the durable fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.