On-callHardoc-g644

Subject Storage compaction stallLevel Mid–Senior~35 minCommon in Storage & CDN interviewsIndustries Technology

Question

A Cassandra cluster serving a write-heavy time-series workload starts rejecting and stalling writes. Symptom: client p99 write latency spikes from 8ms to 9s, with intermittent `WriteTimeoutException`s. Dashboards: pending compactions on three nodes have climbed from near-zero to 4,800 and are still rising; SSTable count per table is in the tens of thousands; disk is only 61% full so it's not a space problem; the compaction throughput throttle (`compaction_throughput_mb_per_sec`) is pinned at its configured ceiling; one node shows a single STCS compaction that's been running for 9 hours on a 600 GB SSTable. A bulk backfill job was started yesterday afternoon. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.