Question
A Cassandra cluster serving a write-heavy time-series workload starts rejecting and stalling writes. Symptom: client p99 write latency spikes from 8ms to 9s, with intermittent `WriteTimeoutException`s. Dashboards: pending compactions on three nodes have climbed from near-zero to 4,800 and are still rising; SSTable count per table is in the tens of thousands; disk is only 61% full so it's not a space problem; the compaction throughput throttle (`compaction_throughput_mb_per_sec`) is pinned at its configured ceiling; one node shows a single STCS compaction that's been running for 9 hours on a 600 GB SSTable. A bulk backfill job was started yesterday afternoon. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.