Question
A Cassandra-backed queue-like table (rows are written, read, then deleted) develops a read-latency problem: p99 read latency on certain partitions went from 5 ms to 4 s, and some queries fail with `TombstoneOverwhelmingException` or read timeouts. Dashboards: tombstones scanned per read are in the tens of thousands for the slow partitions; the table is being used as a work queue (insert job, claim, delete when done); disk usage is fine and compaction is keeping up with write throughput; `gc_grace_seconds` is the default 10 days; the slow partitions are the oldest, busiest queues. There was no deploy; the problem grew over weeks. How do you triage a tombstone-driven (delete-driven) read-amplification problem on a durable store?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
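To make the mechanism behind the question concrete, here is a toy model of a single queue partition, written as a rough sketch rather than a description of Cassandra internals. It assumes that every completed job leaves one tombstone, that compaction purges a tombstone the moment it is older than `gc_grace_seconds` (a best case), and that a read for the next pending job scans the partition in clustering order. The names `Row`, `compact`, `read_head`, and `simulate`, and the rate of 5,000 jobs per day, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DAY = 24 * 3600
GC_GRACE_SECONDS = 10 * DAY  # the default cited in the question


@dataclass
class Row:
    job_id: int
    deleted_at: Optional[int] = None  # None = live row, otherwise tombstone time


def compact(partition: List[Row], now: int,
            gc_grace: int = GC_GRACE_SECONDS) -> List[Row]:
    """Drop tombstones at least gc_grace old (assumed best-case compaction)."""
    return [r for r in partition
            if r.deleted_at is None or now - r.deleted_at < gc_grace]


def read_head(partition: List[Row], limit: int = 1) -> Tuple[List[int], int]:
    """Scan in clustering order; count tombstones passed before `limit` live rows."""
    live: List[int] = []
    scanned = 0
    for r in partition:
        if r.deleted_at is not None:
            scanned += 1
            continue
        live.append(r.job_id)
        if len(live) >= limit:
            break
    return live, scanned


def simulate(jobs_per_day: int, days: int) -> int:
    """Complete jobs_per_day jobs each day, then read the one pending job.

    Returns the number of tombstones the head read has to scan.
    """
    partition: List[Row] = []
    next_id = 0
    for day in range(days):
        now = day * DAY
        for _ in range(jobs_per_day):
            # Job inserted and deleted the same day: only the tombstone remains.
            partition.append(Row(next_id, deleted_at=now))
            next_id += 1
        partition = compact(partition, now)
    partition.append(Row(next_id))  # a single pending job at the tail
    _, scanned = read_head(partition)
    return scanned
```

Under these assumptions, `simulate(5000, 30)` returns 50000: the read passes ten days' worth of tombstones to find one live row, and the count stops growing only once the queue is older than `gc_grace_seconds`. That matches the shape of the reported symptoms (gradual growth over weeks, worst on the oldest and busiest queues, no disk or compaction-throughput alarm), although a real cluster adds effects this sketch ignores, such as tombstones spread across SSTables and unrepaired data.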