On-call · Hard · oc-g657

Subject: WAL disk fsync tombstone read amplification
Level: Mid–Senior
Time: ~35 min
Common in Storage & CDN interviews
Industries: Technology

Question

A Cassandra-backed, queue-like table (rows are written, read, then deleted) develops a read-latency problem: p99 read latency on certain partitions went from 5 ms to 4 s, and some queries fail with `TombstoneOverwhelmingException` or read timeouts. The dashboards show the following: tombstones scanned per read are in the tens of thousands for the slow partitions; the table is being used as a work queue (insert job, claim, delete when done); disk usage is fine and compaction is keeping up with write throughput; `gc_grace_seconds` is at the default of 10 days; the slow partitions are the oldest, busiest queues. There was no deploy; the problem grew over weeks. How do you triage a tombstone-driven (delete-driven) read-amplification problem on a durable store?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
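One way to turn the dashboard signals into a testable hypothesis is a toy model of a queue partition. The sketch below is not Cassandra's read path: `Cell`, `build_queue`, `scan_head`, and `cells_in_bucket` are hypothetical names, and it assumes that tombstones younger than `gc_grace_seconds` must be read and skipped, while older ones have already been purged by compaction (in practice they are only eligible for purging). It also compares a head-of-queue read against one wide partition with a read restricted to a time bucket, which is one possible mitigation and an assumption here, not something the question prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

GC_GRACE_SECONDS = 10 * 24 * 3600  # default gc_grace_seconds: 864000 s


@dataclass
class Cell:
    """One row in a queue partition (hypothetical model, not a Cassandra type)."""
    job_id: int
    enqueued_at: float
    deleted_at: Optional[float] = None  # None means the row is still live


def build_queue(n_jobs: int, interval_s: float) -> List[Cell]:
    """Enqueue one job every interval_s seconds; all but the newest are done and deleted."""
    cells = [
        Cell(job_id=i, enqueued_at=i * interval_s, deleted_at=i * interval_s + 1.0)
        for i in range(n_jobs - 1)
    ]
    cells.append(Cell(job_id=n_jobs - 1, enqueued_at=(n_jobs - 1) * interval_s))
    return cells


def scan_head(cells: List[Cell], now: float, limit: int = 1,
              gc_grace: float = GC_GRACE_SECONDS) -> Tuple[List[int], int]:
    """Model a 'give me the next `limit` live jobs' read in clustering order.

    Returns (live job ids, tombstones scanned on the way).
    """
    live: List[int] = []
    tombstones = 0
    for cell in cells:  # oldest first, like an ascending clustering key
        if cell.deleted_at is None:
            live.append(cell.job_id)
            if len(live) == limit:
                break
        elif now - cell.deleted_at < gc_grace:
            tombstones += 1  # within gc_grace: must be read and skipped
        # else: assumed already purged by compaction (a simplification)
    return live, tombstones


def cells_in_bucket(cells: List[Cell], bucket_s: float, bucket_id: int) -> List[Cell]:
    """Cells that a time-bucketed partition key would place in bucket_id."""
    return [c for c in cells if int(c.enqueued_at // bucket_s) == bucket_id]


if __name__ == "__main__":
    interval_s = 10.0
    partition = build_queue(30_000, interval_s)   # about 3.5 days of jobs
    now = 30_000 * interval_s

    live, scanned = scan_head(partition, now)
    print("single wide partition:", live, "tombstones scanned:", scanned)

    bucket_s = 3600.0                             # hypothetical hourly bucket
    current = cells_in_bucket(partition, bucket_s, int(now // bucket_s))
    live_b, scanned_b = scan_head(current, now)
    print("current hourly bucket:", live_b, "tombstones scanned:", scanned_b)
```

In this model the tombstone count for the wide partition grows with the delete rate multiplied by the time since the oldest unpurged delete, which matches the "oldest, busiest queues are slowest" signal. A real bucketed consumer would also need to check earlier buckets for jobs left unfinished, which the sketch does not cover.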

