Question
A Cassandra/RocksDB-style node starts throwing read errors on a subset of keys: queries for certain partitions return `CorruptSSTableException` / checksum-mismatch errors, and the node is logging block-checksum failures during compaction. Observations so far: the read error rate on that one node has climbed to ~2%; the other replicas serve the same keys fine; SMART data on one NVMe drive shows rising reallocated-sector and uncorrectable-error counts; the corrupted SSTable was written 6 weeks ago and is only now being read; replication factor is 3; there has been no recent deploy. How do you triage a corrupted SSTable / silent durability fault and recover without losing data or propagating the corruption?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.