On-call · Hard · oc-g593

Subject: Durability, bitrot scrub · Level: Senior–Staff · ~40 min · Common in Storage & CDN interviews · Industries: Technology

Question

Your storage cluster's periodic integrity scrub flags a rising number of checksum mismatches: stored object checksums no longer match the data on disk for a growing set of objects, all concentrated on two older storage nodes. There's no outage and reads still 'succeed', but some clients have recently reported occasional corrupted downloads. The two nodes are past their planned hardware refresh and one has logged ECC memory errors this week. Data is 3x replicated with per-object checksums. How do you triage this silent-corruption signal, protect data integrity, and recover any damaged copies?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
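One way to make the mitigate-then-recover step concrete is the small Python sketch below. It is illustrative only: the question does not say which checksum algorithm or repair tooling the cluster uses, so SHA-256, the node names, and the helpers `classify_replicas`, `repair_plan`, and `mismatches_by_node` are all assumptions. The idea it shows is to verify every replica against the stored checksum, repair only from a copy that has itself been verified, and count mismatches per node to confirm the corruption really is concentrated on the two old nodes.

```python
from __future__ import annotations

import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Checksum helper; SHA-256 is an assumption, not the cluster's known algorithm."""
    return hashlib.sha256(data).hexdigest()


def classify_replicas(expected: str, replicas: dict[str, bytes]) -> tuple[list[str], list[str]]:
    """Split replica-holding nodes into (verified, mismatched) against the stored checksum."""
    good: list[str] = []
    bad: list[str] = []
    for node, data in sorted(replicas.items()):
        (good if sha256_hex(data) == expected else bad).append(node)
    return good, bad


def repair_plan(expected: str, replicas: dict[str, bytes]) -> dict:
    """Choose a verified source replica; never re-replicate from an unverified copy."""
    good, bad = classify_replicas(expected, replicas)
    if not bad:
        return {"status": "healthy", "source": None, "targets": []}
    if not good:
        # No replica matches the stored checksum: escalate (backups, or suspect
        # the checksum metadata itself) instead of overwriting anything.
        return {"status": "unrecoverable_from_replicas", "source": None, "targets": bad}
    return {"status": "repair", "source": good[0], "targets": bad}


def mismatches_by_node(objects: list[tuple[str, dict[str, bytes]]]) -> Counter:
    """Count mismatched replicas per node to see where corruption concentrates."""
    counts: Counter = Counter()
    for expected, replicas in objects:
        _, bad = classify_replicas(expected, replicas)
        counts.update(bad)
    return counts


# Hypothetical example: one object, three replicas, one silently corrupted.
payload = b"object-bytes"
expected = sha256_hex(payload)
replicas = {"node-a": payload, "node-old-1": b"object-bxtes", "node-old-2": payload}
plan = repair_plan(expected, replicas)
```

In this sketch the plan names `node-a` as the repair source and `node-old-1` as the target. In a real incident the suspect nodes would typically be taken out of the read path before any repair runs, so that clients stop receiving corrupted copies while the damaged replicas are rebuilt.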

