On-callHardoc-g331

Subject CorruptionLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology

Question

A feature-store service packs each user's float32 feature vector into a fixed-width binary column and reads it back on every scoring request. At 15:00 a deploy bumped the serialization helper to a new internal library. No errors, no latency change. By 17:00, ML eng reports that scores for users who were *re-written* after 15:00 are wildly off (e.g. a churn probability of 0.02 reads back as ~3.2e38), while users not touched since the deploy score normally. Dashboards: write rate normal, deserialize error rate flat at 0. Context: the new library defaults to little-endian packing; the reader still assumes big-endian (the old default). How do you triage, stop further corruption, and recover the affected vectors?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.