On-call | Hard | oc-g285

Subject: Rollback | Level: Senior–Staff | ~40 min | Common in Reliability & on-call interviews | Industries: Technology, Software development

Question

A deploy ran a destructive migration: it merged two columns, `first_name` and `last_name`, into a single `full_name` column and DROPPED the originals, and the new code reads and writes only `full_name`. Twenty minutes post-deploy, a serious bug is found in the new name-handling code. The on-call's instinct is to roll back the deploy. But the rolled-back *code* would read `first_name`/`last_name`, which no longer exist, so a naive rollback would turn a partial outage into a total one. Dashboards show: the new code is erroring at ~6% on name-render paths; the old columns are gone; the `full_name` column is fully populated. How do you reason about this, and what do you actually do?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
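One way to make the "naive rollback" risk concrete is a pre-rollback schema check: before reverting code, confirm that the columns the old release reads still exist. The sketch below is illustrative only. It uses an in-memory SQLite database, and the `users` table name, the `REQUIRED_COLUMNS` map, and the helper functions are all hypothetical, not taken from the scenario's real system.

```python
import sqlite3

# Hypothetical map of which columns each release reads/writes.
REQUIRED_COLUMNS = {
    "old": {"first_name", "last_name"},
    "new": {"full_name"},
}


def table_columns(conn, table):
    """Return the set of column names currently present on `table`."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {row[1] for row in rows}  # row[1] is the column name


def missing_for_release(conn, table, release):
    """Columns the given release needs that the live schema lacks."""
    return REQUIRED_COLUMNS[release] - table_columns(conn, table)


def safe_to_roll_back(conn, table="users"):
    """A code rollback is only safe if the old release's columns still exist."""
    return not missing_for_release(conn, table, "old")


def demo():
    """Build the post-migration state from the scenario (assumed table name)."""
    conn = sqlite3.connect(":memory:")
    # Only full_name remains; first_name and last_name were dropped.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")
    return conn


if __name__ == "__main__":
    conn = demo()
    print("safe to roll back code:", safe_to_roll_back(conn))
    print("old release is missing:", sorted(missing_for_release(conn, "users", "old")))
```

Against the scenario's schema, the check reports that the old release's columns are missing, which is the signal that reverting the code alone would break every name path rather than the ~6% currently failing. A check like this says nothing about how to mitigate; it only rules out the unsafe option quickly.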
