Code Room
System design · Hard · sd-g437
Subject: Backfill · Level: Senior–Staff · ~45 min · Common in: Distributed systems interviews · Industries: Technology, Software development

Question

You must migrate a 20 TB event history spanning 5 years from a legacy schema/storage (old Avro on HDFS with a deprecated partition scheme) to a new lakehouse table with a redesigned schema, while new events keep landing every second and production queries must keep serving the whole time. A single giant batch job has already failed twice (first an OOM, then a transient cluster outage at hour 9), and it has no clean restart point. Design a resumable, verifiable backfill and cutover that tolerates failures and proves the migrated data is correct before you switch reads over.

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
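
One way to make the "resumable, verifiable" part concrete is to split the history into small chunks (for example one day-partition each), record each chunk's outcome in a ledger, and only mark a chunk done after reading it back and comparing fingerprints. The sketch below is a minimal single-process illustration of that idea, not a production design: the ledger schema, the function names (`open_ledger`, `fingerprint`, `run_backfill`) and the four I/O callbacks are all hypothetical, SQLite stands in for whatever durable metadata store you would actually use, and a real job would run the chunks on a distributed engine rather than in a Python loop.

```python
import hashlib
import json
import sqlite3

def open_ledger(path=":memory:"):
    """Open (or create) the hypothetical checkpoint ledger: one row per chunk."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "chunk_id TEXT PRIMARY KEY, status TEXT NOT NULL, "
        "row_count INTEGER, checksum TEXT)"
    )
    db.commit()
    return db

def fingerprint(rows):
    """Return an order-independent (row_count, checksum) pair.

    Per-row SHA-256 digests are summed modulo 2**256, so row order does not
    matter but a dropped or duplicated row changes the result.
    Rows are assumed to be JSON-serializable dicts.
    """
    count, acc = 0, 0
    for row in rows:
        digest = hashlib.sha256(json.dumps(row, sort_keys=True).encode()).digest()
        acc = (acc + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, f"{acc:064x}"

def run_backfill(db, chunk_ids, read_source, transform, write_target, read_target):
    """Migrate each chunk once; return the chunk ids that failed verification.

    Assumption: write_target(chunk_id, rows) overwrites the whole chunk, so
    re-running a chunk after a crash cannot create duplicates.
    """
    failed = []
    for chunk_id in chunk_ids:
        done = db.execute(
            "SELECT status FROM chunks WHERE chunk_id = ?", (chunk_id,)
        ).fetchone()
        if done and done[0] == "verified":
            continue  # resume point: verified chunks are never redone

        # Chunks are assumed small enough to hold in memory in this sketch.
        expected = [transform(row) for row in read_source(chunk_id)]
        write_target(chunk_id, expected)

        # Verify by reading back from the target, not by trusting the write.
        want = fingerprint(expected)
        got = fingerprint(read_target(chunk_id))
        status = "verified" if want == got else "mismatch"
        if status != "verified":
            failed.append(chunk_id)

        db.execute(
            "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?)",
            (chunk_id, status, got[0], got[1]),
        )
        db.commit()  # checkpoint after every chunk
    return failed
```

Because the ledger commits after every chunk, a crash at hour 9 costs at most the chunk in flight, and a rerun with the same chunk list skips everything already verified. Cutover can then be gated on the ledger showing every chunk as verified. This sketch does not cover events that land during the backfill; those would need a separate path (for example dual writes or a catch-up pass from a fixed watermark), which is one of the trade-offs worth naming in the interview.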
