Code Room
System design · Hard
Question
Design a lakehouse for clickstream + ad-impression data: ~20 TB/day landing as ~80M small JSON files/day from edge collectors, queried by ad-hoc Spark/Trino analysts, a daily attribution batch job, and near-real-time spend dashboards. Requirements: ACID upserts (late and corrected impressions arrive for up to 14 days), time-travel for audit, and analyst queries that scan a single day shouldn't read the whole table. Storage is object storage (S3/GCS). Design the table format, layout, and compaction strategy.
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
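As one way to start on "clarify scale", the figures in the question can be turned into a back-of-envelope sizing sketch. Everything below is derived from the stated 20 TB/day, 80M files/day, and 14-day correction window, except the 512 MB compaction target and the use of decimal units, which are assumptions made for illustration. The estimate also ignores any size reduction from converting JSON to a columnar format, so the real compacted file count would likely be lower.

```python
# Back-of-envelope sizing from the numbers stated in the question.
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes
SECONDS_PER_DAY = 86_400

daily_bytes = 20 * TB          # from the question
daily_files = 80_000_000       # from the question
upsert_window_days = 14        # from the question

# Average landed file size and sustained arrival rates.
avg_file_bytes = daily_bytes / daily_files
files_per_second = daily_files / SECONDS_PER_DAY
ingest_mb_per_second = daily_bytes / SECONDS_PER_DAY / MB

# Hypothetical compaction target (not given in the question).
target_file_bytes = 512 * MB
# Ceiling division: files per day after compaction, ignoring compression.
compacted_files_per_day = -(-daily_bytes // target_file_bytes)

# Data that can still receive late or corrected impressions.
mutable_bytes = daily_bytes * upsert_window_days

print(f"average landed file: {avg_file_bytes / 1000:.0f} KB")
print(f"arrival rate: {files_per_second:.0f} files/s, {ingest_mb_per_second:.0f} MB/s")
print(f"compacted files/day at 512 MB: {compacted_files_per_day}")
print(f"mutable window: {mutable_bytes / TB:.0f} TB")
```

The useful takeaways are the roughly 250 KB average landed file, which is far below what object storage and columnar readers handle efficiently, and the 280 TB of data that must remain cheaply updatable. Both numbers motivate the compaction and partitioning discussion the question asks for.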