Code Room
System design · Hard · sd-g095
Subject: Online inference · Level: Senior–Staff · ~45 min · Common in ML systems interviews · Industries: Technology

Question

Design the online inference serving layer for a 70B-parameter LLM behind a public API. Traffic is bursty and request shapes vary wildly: some requests generate 10 tokens, some generate 4000, and prompt lengths range from 50 to 100k tokens. You must maximize utilization of expensive GPUs while keeping interactive latency (time-to-first-token) acceptable. Design the request scheduler and batching strategy, explain why naive fixed-size batching fails here, and describe how you would autoscale a fleet whose warmup takes minutes.

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
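
As one illustration of going deep on a hard part, the sketch below compares fixed-size batching with a slot-refilling (often called continuous or iteration-level) scheduler on the output-length mix from the question. It is a toy model under loud assumptions, not a serving implementation: every active request produces one token per decode step, prefill cost and KV-cache memory limits are ignored, the batch has a fixed number of slots, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        # Short requests sit idle in their slots until the longest one finishes.
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when a freed slot is refilled before the next step.

    Assumes every request needs at least one output token.
    """
    waiting = deque(output_lens)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


def slot_utilization(output_lens, batch_size, steps):
    """Fraction of (step, slot) pairs that produced a useful token."""
    return sum(output_lens) / (steps * batch_size)


# Hypothetical workload: mostly 10-token replies with two 4000-token replies.
lens = [10, 4000, 10, 10, 4000, 10, 10, 10]
static_steps = static_batching_steps(lens, batch_size=4)
continuous_steps = continuous_batching_steps(lens, batch_size=4)
print(static_steps, round(slot_utilization(lens, 4, static_steps), 3))
print(continuous_steps, round(slot_utilization(lens, 4, continuous_steps), 3))
```

In this toy workload the fixed batches are held hostage by their longest member, so roughly three quarters of the slot-steps produce nothing. Refilling freed slots about halves the total step count, and the remaining idle capacity comes from the tail where only the long requests are still running, which is a natural point to discuss admission control and the trade-offs of mixing long and short requests.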
