System design · Hard · sd-g731

Subject: ML inference · Level: Senior–Staff · ~45 min · Common in ML systems interviews · Industries: Technology

Question

Design an LLM inference serving system for a 70B-parameter model behind a public API, using continuous (in-flight) batching. Requests are streaming chat completions with highly variable input and output lengths (some 50 tokens, some 4,000). You must maximize GPU throughput (tokens/sec/dollar) while honoring a per-request time-to-first-token SLO of 500 ms at p95 and keeping the token streaming rate smooth. Traffic is bursty, the model is too large for a single GPU and must be sharded with tensor and/or pipeline parallelism, and the same fleet serves a mix of short interactive requests and long-generation requests.
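To make "continuous (in-flight) batching" concrete, here is a minimal toy simulation of iteration-level scheduling: after every decode step, finished requests leave the batch and waiting requests are admitted into the freed capacity. This is a sketch under stated assumptions, not a real serving engine. The names `Request`, `run_continuous_batching`, and `max_batch_tokens` are hypothetical, prefill is treated as free, and admission reserves each request's worst-case token footprint up front.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0


def run_continuous_batching(requests, max_batch_tokens):
    """Toy iteration-level scheduler.

    Returns {request id: decode step at which the request finished}.
    Assumption: each admitted request reserves prompt + max_new tokens
    of KV-cache budget; prefill cost is ignored.
    """
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests (FIFO) while the token budget allows.
        used = sum(r.prompt_tokens + r.max_new_tokens for r in running)
        while waiting:
            nxt = waiting[0]
            need = nxt.prompt_tokens + nxt.max_new_tokens
            if used + need > max_batch_tokens:
                break
            running.append(waiting.popleft())
            used += need
        if not running:
            raise ValueError("head-of-line request exceeds the token budget")

        # One decode iteration: every running request emits one token.
        step += 1
        for r in running:
            r.generated += 1
            if r.generated >= r.max_new_tokens:
                finished_at[r.rid] = step

        # Finished requests leave immediately, freeing budget for the next step.
        running = [r for r in running if r.generated < r.max_new_tokens]
    return finished_at


if __name__ == "__main__":
    reqs = [Request("a", 50, 2), Request("b", 100, 5), Request("c", 50, 3)]
    print(run_continuous_batching(reqs, max_batch_tokens=200))
```

With a 200-token budget, request `c` cannot fit alongside `a` and `b`, but it is admitted as soon as `a` finishes at step 2, so the run returns `{'a': 2, 'b': 5, 'c': 5}`. Under static batching, `c` would wait for the whole first batch to drain at step 5 and finish at step 8.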

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
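One way to "clarify scale and constraints first" is a back-of-envelope KV-cache estimate, since cache capacity bounds how many sequences can be batched concurrently. The sketch below is illustrative only. Every number in it is an assumption that does not come from the question: a hypothetical 70B model shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128), 16-bit weights and cache, and a hypothetical replica of 8 GPUs with 80 GB each and 10% of memory reserved for activations and overhead.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_token_capacity(total_gpu_bytes, weight_bytes, reserve_pct, per_token_bytes):
    """Tokens of KV cache that fit after weights and a reserved percentage."""
    usable = total_gpu_bytes * (100 - reserve_pct) // 100 - weight_bytes
    return usable // per_token_bytes


if __name__ == "__main__":
    # All values below are assumptions for illustration, not from the question.
    per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128,
                                   bytes_per_elem=2)
    weights = 70 * 10**9 * 2          # 70B parameters at 2 bytes each
    total = 8 * 80 * 10**9            # 8 GPUs x 80 GB
    capacity = kv_token_capacity(total, weights, reserve_pct=10,
                                 per_token_bytes=per_token)
    print(per_token, capacity, capacity // 4000)
```

Under these assumptions the cache costs 327,680 bytes (320 KiB) per token, the replica holds about 1.33 million cached tokens, and that is enough for roughly 332 concurrent sequences at the 4,000-token extreme. A different model shape, precision, or GPU type changes these figures substantially, which is why the clarifying questions matter.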
