Question
Design an LLM inference-serving system for a 70B-parameter model behind a public API with continuous (in-flight) batching. Requests are streaming chat completions with highly variable input and output lengths (some 50 tokens, some 4,000). You must maximize GPU throughput (tokens/sec/dollar) while honoring a per-request time-to-first-token SLO of 500 ms at p95 and a smooth token streaming rate. Traffic is bursty, the model needs multiple GPUs (tensor/pipeline parallelism), and you serve a mix of short interactive and long-generation requests on the same fleet.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts (data model, bottlenecks, consistency, failure modes) and name the trade-offs you are making.