System design · Hard · sd-g362

Subject: RAG / LLM infra · Level: Senior–Staff · ~50 min · Common in: Distributed systems interviews · Industries: Technology, Software development

Question

Design the prompt/KV-cache layer for an LLM-serving system where most requests share a large, mostly-static prefix (a long system prompt + retrieved context) and differ only in a short user suffix. At 5k concurrent requests, recomputing the prefix's attention KV every time wastes most of the GPU. Walk through how you cache and reuse the prefix KV across requests, how you manage GPU memory as a finite resource shared by all in-flight sequences, and the correctness traps in reusing cached KV.
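To make "wastes most of the GPU" concrete, a back-of-envelope estimate of KV memory helps. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 4,000-token prefix and 100-token suffix are assumptions, not figures given in the question. Only the 5k concurrency comes from the prompt.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    # K and V each hold kv_heads * head_dim values per layer, hence the factor 2.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed model shape (hypothetical): 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)

# Assumed workload: 4,000-token shared prefix, 100-token suffix; 5,000 from the question.
prefix_tokens, suffix_tokens, concurrent = 4000, 100, 5000
GIB = 1024 ** 3

# Every request keeps its own copy of the prefix KV.
unshared = concurrent * (prefix_tokens + suffix_tokens) * per_token
# One shared copy of the prefix KV plus a private suffix per request.
shared = (prefix_tokens + concurrent * suffix_tokens) * per_token

print(f"per token: {per_token // 1024} KiB")
print(f"unshared:  {unshared / GIB:.0f} GiB")
print(f"shared:    {shared / GIB:.1f} GiB")
```

Under these assumed numbers, sharing the prefix cuts the KV footprint by roughly 40x, which is why the rest of the design centers on reuse and on treating GPU memory as a pool of blocks.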

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
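One possible data model for the "hard parts" is a block-level prefix cache. The sketch below is a minimal illustration, not a reference implementation: `PrefixKVCache`, `block_keys`, the 16-token block size, and the `namespace` string are all hypothetical names and values. It tracks only block bookkeeping (which physical slot holds which prefix block) and leaves the actual GPU tensors out. Keys are chained hashes, so a block is reused only when every token before it matches, and the namespace stands in for anything else that changes the KV values (model version, adapter, tokenizer).

```python
import hashlib
from collections import OrderedDict

BLOCK = 16  # tokens per KV block (assumed)

def block_keys(tokens, namespace):
    """Chained hashes: the key of block i covers all tokens up to and including it."""
    keys, prev = [], namespace.encode()
    full = len(tokens) - len(tokens) % BLOCK  # a trailing partial block is not cached
    for i in range(0, full, BLOCK):
        prev = hashlib.sha256(prev + repr(tokens[i:i + BLOCK]).encode()).digest()
        keys.append(prev)
    return keys

class PrefixKVCache:
    def __init__(self, capacity_blocks):
        self.free = list(range(capacity_blocks))  # physical slot ids
        self.blocks = OrderedDict()               # key -> [slot, refcount], in LRU order

    def _evict_one(self):
        # Evict the least recently used block that no in-flight request pins.
        for key, (slot, refs) in self.blocks.items():
            if refs == 0:
                del self.blocks[key]
                self.free.append(slot)
                return True
        return False

    def acquire(self, tokens, namespace):
        """Pin every full block; return (keys, number of leading tokens already cached)."""
        taken, created, hits, missed = [], [], 0, False
        for key in block_keys(tokens, namespace):
            entry = self.blocks.get(key)
            if entry is None:
                missed = True
                if not self.free and not self._evict_one():
                    # Roll back so no unfilled block is left looking like a cache hit.
                    for k in taken:
                        self.blocks[k][1] -= 1
                    for k in created:
                        self.free.append(self.blocks.pop(k)[0])
                    raise MemoryError("all KV blocks are pinned")
                entry = self.blocks[key] = [self.free.pop(), 0]
                created.append(key)
            elif not missed:
                hits += 1  # count only the contiguous leading run of hits
            entry[1] += 1
            self.blocks.move_to_end(key)
            taken.append(key)
        return taken, hits * BLOCK

    def release(self, keys):
        # Unpin; blocks stay cached until evicted.
        for key in keys:
            self.blocks[key][1] -= 1

cache = PrefixKVCache(capacity_blocks=8)
prefix = list(range(64))                     # 4 full blocks of shared prefix
a_keys, a_hit = cache.acquire(prefix + [1000] * 16, "model-v1")
b_keys, b_hit = cache.acquire(prefix + [2000] * 16, "model-v1")
print(a_hit, b_hit)                          # second request reuses the prefix blocks
```

The trade-offs worth naming against this sketch: a real system would also need a "filled" flag so a block is not reused before its KV is written, a policy for what to do when everything is pinned (queue, preempt, or swap to host memory), and probably an eviction order that drops the tail of a prefix chain before its head.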

Learn the concepts

Narrate your design
