On-call · Hard · oc-g641

Subject: Async concurrency backpressure
Level: Senior–Staff
Time: ~40 min
Common in: Concurrency · Algorithms & data structures interviews
Industries: Technology, Software development

Question

A streaming ingestion service consumes from Kafka and writes to a slower downstream store through an async, in-memory work queue that is bounded but very large. During a traffic surge it drifts toward OOM: RSS climbs steadily over ~20 minutes, GC pauses lengthen, the queue of in-flight async tasks grows into the hundreds of thousands, and consumer lag rises. Eventually pods are OOM-killed and restart, losing in-flight work, and the cycle repeats. The downstream store is healthy, but the service consumes records about 3x faster than the store can absorb writes. There was no deploy, just sustained higher input. Explain the failure and how you stabilize it.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
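One way to make the mitigation concrete is to cap in-flight work so the consumer stops pulling when the store falls behind. The sketch below is illustrative only: `run_pipeline`, `write`, and `max_in_flight` are hypothetical names, the record source is a plain iterable rather than a real Kafka client, and retries and error handling are left out. It assumes the service runs on an asyncio-style event loop.

```python
import asyncio


async def run_pipeline(records, write, max_in_flight=100):
    """Write every record while holding at most `max_in_flight` unfinished writes.

    Returns the peak number of concurrent writes observed, for inspection.
    """
    slots = asyncio.Semaphore(max_in_flight)
    tasks = set()
    in_flight = 0
    peak = 0

    async def handle(record):
        nonlocal in_flight
        try:
            await write(record)
        finally:
            # Free the slot even if the write failed, so the loop can proceed.
            in_flight -= 1
            slots.release()

    for record in records:
        # Backpressure point: when all slots are taken, the loop waits here
        # instead of pulling more input into memory.
        await slots.acquire()
        in_flight += 1
        peak = max(peak, in_flight)
        task = asyncio.create_task(handle(record))
        tasks.add(task)
        task.add_done_callback(tasks.discard)  # do not retain finished tasks

    if tasks:
        await asyncio.gather(*tasks)  # drain the remaining writes
    return peak


async def demo():
    written = []

    async def slow_write(record):
        # Stand-in for the slow downstream store (assumption for the demo).
        await asyncio.sleep(0.001)
        written.append(record)

    peak = await run_pipeline(range(50), slow_write, max_in_flight=5)
    return peak, len(written)
```

Calling `asyncio.run(demo())` returns the peak concurrency and the number of records written. The point of the design is that memory is bounded by `max_in_flight` rather than by how far the input outruns the store, so the excess shows up as consumer lag in Kafka, where it is durable, instead of as heap growth. With a real Kafka consumer the equivalent move is usually to pause fetching rather than block the poll loop for long, since a stalled poll loop can trigger a group rebalance.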
