Question
A streaming ingestion service consumes from Kafka and writes to a slower downstream store through an async, in-memory work queue that is bounded but large. During a traffic surge it slowly heads toward OOM: RSS climbs steadily over ~20 minutes, GC pauses lengthen, the in-flight async task queue grows into the hundreds of thousands, and consumer lag rises. Eventually pods are OOM-killed and restart, losing in-flight work, and then the cycle repeats. The downstream store is healthy, but it sustains only about a third of the rate at which the service consumes (roughly a 3x mismatch). There was no deploy, just sustained higher input. Explain the failure and how you would stabilize it.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
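To make the scenario concrete, here is a minimal sketch of the queue arithmetic. It is a simplified per-second fluid model, not the service's actual code. The rates (300 records/s consumed, 100 records/s written) and the 1,000-entry bound are hypothetical numbers chosen only to match the roughly 3x mismatch and the ~20-minute window in the question; `simulate` and `max_queue` are illustrative names.

```python
def simulate(consume_rate, write_rate, seconds, max_queue=None):
    """Return the in-memory queue depth after each second.

    consume_rate: records/s pulled from Kafka (hypothetical)
    write_rate:   records/s the downstream store absorbs (hypothetical)
    max_queue:    if set, the consumer stops admitting work once the
                  queue is full (backpressure); otherwise it never pauses
    """
    depth = 0
    history = []
    for _ in range(seconds):
        admitted = consume_rate
        if max_queue is not None:
            # Admit only what fits this second; the remainder stays in
            # Kafka as consumer lag instead of sitting on the heap.
            admitted = min(consume_rate, max_queue - depth + write_rate)
        depth = max(0, depth + admitted - write_rate)
        history.append(depth)
    return history


# 20 minutes of sustained surge with an effectively unbounded queue.
unbounded = simulate(consume_rate=300, write_rate=100, seconds=1200)

# Same surge, but the consumer is paused whenever 1,000 items are queued.
bounded = simulate(consume_rate=300, write_rate=100, seconds=1200, max_queue=1000)

print(unbounded[-1])  # 240000
print(max(bounded))   # 1000
```

Under these assumed numbers the unpaused consumer accumulates 200 records every second, which lands in the hundreds of thousands after 20 minutes, while the bounded variant holds memory flat and lets the backlog show up as Kafka lag, where it survives a pod restart.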