On-call · Medium · oc-g385

Subject: Backlog buildup
Level: Mid–Senior
~35 min
Common in: Concurrency · Distributed systems interviews
Industries: Technology

Question

A Kafka topic `webhook-delivery` (8 partitions) feeds a consumer group running on Kubernetes with an HPA targeting CPU. At 19:00 a marketing send triples inbound volume; lag climbs to 1.2M and keeps rising. Dashboards: the HPA scaled the deployment from 8 to 30 pods, CPU per pod is now LOW (~25%), yet lag is NOT draining. Only 8 pods show assigned partitions / non-zero throughput; the other 22 are idle. No errors, no rebalance thrashing. How do you triage and mitigate?
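The dashboard numbers in the question can be sanity-checked with a few lines of arithmetic. The sketch below assumes the standard Kafka consumer-group rule that each partition is assigned to at most one consumer in the group, so members beyond the partition count receive no assignment. The function names and the message rates in the usage example are hypothetical, chosen only for illustration; they do not come from the scenario.

```python
from typing import Tuple


def active_and_idle(partitions: int, pods: int) -> Tuple[int, int]:
    """Return (active, idle) consumer counts for one consumer group.

    Assumes one consumer per pod and at most one consumer per partition,
    so useful parallelism is capped by the partition count.
    """
    active = min(partitions, pods)
    idle = max(pods - partitions, 0)
    return active, idle


def net_drain_rate(inbound_per_s: float, per_consumer_per_s: float,
                   partitions: int, pods: int) -> float:
    """Messages per second by which lag shrinks (negative means lag grows).

    per_consumer_per_s is a hypothetical steady-state throughput for a
    single consumer; real throughput depends on the downstream webhook.
    """
    active, _ = active_and_idle(partitions, pods)
    return active * per_consumer_per_s - inbound_per_s


if __name__ == "__main__":
    # Numbers from the scenario: 8 partitions, 30 pods after HPA scale-out.
    print(active_and_idle(8, 30))

    # Hypothetical rates: adding pods past 8 does not change the result.
    print(net_drain_rate(3000, 300, partitions=8, pods=8))
    print(net_drain_rate(3000, 300, partitions=8, pods=30))
```

Under these assumptions the scale-out from 8 to 30 pods leaves the drain rate unchanged, which is consistent with the 22 idle pods on the dashboard. It suggests looking at the partition count and per-consumer throughput rather than at pod count.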

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
