Code Room
On-callMedium
Question
A Kafka topic `webhook-delivery` (8 partitions) feeds a consumer group running on Kubernetes with an HPA targeting CPU. At 19:00 a marketing send triples inbound volume; lag climbs to 1.2M and keeps rising. Dashboards: the HPA scaled the deployment from 8 to 30 pods, CPU per pod is now LOW (~25%), yet lag is NOT draining. Only 8 pods show assigned partitions / non-zero throughput; the other 22 are idle. No errors, no rebalance thrashing. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.