Code Room
On-callHardoc-g021
Subject Query of deathLevel Senior–Staff~40 minCommon in Databases & SQL interviewsIndustries Technology, Software development

Question

At 17:05 your Elasticsearch-backed search service starts crash-looping: data nodes hit OOM and restart, the cluster goes yellow then red, and as each node restarts it dies again within a minute. APM shows that just before each crash there's a single very expensive query — a deeply nested aggregation with a huge terms cardinality and a wide date range — submitted repeatedly by one client. Because it's persisted to a saved-search dashboard, it keeps getting re-issued. How do you triage and break the loop?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Diagram & narrate the incident
Loading whiteboard…
Run or narrate your approach, then ask the coach.