Question
A JVM cache/search service runs G1GC with a 64 GB heap that holds a large in-memory index plus a request-scoped working set. Young and mixed GC pauses are fine (under 20 ms). But a few times an hour, p99 latency spikes to 2-4 s, and the GC logs show either a Full GC or a long concurrent cycle that ends in a Full GC, often preceded by 'to-space exhausted' / 'Humongous Allocation' messages and a climbing 'Humongous regions' count. The service occasionally builds large byte[] buffers (multi-MB serialized result pages) per request. Heap usage is high but stable, with no unbounded growth (no leak). How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
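
One signal worth confirming early is whether the multi-MB result buffers actually cross G1's humongous threshold. The sketch below assumes the documented G1 behavior that an object of roughly half a region or more is allocated as humongous and occupies whole contiguous regions. The class name HumongousCheck, its method names, and the 16-byte array header are illustrative assumptions, not part of any JVM API; the real region size should be read from the running JVM (for example from the G1HeapRegionSize value it reports) rather than guessed.

```java
// Illustrative helper: estimates whether a byte[] of a given length would be
// treated as a humongous allocation by G1 for a given region size.
final class HumongousCheck {
    // Assumed byte[] header size on a 64-bit HotSpot JVM; illustrative only.
    static final long ARRAY_HEADER_BYTES = 16;

    // G1 treats objects of about half a region or more as humongous.
    static long humongousThresholdBytes(long regionSizeBytes) {
        return regionSizeBytes / 2;
    }

    // True if a byte[] with this payload length likely crosses the threshold.
    static boolean isLikelyHumongous(long payloadBytes, long regionSizeBytes) {
        return payloadBytes + ARRAY_HEADER_BYTES >= humongousThresholdBytes(regionSizeBytes);
    }

    // Humongous objects occupy whole regions, so round up to a region count.
    static long regionsOccupied(long payloadBytes, long regionSizeBytes) {
        long total = payloadBytes + ARRAY_HEADER_BYTES;
        return (total + regionSizeBytes - 1) / regionSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long buffer = 8 * mb; // hypothetical serialized result page
        for (long region : new long[] {4 * mb, 32 * mb}) {
            System.out.println("region=" + region / mb + "MB"
                + " humongous=" + isLikelyHumongous(buffer, region)
                + " regions=" + regionsOccupied(buffer, region));
        }
    }
}
```

Under these assumptions, an 8 MB buffer is far below the threshold with 32 MB regions but is humongous with 4 MB regions, where it also spills a few header bytes into a third region. Comparing the actual buffer sizes against the actual region size in this way helps separate "buffers are humongous" from other causes of the Full GCs before changing any flags.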