On-call | Hard | oc-g334

Subject: Data loss | Level: Senior–Staff | ~35 min | Common in: Reliability & on-call interviews | Industries: Technology

Question

Your Redis cluster stores two kinds of keys: ephemeral cache entries (with TTLs) and a set of *durable* keys the team treats as a small system-of-record (feature flags, per-tenant config, in-flight idempotency tokens) that are written with no TTL. Since a traffic ramp this week, on-call has reported sporadic 'config not found' errors and double-charged orders. Dashboards show Redis memory pinned at `maxmemory`, `evicted_keys` climbing steadily, and a falling hit rate. Context: the cluster's `maxmemory-policy` is `allkeys-lru`. How do you triage, stop the data loss, and recover?
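To make the scenario concrete, here is a toy model of the two eviction policies in play. It is a sketch under loud assumptions: capacity is counted in keys rather than bytes, LRU is exact (real Redis approximates LRU by sampling), and `simulate` and the key names are hypothetical, not part of any Redis client. It only illustrates that `allkeys-lru` treats no-TTL keys as eviction candidates, while `volatile-lru` considers only keys that have a TTL.

```python
from collections import OrderedDict


def simulate(policy, capacity, writes):
    """Toy eviction model (assumption: capacity in keys, exact LRU).

    writes: iterable of (key, has_ttl) pairs, applied in order.
    Returns (set of surviving keys, list of evicted keys).
    """
    store = OrderedDict()  # key -> has_ttl, least recently used first
    evicted = []
    for key, has_ttl in writes:
        if key in store:
            store[key] = has_ttl
            store.move_to_end(key)  # a rewrite counts as a recent access
            continue
        while len(store) >= capacity:
            # allkeys-lru: any key is a candidate.
            # volatile-lru: only keys with a TTL are candidates.
            victim = next(
                (k for k, ttl in store.items()
                 if policy == "allkeys-lru" or ttl),
                None,
            )
            if victim is None:
                # Mirrors Redis rejecting the write when nothing is evictable.
                raise MemoryError("no evictable key; write rejected")
            del store[victim]
            evicted.append(victim)
        store[key] = has_ttl
    return set(store), evicted


# Two durable (no-TTL) keys, then a burst of cache writes during the ramp.
writes = [("flag:checkout", False), ("idem:order-1", False)]
writes += [(f"cache:{i}", True) for i in range(10)]

for policy in ("allkeys-lru", "volatile-lru"):
    survivors, evicted = simulate(policy, capacity=4, writes=writes)
    print(policy, "durable keys evicted:",
          [k for k in evicted if not k.startswith("cache:")])
```

In this model the durable keys are the first to go under `allkeys-lru` because they are the least recently touched, which matches the 'config not found' and double-charge symptoms. Under `volatile-lru` they survive, at the cost of write errors once no TTL-bearing keys remain to evict.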

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
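One way to ground the "real signals" step is to compare two snapshots of Redis `INFO` fields taken a short interval apart. The sketch below is illustrative only: `triage` is a hypothetical helper, the 95% "at limit" threshold is an assumption, and the snapshots are plain dicts so the logic can be read without a live cluster. The field names (`used_memory`, `maxmemory`, `evicted_keys`) are the ones Redis reports in `INFO`.

```python
def triage(before, after, policy, at_limit_ratio=0.95):
    """Summarize eviction risk from two INFO snapshots (hypothetical helper).

    before/after: dicts with 'used_memory', 'maxmemory', 'evicted_keys'.
    policy: the node's maxmemory-policy string.
    at_limit_ratio: assumed threshold for "pinned at maxmemory".
    """
    maxmemory = after["maxmemory"]
    # maxmemory == 0 means no limit is configured, so nothing is "pinned".
    at_limit = maxmemory > 0 and after["used_memory"] >= at_limit_ratio * maxmemory
    evicted_delta = after["evicted_keys"] - before["evicted_keys"]
    # allkeys-* policies may evict keys that have no TTL.
    durable_at_risk = evicted_delta > 0 and policy.startswith("allkeys-")
    return {
        "at_limit": at_limit,
        "evicted_delta": evicted_delta,
        "durable_at_risk": durable_at_risk,
    }


# Illustrative numbers only, not taken from a real incident.
before = {"used_memory": 990, "maxmemory": 1000, "evicted_keys": 100}
after = {"used_memory": 1000, "maxmemory": 1000, "evicted_keys": 250}
print(triage(before, after, "allkeys-lru"))
```

If `durable_at_risk` is true, the candidate mitigations are to raise `maxmemory` or add capacity, or to switch the policy to a `volatile-*` variant with `CONFIG SET maxmemory-policy`. The second option protects no-TTL keys but can turn eviction into write errors once only no-TTL keys remain, so it is worth stating that trade-off out loud. In a cluster, these fields are per node, so the check would need to run against each one.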
