Code Room
On-call · Hard
Question
Your Redis cluster stores two kinds of keys: ephemeral cache entries (with TTLs) and a set of *durable* keys the team treats as a small system-of-record (feature flags, per-tenant config, in-flight idempotency tokens) that are written with no TTL. Since a traffic ramp this week, on-call reports sporadic 'config not found' and double-charged orders. Dashboards: Redis memory sits pinned at maxmemory, `evicted_keys` is climbing steadily, hit rate dropped. Context: the cluster's `maxmemory-policy` is `allkeys-lru`. How do you triage, stop the data loss, and recover?
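The failure mode in the scenario can be illustrated with a toy model. The sketch below is not how Redis implements eviction (Redis samples keys and approximates LRU, and accounts memory in bytes); it assumes exact LRU and one unit of memory per key. The `simulate` function and the key names are hypothetical. It shows that under `allkeys-lru` keys without a TTL are eviction candidates like any other, whereas under `volatile-lru` only keys with a TTL are considered.

```python
from collections import OrderedDict


def simulate(policy, capacity, writes):
    """Toy eviction model: exact LRU, one memory unit per key (an assumption).

    writes is a list of (key, has_ttl) pairs.
    Returns (surviving_keys, evicted_keys).
    Raises MemoryError when the store is full and nothing is evictable,
    loosely mimicking the error Redis returns on writes in that situation.
    """
    store = OrderedDict()  # key -> has_ttl, least recently used first
    evicted = []
    for key, has_ttl in writes:
        if key in store:
            store.move_to_end(key)  # rewriting a key refreshes its recency
            store[key] = has_ttl
            continue
        while len(store) >= capacity:
            if policy == "allkeys-lru":
                candidates = list(store)  # every key is a candidate
            elif policy == "volatile-lru":
                candidates = [k for k, ttl in store.items() if ttl]  # TTL keys only
            else:
                raise ValueError(f"unsupported policy: {policy}")
            if not candidates:
                raise MemoryError("no evictable keys")
            victim = candidates[0]  # least recently used candidate
            del store[victim]
            evicted.append(victim)
        store[key] = has_ttl
    return list(store), evicted


# Two durable keys (no TTL) written first, then a burst of cache entries with TTLs.
writes = [("flag:checkout", False), ("idem:order-1", False)]
writes += [(f"cache:{i}", True) for i in range(5)]

all_survivors, all_evicted = simulate("allkeys-lru", 4, writes)
vol_survivors, vol_evicted = simulate("volatile-lru", 4, writes)

print("allkeys-lru evicted: ", all_evicted)
print("volatile-lru evicted:", vol_evicted)
```

In this toy run the idempotency token is among the first keys evicted under `allkeys-lru`, which is consistent with the double-charge symptom: a retried request no longer finds its token and is processed again.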
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
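One way to "form hypotheses from real signals" is to compare two snapshots of the counters the dashboards already show. The sketch below assumes the snapshots are plain dicts using field names from Redis `INFO` output (`used_memory`, `maxmemory`, `evicted_keys`, `keyspace_hits`, `keyspace_misses`); the `triage` function, the 95% "at limit" threshold, and the sample numbers are hypothetical.

```python
def triage(before, after, policy):
    """Compare two INFO-style snapshots and flag whether durable keys are at risk.

    before/after: dicts with used_memory, maxmemory, evicted_keys,
    keyspace_hits, keyspace_misses. policy: the maxmemory-policy string.
    """
    evicted_delta = after["evicted_keys"] - before["evicted_keys"]
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    lookups = hits + misses
    hit_rate = hits / lookups if lookups else None

    # Assumption: within 5% of maxmemory counts as "pinned at the limit".
    at_limit = (
        after["maxmemory"] > 0
        and after["used_memory"] >= 0.95 * after["maxmemory"]
    )

    # allkeys-* policies may evict keys that have no TTL.
    evicts_no_ttl = policy.startswith("allkeys-")

    return {
        "evicted_delta": evicted_delta,
        "hit_rate": hit_rate,
        "at_memory_limit": at_limit,
        "durable_keys_at_risk": evicted_delta > 0 and evicts_no_ttl,
    }


# Hypothetical numbers, shaped like the scenario's dashboards.
before = {"used_memory": 3_990, "maxmemory": 4_000, "evicted_keys": 10_000,
          "keyspace_hits": 50_000, "keyspace_misses": 5_000}
after = {"used_memory": 4_000, "maxmemory": 4_000, "evicted_keys": 12_500,
         "keyspace_hits": 50_600, "keyspace_misses": 5_400}

report = triage(before, after, "allkeys-lru")
print(report)
```

If `durable_keys_at_risk` is true, plausible mitigations to weigh include raising `maxmemory` where the hosts have headroom, or switching the policy to a `volatile-*` variant with `CONFIG SET maxmemory-policy` on each node. The trade-off of the latter is that once no TTL keys remain to evict, writes start failing instead of silently dropping data. Recovery then means re-seeding flags and config from their source of truth and reconciling the duplicated charges; the lasting fix is to keep system-of-record data out of an evictable cache.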