Question
A singleton background worker (only one instance may run at a time, because it processes a payout ledger) coordinates leadership via a distributed lock in etcd. At 03:30 a network partition isolates the current lock-holder node from the etcd quorum for ~20 seconds. Dashboards show that during the window etcd reported the lease as expired, and a second worker instance acquired the lock and started processing; meanwhile the original holder, still partitioned, kept running because it never noticed it had lost the lease. Afterward you find duplicate payout entries in the ledger. There were no code deploys. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
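
To make the failure mode in the question concrete, here is a minimal sketch of one common guard, a fencing-token check combined with an idempotency key at the ledger. It is an illustration under stated assumptions, not the system's actual code. The names Ledger, record_payout, StaleLeaderError, and payout_id are hypothetical, the ledger is an in-memory stand-in, and no etcd client is used. The token is assumed to be a number that grows with each lock acquisition; in etcd a value such as the lock key's creation revision is often used for this, but the scenario does not say what is available.

```python
from dataclasses import dataclass, field


class StaleLeaderError(Exception):
    """Raised when a write carries an older fencing token than one already seen."""


@dataclass
class Ledger:
    # Hypothetical in-memory ledger; a real one would enforce this in the datastore.
    highest_token: int = 0
    entries: dict = field(default_factory=dict)

    def record_payout(self, token: int, payout_id: str, amount_cents: int) -> bool:
        # Fencing: reject any writer whose leadership has been superseded.
        if token < self.highest_token:
            raise StaleLeaderError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        # Idempotency: a repeated payout_id is a no-op, not a second entry.
        if payout_id in self.entries:
            return False
        self.entries[payout_id] = amount_cents
        return True


ledger = Ledger()

# Worker A holds the lock (assumed token 1) and writes normally.
ledger.record_payout(token=1, payout_id="p-100", amount_cents=5000)

# Partition: A's lease expires and worker B acquires the lock (assumed token 2).
ledger.record_payout(token=2, payout_id="p-101", amount_cents=7500)

# A is still running and unaware it lost the lease; its write is rejected.
try:
    ledger.record_payout(token=1, payout_id="p-101", amount_cents=7500)
    stale_write_rejected = False
except StaleLeaderError:
    stale_write_rejected = True
```

The two checks cover different orderings. If the stale holder writes after the new leader, the token comparison rejects it. If the stale holder writes first, its token is still the highest seen and the write is accepted, so the payout_id check is what keeps the new leader's retry of the same payout from creating a duplicate.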