On-call · Hard · oc-g088

Subject: Network partition
Level: Senior–Staff · ~45 min
Common in: Networking & APIs · Concurrency · Distributed systems interviews
Industries: Technology

Question

A singleton background worker (only one instance may run at a time — it processes a payout ledger) coordinates leadership via a distributed lock in etcd. At 03:30 a network partition isolates the current lock-holder node from the etcd quorum for ~20 seconds. Dashboards: during the window, etcd reports the lease as expired and a second worker instance acquires the lock and starts processing; meanwhile the original holder, still partitioned, also kept running because it never noticed it lost the lease. Afterward you find duplicate payout entries in the ledger. There were no code deploys. How do you triage and mitigate?
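The phrase "it never noticed it lost the lease" points at a missing holder-side check. One way a worker could notice on its own, without reaching etcd, is to track a local deadline from its last successful keepalive. The sketch below is a minimal illustration under assumptions: `LocalLeaseGuard`, `record_keepalive`, `may_work`, and the 2-second safety margin are hypothetical names and values, not part of any etcd client API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LocalLeaseGuard:
    """Holder-side estimate of when the lease can no longer be trusted.

    Hypothetical helper for illustration; not an etcd client class.
    """
    ttl_seconds: float
    safety_margin: float = 2.0  # assumed margin for clock drift and delay
    clock: Callable[[], float] = time.monotonic
    _deadline: float = field(default=0.0, init=False)

    def record_keepalive(self, sent_at: float) -> None:
        # Count from when the keepalive was SENT, not when the reply
        # arrived, so network delay can only shorten the trusted window.
        self._deadline = sent_at + self.ttl_seconds - self.safety_margin

    def may_work(self) -> bool:
        # Once the local deadline passes, assume the lease is gone and stop.
        return self.clock() < self._deadline
```

A worker would call `may_work()` before each unit of ledger work and stop as soon as it returns `False`. This narrows the window but does not close it: a process pause between the check and the write can still let a stale holder through, so the ledger itself still needs a way to reject stale writers.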

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
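For the "what prevents a repeat" part, a common answer is fencing tokens combined with idempotent writes: each lock acquisition carries a monotonically increasing token, and the ledger rejects any write whose token is older than the highest it has seen. The sketch below is an in-memory illustration under assumptions: `FencedLedger`, `StaleLeaderError`, and `payout_id` are hypothetical names, and the token is assumed to come from something monotonic such as the etcd revision at which the lock key was created.

```python
class StaleLeaderError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedLedger:
    """In-memory stand-in for a ledger that enforces fencing tokens.

    Hypothetical class for illustration; a real ledger would enforce
    the same two checks inside a single database transaction.
    """

    def __init__(self) -> None:
        self._highest_token = 0
        self._entries: dict[str, tuple[int, int]] = {}

    def append(self, token: int, payout_id: str, amount_cents: int) -> bool:
        # Fencing: a writer holding an older lock generation is rejected.
        if token < self._highest_token:
            raise StaleLeaderError(
                f"token {token} is older than {self._highest_token}"
            )
        self._highest_token = token
        # Idempotency: the same payout is recorded at most once.
        if payout_id in self._entries:
            return False
        self._entries[payout_id] = (token, amount_cents)
        return True

    def count(self) -> int:
        return len(self._entries)
```

In the incident described, the second worker would hold a newer token than the partitioned original, so the original's writes would fail at the ledger once the new holder had written. The idempotency key additionally covers payouts that both workers attempt.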

