The stale leader problem

A leader cut off by a partition may not know it was replaced — and keeps writing as if nothing changed.

The idea

In a replicated system, one node is elected leader and accepts all writes; the others follow. A network partition can isolate that leader while the rest of the cluster, no longer hearing from it, elects a new one.

If the old leader keeps accepting writes — not knowing it lost its job — you have two leaders at once. That is split-brain, and its writes will be lost or conflict.

The fix is fencing: every period of leadership carries a monotonically increasing epoch (term) number. Downstream storage and followers remember the highest epoch they have seen and reject any write stamped with a lower one.

See it work

Healthy cluster. Press play, or step through it.

How it works

Consensus protocols such as Raft and Paxos attach a number to every period of leadership — Raft calls it a term, Paxos a ballot, others an epoch. The number only ever increases. To win an election a candidate must gather votes from a majority (a quorum), and the new leader's term is strictly higher than the old one's. A node stuck on the minority side of a partition can never collect a majority, so it can never legitimately advance.

That number becomes a fencing token. Lock services like Google's Chubby and Apache ZooKeeper hand out a monotonically increasing token with each lease or lock grant. The protected resource — storage, a database, a file — remembers the highest token it has honoured and refuses anything older. The old leader can try to write, but its token is stale, so the write bounces.

def on_write(write, state):
    # state.maxEpoch = highest epoch storage has ever accepted
    if write.epoch < state.maxEpoch:
        return REJECT  # fenced: a newer leader exists
    state.maxEpoch = max(state.maxEpoch, write.epoch)
    apply(write)
    return ACCEPT

The token must cross the trust boundary: it travels with the write all the way to the storage layer, which is the one component that enforces it. The leader does not get to vouch for itself.

Cost / trade-offs

What you pay	Why
Detection lag	A new leader can't be elected until the old lease times out. Longer leases mean fewer false failovers but a longer split-brain window before fencing kicks in.
Cooperating storage	Fencing only works if the downstream resource carries a monotonic token and actually checks it. You can't bolt it onto a system that ignores epochs.
Minority unavailability	Requiring a majority quorum means the minority side of a partition can't make progress. This is the CAP choice: keep consistency, give up availability there.
Extra round trips	Heartbeats and lease renewals add steady background traffic and a small latency floor on writes that revalidate leadership.

Watch out for

Heartbeats alone are not enough. A missed heartbeat tells you the leader might be gone, not that it stopped writing. Without a fencing token the old leader can keep going.
GC pauses and long stalls. A live leader frozen by a multi-second garbage-collection pause looks dead, gets replaced, then wakes up and writes with a now-expired lease. This is the classic scenario fencing exists to defeat.
Clock-based leases drift. Leases that assume synchronised clocks break under skew — one node thinks its lease is valid while the cluster has already moved on. Prefer a logical epoch over wall-clock trust.
A token nobody checks does nothing. The downstream resource must enforce the token. If storage accepts whatever arrives, the fence is decorative.
Idempotency is not fencing. Last-writer-wins and idempotent retries dedupe repeats; they don't stop a stale leader from clobbering a newer leader's data. Only an epoch comparison does.

Worked example

A lock service grants client A a lease on a shared file, with fencing token 33. A starts a long write but pauses on a garbage-collection stall. While A is frozen, the lease expires; the service grants client B the lease with the next token, 34. B writes successfully — storage records maxEpoch = 34.

Now A wakes up, still believing it holds the lock, and sends its write stamped with token 33. Storage compares: 33 < 34, so it rejects the write. A learns it was fenced. No corruption, no lost B-write, no split-brain — even though both clients briefly thought they were the leader.

Check yourself

1. A leader misses several heartbeats, so the cluster elects a new one. The old leader's network recovers and it immediately accepts a client write. What actually stops that write from corrupting data?

2. Your storage layer happily accepts every write that reaches it and never inspects the epoch token. What does fencing buy you?