Mutex contention

When many threads all want the same lock, they line up and wait — and the waiting, not the work, becomes the bottleneck.

The idea

A mutex (mutual-exclusion lock) is a turnstile in front of shared data: only one thread can pass at a time. The stretch of code it guards is the critical section. While one thread is inside, everyone else who wants in has to wait.

That waiting is contention. The lock turns parallel work into a single-file line: cores sit idle while threads take turns. The longer each thread holds the lock — or the more threads pile up behind it — the longer the queue and the worse the slowdown. The lock itself is cheap; the line is what costs you.

How it works

A mutex gives you exactly two operations, acquire and release, and guarantees that between them no other thread can run code guarded by the same lock. In Python the idiomatic form is a with lock: block, which acquires on entry and releases on exit (even if the body raises). The golden rule: keep the critical section as small as possible, and do the slow stuff (compute, I/O) outside the lock.

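import threading

lock = threading.Lock()
counter = 0

def do_independent_work():
    # Hypothetical placeholder: stands in for compute or I/O that needs no shared state.
    return "payload"

def log(payload):
    # Hypothetical placeholder: stands in for per-thread output.
    pass

def worker():
    global counter
    for _ in range(100_000):
        payload = do_independent_work()   # no lock held, so other threads are not blocked

        with lock:                        # critical section starts here
            counter += 1                  # the ONLY thing that must be serialized
        # lock released automatically, even on exception

        log(payload)                      # back outside the lock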

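Everything outside with lock: can overlap with other threads. Only the one line that touches shared state is serialized. (In CPython the global interpreter lock still serializes pure-Python bytecode, so real multi-core overlap comes from I/O, C extensions that release the GIL, or a free-threaded build.) Widen that block, say by moving do_independent_work() inside it, and every other thread waits longer.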
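
To make the cost of a wide critical section concrete, here is a minimal sketch that runs the same workload twice: once with the slow step inside the lock, once outside. The run helper and its parameters are illustrative assumptions, and time.sleep stands in for slow I/O (it releases the GIL, so threads really do overlap).

```python
import threading
import time

def run(workers, iterations, work_inside_lock):
    """Return (final_count, elapsed_seconds) for one configuration."""
    lock = threading.Lock()
    count = 0

    def worker():
        nonlocal count
        for _ in range(iterations):
            if work_inside_lock:
                with lock:
                    time.sleep(0.005)   # slow work done while holding the lock
                    count += 1
            else:
                time.sleep(0.005)       # slow work done with no lock held
                with lock:
                    count += 1          # only the shared update is serialized

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count, time.perf_counter() - start

wide_count, wide_time = run(workers=4, iterations=10, work_inside_lock=True)
narrow_count, narrow_time = run(workers=4, iterations=10, work_inside_lock=False)
print(f"wide:   {wide_count} increments in {wide_time:.3f}s")
print(f"narrow: {narrow_count} increments in {narrow_time:.3f}s")
```

Both runs finish with the same count. The wide version cannot finish in less than 4 × 10 × 5 ms = 0.2 s because every sleep happens in single file, while the narrow version lets the sleeps overlap. Exact timings depend on the machine.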

Cost

Scenario	Effect
Uncontended acquire	~tens of nanoseconds — a cheap atomic compare-and-swap, no kernel call
Contended acquire	Thread parks and yields; wake-up adds a context switch (~1–5 µs), 100×+ the fast path
Cache-line bouncing	The lock word ping-pongs between cores' caches; each transfer stalls hundreds of cycles
Longer critical section	Wait time grows with hold time × number of waiters — the queue drains slower
Serialization ceiling	Amdahl's law: if a fraction s of work is locked, max speedup is capped at `1/s` no matter how many cores you add

If 10% of total work sits inside one shared lock, you can never go more than 10× faster — extra cores just queue up behind it.
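
The ceiling can be checked with a few lines of arithmetic. This sketch uses the standard finite-core form of Amdahl's law, speedup = 1 / (s + (1 - s) / n), which approaches 1/s as n grows. The function name is an illustrative choice.

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup when serial_fraction of the work is serialized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# 10% of the work sits behind one lock; watch the returns diminish.
for cores in (2, 4, 16, 64, 1024):
    print(f"{cores:>5} cores -> {amdahl_speedup(0.10, cores):.2f}x")
```

With s = 0.10, 16 cores give at most 6.4×, and no core count reaches 10×.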

Watch out for

Holding the lock during slow work. Doing I/O, a network call, or a heavy computation inside the critical section freezes every waiter for that whole duration. Prepare data outside, lock only to publish the result.
Granularity that's too coarse — or too fine. One giant lock around everything serializes unrelated work. But splitting into many tiny locks invites deadlock (lock-ordering bugs) and per-lock overhead. Aim for one lock per independently-contended resource.
False sharing and cache-line bouncing. Even a lock-free counter hurts if two cores keep writing variables on the same cache line (typically 64 bytes): the line bounces between caches. Pad hot per-thread data onto its own line.
Thundering herd on wake. Releasing a lock that dozens of threads wait on can wake them all to race for it; all but one fail and re-park. Prefer fair FIFO queueing or a single targeted wake.
Forgetting to release on exception. A manual acquire() without a matching release() in a finally deadlocks the program if the body throws. Use with lock: (Python) or RAII guards (std::lock_guard in C++) so release is automatic.
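
The last pitfall is easy to demonstrate. The sketch below raises inside a with lock: block and then checks that the lock is free again. It also shows the manual acquire/try/finally form for comparison. risky_update is a hypothetical name used only for illustration.

```python
import threading

lock = threading.Lock()

def risky_update():
    with lock:                      # released on exit, even when the body raises
        raise ValueError("simulated failure inside the critical section")

try:
    risky_update()
except ValueError:
    pass

released_after_with = not lock.locked()   # the with block released the lock

# Manual form: the finally clause is what makes it safe.
lock.acquire()
try:
    pass                            # critical section would go here
finally:
    lock.release()

released_after_finally = not lock.locked()
print(released_after_with, released_after_finally)
```

Without the finally clause in the manual form, an exception in the body would leave the lock held, and the next thread to call acquire() would block forever.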

Worked example

A busy web server counts every request behind one global mutex: with request_lock: total_requests += 1. At low traffic this is invisible. At 50,000 requests a second across 16 cores, every request now serializes on that single lock — the increment is trivial, but the lock word bounces between all 16 cores' caches and threads pile into the queue. Profiling shows most CPU time spent waiting, and throughput plateaus far below what the hardware can do.

The fix is to stop sharing the hot path. Give each shard (or each core) its own counter, each written by only one thread, so that counters[shard_id] += 1 needs no lock, and sum them only when someone actually reads the total. Or use an atomic fetch-and-add, which avoids the lock queue for a single integer, although the counter's cache line still bounces between cores. Either way, threads no longer park behind one lock that everyone must take; with per-shard counters the shared write disappears altogether.
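
A minimal sketch of the sharded-counter fix, assuming one shard per worker thread so that each slot has exactly one writer. The names NUM_WORKERS, REQUESTS_PER_WORKER, handle_requests and total_requests are illustrative, not taken from any real server.

```python
import threading

NUM_WORKERS = 8
REQUESTS_PER_WORKER = 10_000

# One slot per worker; each slot is written by exactly one thread, so no lock.
counters = [0] * NUM_WORKERS

def handle_requests(shard_id):
    for _ in range(REQUESTS_PER_WORKER):
        counters[shard_id] += 1     # private to this thread's shard

def total_requests():
    """Sum the shards on demand; a snapshot if writers are still running."""
    return sum(counters)

threads = [
    threading.Thread(target=handle_requests, args=(i,))
    for i in range(NUM_WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(total_requests())
```

The read path pays for the sum, which is the right trade when writes vastly outnumber reads. In a language with control over memory layout, each slot would also be padded to its own cache line to avoid false sharing. A Python list gives no such control, so this sketch shows only the locking structure.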

Check yourself

Two threads each do about 90% of their work inside the same lock. You add a third thread to speed things up. What happens?

Your critical section makes a slow database call while holding the lock. What's the most effective fix?