Mutex contention

When many threads all want the same lock, they line up and wait — and the waiting, not the work, becomes the bottleneck.

The idea

A mutex (mutual-exclusion lock) is a turnstile in front of shared data: only one thread can pass at a time. The stretch of code it guards is the critical section. While one thread is inside, everyone else who wants in has to wait.

That waiting is contention. The lock turns parallel work into a single-file line: cores sit idle while threads take turns. The longer each thread holds the lock — or the more threads pile up behind it — the longer the queue and the worse the slowdown. The lock itself is cheap; the line is what costs you.

How it works

A mutex gives you exactly two operations — acquire and release — and guarantees that between them no other thread runs the guarded code. In Python the idiomatic form is a with lock: block, which acquires on entry and releases on exit (even if the body raises). The golden rule: keep the critical section as small as possible — do the slow stuff (compute, I/O) outside the lock.

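import threading

lock = threading.Lock()
counter = 0

def do_independent_work():
    # Placeholder: per-iteration work that touches no shared state.
    return 1

def log(payload):
    # Placeholder: output that does not need the lock.
    pass

def worker():
    global counter
    for _ in range(100_000):
        payload = do_independent_work()   # no lock held, other threads are not blocked

        with lock:                        # critical section starts here
            counter += 1                  # the ONLY thing that must be serialized
        # lock released automatically, even on exception

        log(payload)                      # back outside the lock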

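Everything outside with lock: can overlap with other threads; only the one line that touches shared state is serialized. (In CPython the global interpreter lock still serializes pure-Python bytecode, so the overlap comes from I/O, native code that releases the GIL, or a free-threaded build.) Widen that block, say by moving do_independent_work() inside it, and every other thread waits longer.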
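To make the cost of widening the block concrete, here is a rough sketch that times both layouts. It stands in for do_independent_work() with time.sleep(), an assumption chosen because sleeping releases CPython's interpreter lock the way blocking I/O does. The thread and iteration counts and the names run and hold_lock_during_work are illustrative only.

```python
import threading
import time

WORK_SECONDS = 0.01   # assumed cost of one unit of independent work
THREADS = 4
ITERATIONS = 5

def run(hold_lock_during_work):
    """Run THREADS workers and return (elapsed seconds, final counter)."""
    lock = threading.Lock()
    counter = 0

    def worker():
        nonlocal counter
        for _ in range(ITERATIONS):
            if hold_lock_during_work:
                with lock:                     # wide: the work happens under the lock
                    time.sleep(WORK_SECONDS)
                    counter += 1
            else:
                time.sleep(WORK_SECONDS)       # narrow: the work happens outside
                with lock:
                    counter += 1

    threads = [threading.Thread(target=worker) for _ in range(THREADS)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, counter

narrow_time, narrow_count = run(hold_lock_during_work=False)
wide_time, wide_count = run(hold_lock_during_work=True)
print(f"narrow: {narrow_time:.3f}s  wide: {wide_time:.3f}s")
```

With these numbers the wide layout cannot finish in less than 4 × 5 × 0.01 = 0.2 s, because every sleep happens while the lock is held. The narrow layout lets the sleeps overlap across threads, so it should finish in a fraction of that time.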

Cost

Scenario | Effect
Uncontended acquire | ~tens of nanoseconds: a cheap atomic compare-and-swap, no kernel call
Contended acquire | Thread parks and yields; wake-up adds a context switch (~1–5 µs), 100×+ the fast path
Cache-line bouncing | The lock word ping-pongs between cores' caches; each transfer stalls hundreds of cycles
Longer critical section | Wait time grows with hold time × number of waiters, so the queue drains slower
Serialization ceiling | Amdahl's law: if a fraction s of work is locked, max speedup is capped at 1/s no matter how many cores you add

If 10% of total work sits inside one shared lock, you can never go more than 10× faster — extra cores just queue up behind it.
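The arithmetic behind that ceiling can be checked in a few lines. This sketch uses the standard form of Amdahl's law, speedup = 1 / (s + (1 - s) / n), where s is the locked fraction and n is the number of cores. The function name amdahl_speedup is ours, chosen for illustration.

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup when serial_fraction of the work cannot overlap."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# 10% of the work sits behind one shared lock.
for cores in (2, 4, 16, 64, 1024):
    print(f"{cores:>5} cores -> {amdahl_speedup(0.10, cores):.2f}x")
```

At 16 cores the bound is already only 6.4×, and going from 64 to 1024 cores moves it from about 8.8× to about 9.9×: close to the 10× limit, never past it.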

Worked example

A busy web server counts every request behind one global mutex: with request_lock: total_requests += 1. At low traffic this is invisible. At 50,000 requests a second across 16 cores, every request now serializes on that single lock — the increment is trivial, but the lock word bounces between all 16 cores' caches and threads pile into the queue. Profiling shows most CPU time spent waiting, and throughput plateaus far below what the hardware can do.

The fix is to stop sharing the hot path. Give each shard (one per worker thread or core) its own counter, so that counters[shard_id] += 1 needs no lock because only one thread ever writes that slot, and sum the slots only when someone actually reads the total. Or use an atomic fetch-and-add, which for a single integer removes the lock and its wait queue (the counter's cache line still moves between cores, but no thread ever parks). Either way the lock contention disappears because there's no longer one lock that everyone must take.
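A minimal sketch of the per-shard approach, assuming one shard per worker thread so that each slot has exactly one writer. The names ShardedCounter, increment, total, and handle_requests are hypothetical and not taken from any library, and the shard and request counts are arbitrary.

```python
import threading

class ShardedCounter:
    """One slot per worker; each slot is written by exactly one thread."""

    def __init__(self, shards):
        self._counts = [0] * shards

    def increment(self, shard_id):
        # No lock: only the thread that owns shard_id ever writes this slot.
        self._counts[shard_id] += 1

    def total(self):
        # Sums every slot; a read taken mid-run may miss in-flight increments.
        return sum(self._counts)

SHARDS = 8
REQUESTS_PER_SHARD = 10_000
counter = ShardedCounter(SHARDS)

def handle_requests(shard_id):
    # Stand-in for a worker that serves requests and counts each one.
    for _ in range(REQUESTS_PER_SHARD):
        counter.increment(shard_id)

threads = [threading.Thread(target=handle_requests, args=(i,)) for i in range(SHARDS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.total())
```

The trade-off is that total() is only exact once the writers have finished; while they run it is an approximation, which is usually acceptable for a request counter. In a lower-level language the slots would typically also be padded onto separate cache lines so that neighbouring shards do not bounce the same line.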

Check yourself

Two threads each spend about 90% of their time inside the same lock. You add a third thread to speed things up. What happens?

Your critical section makes a slow database call while holding the lock. What's the most effective fix?