When many threads all want the same lock, they line up and wait — and the waiting, not the work, becomes the bottleneck.
A mutex (mutual-exclusion lock) is a turnstile in front of shared data: only one thread can pass at a time. The stretch of code it guards is the critical section. While one thread is inside, everyone else who wants in has to wait.
That waiting is contention. The lock turns parallel work into a single-file line: cores sit idle while threads take turns. The longer each thread holds the lock — or the more threads pile up behind it — the longer the queue and the worse the slowdown. The lock itself is cheap; the line is what costs you.
A mutex gives you exactly two operations — acquire and release — and guarantees that between them no other thread runs the guarded code. In Python the idiomatic form is a with lock: block, which acquires on entry and releases on exit (even if the body raises). The golden rule: keep the critical section as small as possible — do the slow stuff (compute, I/O) outside the lock.

    import threading

    lock = threading.Lock()
    counter = 0

    def do_independent_work():   # placeholder for real per-item work
        return None

    def log(payload):            # placeholder for real logging
        pass

    def worker():
        global counter
        for _ in range(100_000):
            payload = do_independent_work()  # no lock held here
            with lock:                       # critical section starts here
                counter += 1                 # the ONLY thing that must be serialized
            # lock released automatically, even on exception
            log(payload)                     # back outside the lock

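Everything outside with lock: can run concurrently; only the one line that touches shared state is serialized. (In standard CPython the global interpreter lock still serializes pure-Python bytecode, so the overlap across cores comes from work that releases it, such as I/O or native extensions, or from a free-threaded build.) Widen that block, say by moving do_independent_work() inside it, and every other thread waits longer.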
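To make the cost of a wide critical section concrete, here is a minimal sketch that times the same workload twice: once with the slow step outside the lock and once with it inside. The names simulated_work, run, N_THREADS, N_ITERS and WORK_SECONDS are illustrative assumptions, and time.sleep stands in for real work because it releases the GIL while it waits.

```python
import threading
import time

N_THREADS = 4
N_ITERS = 5
WORK_SECONDS = 0.01  # assumed cost of one unit of independent work


def simulated_work():
    # Hypothetical stand-in for do_independent_work(); sleep releases the GIL,
    # so several threads can be "working" at the same time.
    time.sleep(WORK_SECONDS)


def run(wide):
    """Run N_THREADS workers and return (elapsed seconds, final counter)."""
    lock = threading.Lock()
    state = {"counter": 0}

    def worker():
        for _ in range(N_ITERS):
            if wide:
                with lock:
                    simulated_work()       # slow work done while holding the lock
                    state["counter"] += 1
            else:
                simulated_work()           # slow work done with no lock held
                with lock:
                    state["counter"] += 1  # only the shared update is serialized

    threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, state["counter"]


narrow_time, narrow_count = run(wide=False)
wide_time, wide_count = run(wide=True)
print(f"narrow: {narrow_time:.3f}s  wide: {wide_time:.3f}s")
```

Both runs end with the same counter value, but the wide version cannot finish faster than N_THREADS × N_ITERS × WORK_SECONDS, because every sleep happens while the lock is held. The narrow version lets the sleeps overlap, so it should finish in roughly the time one thread needs on its own.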
| Scenario | Effect |
|---|---|
| Uncontended acquire | ~tens of nanoseconds — a cheap atomic compare-and-swap, no kernel call |
| Contended acquire | Thread parks and yields; wake-up adds a context switch (~1–5 µs), 100×+ the fast path |
| Cache-line bouncing | The lock word ping-pongs between cores' caches; each transfer stalls hundreds of cycles |
| Longer critical section | Wait time grows with hold time × number of waiters — the queue drains slower |
| Serialization ceiling | Amdahl's law: if a fraction s of work is locked, max speedup is capped at 1/s no matter how many cores you add |
If 10% of total work sits inside one shared lock, you can never go more than 10× faster — extra cores just queue up behind it.
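The ceiling can be checked numerically. The sketch below uses the standard finite-core form of Amdahl's law, speedup = 1 / (s + (1 - s) / n), where s is the serialized fraction and n the number of cores; the function name amdahl_speedup is an illustrative choice, not something from a library.

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup when serial_fraction of the work is serialized."""
    if not 0 < serial_fraction <= 1:
        raise ValueError("serial_fraction must be in (0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# With 10% of the work behind one lock, the speedup creeps toward 10x
# but never reaches it, however many cores are added.
for cores in (1, 2, 4, 16, 64, 1024):
    print(f"{cores:>5} cores -> {amdahl_speedup(0.10, cores):.2f}x")
```

With s = 0.10, 16 cores give a bound of 6.4× rather than 16×, and the bound stays below 10× for any core count. Real contention usually does worse than this bound, since it ignores the cost of parking threads and bouncing cache lines.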
Calling acquire() without a matching release() in a finally block leaves the lock held forever if the body raises, and every thread that later needs the lock blocks on it. Use with lock: (Python) or RAII guards (std::lock_guard in C++) so release is automatic.
A busy web server counts every request behind one global mutex: with request_lock: total_requests += 1. At low traffic this is invisible. At 50,000 requests a second across 16 cores, every request serializes on that single lock. The increment is trivial, but the lock word bounces between all 16 cores' caches and threads pile into the queue. Profiling shows most CPU time spent waiting, and throughput plateaus far below what the hardware can do.
The fix is to stop sharing the hot path. Give each shard (or each core) its own counter — counters[shard_id] += 1 with no lock — and sum them only when someone actually reads the total. Or use an atomic fetch-and-add, which avoids the queue entirely for a single integer. Either way the contention disappears because there's no longer one lock that everyone must take.
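The sharded-counter idea can be sketched as follows. ShardedCounter and count_requests are hypothetical names, and the sketch assumes each thread owns exactly one shard, so no two threads ever write the same slot. Python's standard library has no atomic integer type, so only the sharding variant is shown.

```python
import threading


class ShardedCounter:
    """One slot per shard; each writer touches only its own slot."""

    def __init__(self, n_shards):
        self._shards = [0] * n_shards

    def increment(self, shard_id):
        # No lock: safe only because a given shard_id has a single writer.
        self._shards[shard_id] += 1

    def total(self):
        # Shards are combined only when someone reads the total.
        return sum(self._shards)


def count_requests(n_threads, requests_per_thread):
    counter = ShardedCounter(n_threads)

    def handle(shard_id):
        for _ in range(requests_per_thread):
            counter.increment(shard_id)

    threads = [threading.Thread(target=handle, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.total()


total_requests = count_requests(n_threads=8, requests_per_thread=10_000)
print(total_requests)
```

The trade-off is on the read side: total() called while writers are still running returns a snapshot that may already be slightly stale, which is usually acceptable for a request counter. If several threads could map to the same shard, each shard would need its own lock or an atomic increment.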
Two threads each do about 90% of their work inside the same lock. You add a third thread to speed things up. What happens?
Your critical section makes a slow database call while holding the lock. What's the most effective fix?