Spinlocks, Mutexes, and Futexes: Picking the Right Lock

"Just use a mutex" is good advice 80% of the time. The other 20% — lock-heavy hot paths, kernel code, real-time constraints — you need to know what's actually happening underneath. Three primitives cover almost everything: spinlocks, mutexes, and futexes.

They solve the same problem differently, and the cost models are not interchangeable.

The Three Strategies

What does a thread do when the lock is contended?

Spinlock:  busy-wait, hot CPU, never sleep
Mutex:     ask the kernel to put me to sleep
Futex:     atomic fast path in userspace; ask the kernel only on contention

The right answer depends on how long you'll wait and whether burning a CPU is acceptable.

Spinlock

A single atomic flag, busy-waited.

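/* Plain int: the GCC/Clang __atomic builtins operate on ordinary objects. */
typedef struct { int locked; } spinlock_t;

void spin_lock(spinlock_t *s) {
    /* Try to take the lock; a nonzero old value means someone else holds it. */
    while (__atomic_exchange_n(&s->locked, 1, __ATOMIC_ACQUIRE)) {
        /* Wait with plain loads so the cache line stays shared until release. */
        while (__atomic_load_n(&s->locked, __ATOMIC_RELAXED))
            __builtin_ia32_pause();   /* x86-only spin-wait hint */
    }
}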

void spin_unlock(spinlock_t *s) {
    __atomic_store_n(&s->locked, 0, __ATOMIC_RELEASE);
}

Best when: critical section is shorter than the cost of a context switch (~1–10µs).
Worst when: you can be preempted while holding the lock — every other waiter spins doing nothing while the OS runs something else.

This is why kernel spinlocks disable preemption (and sometimes interrupts) on the holding CPU. Userspace spinlocks lack that guarantee, which is why they're rarely the right answer in user code.
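
As a rough illustration, here is a portable variant of the same test-and-test-and-set lock written against C11 <stdatomic.h>, plus a small pthread harness that hammers a shared counter. The names (c11_spinlock, run_spinlock_demo, the thread and iteration counts) are made up for this sketch, and it assumes a POSIX system with pthreads.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names; a sketch, not a production lock. */
typedef struct { atomic_int locked; } c11_spinlock;

static void c11_spin_lock(c11_spinlock *s) {
    /* Test-and-test-and-set: retry the exchange only when the flag looks free. */
    while (atomic_exchange_explicit(&s->locked, 1, memory_order_acquire)) {
        while (atomic_load_explicit(&s->locked, memory_order_relaxed)) {
#if defined(__x86_64__) || defined(__i386__)
            __builtin_ia32_pause();   /* x86 spin-wait hint; omitted elsewhere */
#endif
        }
    }
}

static void c11_spin_unlock(c11_spinlock *s) {
    atomic_store_explicit(&s->locked, 0, memory_order_release);
}

enum { NTHREADS = 4, ITERS = 100000 };

static c11_spinlock demo_lock;   /* static storage: zero-initialized, unlocked */
static long demo_counter;        /* protected by demo_lock */

static void *demo_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        c11_spin_lock(&demo_lock);
        demo_counter++;           /* a few instructions: the ideal spinlock payload */
        c11_spin_unlock(&demo_lock);
    }
    return NULL;
}

/* Runs the workers and returns the final count. */
long run_spinlock_demo(void) {
    pthread_t t[NTHREADS];
    demo_counter = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, demo_worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

If the lock is correct, run_spinlock_demo() returns NTHREADS * ITERS. With more runnable threads than cores, expect it to get noticeably slower, which is the preemption problem described above.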

Mutex (the naive way)

A flag plus a kernel-managed wait queue. Every acquire and release traps to the kernel, contended or not.

lock():    syscall → kernel checks flag → atomic CAS or sleep
unlock():  syscall → kernel wakes one waiter

Every acquire and release is a system call. That costs on the order of 100ns even when uncontended. Locks that always went through the kernel this way were a noticeable bottleneck in early threading implementations.

Futex: The Modern Lock

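A futex (fast userspace mutex) is the Linux primitive underneath glibc's pthread_mutex_t and the std::mutex implementations layered on it. Other platforms offer close equivalents, and Go's sync.Mutex applies the same idea at the goroutine level: an atomic fast path, with parking only on contention. The trick: uncontended acquire and release stay entirely in userspace.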

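// state: 0 = unlocked, 1 = locked, 2 = locked with possible waiters

lock():
    if CAS(state, 0, 1) succeeds:       // userspace, fast path
        return
    while exchange(state, 2) != 0:      // mark contended; old value 0 means we got it
        syscall futex(WAIT, state, 2)   // sleep only while state is still 2

unlock():
    if exchange(state, 0) == 1:         // userspace, fast path: no waiters
        return
    syscall futex(WAKE, 1)              // state was 2: wake one waiter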

The kernel only sees the lock when there's contention. This is why mutexes feel "free" when nobody's fighting over them — they almost are.

The full implementation (PI-aware, robust, recursive variants) is more complex, but the core insight is the userspace fast path.
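
To make the fast path concrete, here is a minimal Linux-only sketch of a futex-backed lock. It uses three states so that unlock can tell whether anyone might be asleep and skip the syscall otherwise. The type and function names (futex_mutex, futex_op, run_futex_demo) are invented for illustration; it assumes SYS_futex is available and that atomic_int has the 32-bit layout the kernel expects, which holds on common Linux targets. It leaves out everything a real mutex adds (timeouts, PI, robustness, error handling).

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* state: 0 = unlocked, 1 = locked, 2 = locked and waiters may be sleeping */
typedef struct { atomic_int state; } futex_mutex;

/* Thin wrapper: glibc does not export a futex() function. */
static long futex_op(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void futex_mutex_lock(futex_mutex *m) {
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(&m->state, &expected, 1,
            memory_order_acquire, memory_order_relaxed))
        return;                                      /* fast path: no syscall */
    /* Slow path: advertise a waiter. An old value of 0 means we took the lock. */
    while (atomic_exchange_explicit(&m->state, 2, memory_order_acquire) != 0)
        futex_op(&m->state, FUTEX_WAIT_PRIVATE, 2);  /* sleeps only if state == 2 */
}

void futex_mutex_unlock(futex_mutex *m) {
    /* Old value 1: nobody waited, so the release stays in userspace. */
    if (atomic_exchange_explicit(&m->state, 0, memory_order_release) == 2)
        futex_op(&m->state, FUTEX_WAKE_PRIVATE, 1);  /* wake one sleeper */
}

enum { FM_THREADS = 4, FM_ITERS = 100000 };

static futex_mutex fm_lock;   /* static storage: starts unlocked */
static long fm_counter;       /* protected by fm_lock */

static void *fm_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < FM_ITERS; i++) {
        futex_mutex_lock(&fm_lock);
        fm_counter++;
        futex_mutex_unlock(&fm_lock);
    }
    return NULL;
}

/* Runs the workers and returns the final count. */
long run_futex_demo(void) {
    pthread_t t[FM_THREADS];
    fm_counter = 0;
    for (int i = 0; i < FM_THREADS; i++)
        pthread_create(&t[i], NULL, fm_worker, NULL);
    for (int i = 0; i < FM_THREADS; i++)
        pthread_join(t[i], NULL);
    return fm_counter;
}
```

A woken thread re-marks the lock as contended (state 2) even if it was the last waiter. That can cost one unnecessary FUTEX_WAKE later, but it keeps the protocol simple and never loses a wakeup.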

Cost Comparison

Approximate, on a modern x86-64:

Operation                  Spinlock  Futex (mutex)  Old kernel mutex
Uncontended acquire        ~5ns      ~10ns          ~150ns
Uncontended release        ~3ns      ~10ns          ~150ns
Contended acquire (sleep)  (busy)    1–3µs          1–3µs
Wakeup latency             0         1–2µs          1–2µs

Spinlock wins on uncontended speed. Futex wins on contended behavior because waiters don't burn CPU.
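
Figures like these vary by CPU, kernel, and libc, so it is worth measuring on your own hardware. The sketch below times the uncontended lock/unlock pair of a default pthread_mutex_t from a single thread. The function name ns_per_lock_unlock is hypothetical, and the result includes loop and call overhead, so treat it as an upper bound rather than a precise cost.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Returns the average nanoseconds per uncontended lock+unlock pair. */
double ns_per_lock_unlock(long iterations) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&m);     /* never contended: only one thread */
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling ns_per_lock_unlock(10000000) a few times and taking the minimum gives a reasonable estimate of the fast path. Contended numbers need a multi-threaded harness and are much noisier.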

Adaptive Mutex

Real-world implementations cheat. An adaptive mutex (glibc offers one as the non-default PTHREAD_MUTEX_ADAPTIVE_NP type) typically:

Spins for a few hundred cycles, betting that the next unlock is imminent.
Falls back to futex_wait: if the lock is still held, sleep.

This is the best of both: short waits don't context-switch, long waits don't burn CPU.

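// state: 0 = unlocked, 1 = locked, 2 = locked with possible waiters
adaptive_lock():
    for (i = 0; i < SPIN_LIMIT; i++)
        if (CAS(state, 0, 1)) return;    // brief spin, still in userspace
    while (exchange(state, 2) != 0)      // mark contended; retry after each wake
        futex_wait(state, 2);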
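
In practice you rarely need to write this yourself. On glibc, spin-then-sleep behavior can be requested through the non-portable PTHREAD_MUTEX_ADAPTIVE_NP mutex type. The helper below is a hedged sketch: init_adaptive_mutex is a made-up name, and on other C libraries it simply falls back to a default mutex.

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Initializes *m as an adaptive (spin-then-sleep) mutex where glibc offers it.
   Returns 0 on success or the error code from the failing pthread call. */
int init_adaptive_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
#ifdef __GLIBC__
    /* glibc extension: spin briefly on contention before blocking. */
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
#endif
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

After initialization the mutex is used with the ordinary pthread_mutex_lock and pthread_mutex_unlock calls. Whether the spin helps depends on the workload, so compare it against the default type before committing to it.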

When Each Wins

Use a spinlock when:

You're in a context that can't sleep, such as an interrupt handler or other atomic kernel code.
Critical section is dozens of instructions and contention is rare.
You can guarantee no preemption while holding the lock.

Use a mutex (futex) when:

You're in userspace and uncertain how long the section will run.
Contention may force waits longer than ~10µs.
You want the OS to do its job: scheduling.

Use a reader-writer lock when:

Reads dominate writes.
The critical section is non-trivial and reads can truly proceed in parallel.
(But measure first: readers still update the shared lock word, so an RW lock can be slower than a plain mutex when critical sections are short.)
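
For reference, a reader-writer lock in POSIX C looks roughly like this. The settings record and its functions are hypothetical. The read section here is deliberately tiny to keep the sketch short, which is the case where a plain mutex may well be faster.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

/* Hypothetical read-mostly settings record. */
typedef struct {
    pthread_rwlock_t lock;
    int timeout_ms;
    int max_retries;
} settings;

int settings_init(settings *s, int timeout_ms, int max_retries) {
    s->timeout_ms = timeout_ms;
    s->max_retries = max_retries;
    return pthread_rwlock_init(&s->lock, NULL);
}

/* Readers take the shared lock; any number of them can hold it at once. */
void settings_read(settings *s, int *timeout_ms, int *max_retries) {
    pthread_rwlock_rdlock(&s->lock);
    *timeout_ms = s->timeout_ms;     /* both fields come from the same update */
    *max_retries = s->max_retries;
    pthread_rwlock_unlock(&s->lock);
}

/* Writers take the exclusive lock, so readers never see a half-applied update. */
void settings_update(settings *s, int timeout_ms, int max_retries) {
    pthread_rwlock_wrlock(&s->lock);
    s->timeout_ms = timeout_ms;
    s->max_retries = max_retries;
    pthread_rwlock_unlock(&s->lock);
}
```

The payoff only appears when many threads call settings_read concurrently and the read section does enough work to outweigh the lock's own bookkeeping.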

Things to Avoid

Don't roll your own lock. Memory barriers are easy to get wrong. Bugs are silent and irreproducible.

Don't hold a lock across a blocking syscall. Your critical section now lasts as long as the kernel takes to return, and every waiter stalls with it. Worst case: priority inversion.
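
One common way around this is to copy what you need while holding the lock and make the syscall after releasing it. The sketch below applies that to a small in-memory log. The buffer, its size, and the function names are all hypothetical, and error handling is minimal.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <string.h>
#include <unistd.h>

enum { LOG_CAP = 256 };

/* Hypothetical shared log buffer, protected by log_lock. */
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static char log_buf[LOG_CAP];
static size_t log_len;

/* Appends under the lock: memcpy only, no syscalls. Returns bytes stored. */
size_t log_append(const char *msg) {
    size_t n = strlen(msg);
    pthread_mutex_lock(&log_lock);
    if (n > LOG_CAP - log_len)
        n = LOG_CAP - log_len;            /* truncate when the buffer is full */
    memcpy(log_buf + log_len, msg, n);
    log_len += n;
    pthread_mutex_unlock(&log_lock);
    return n;
}

/* Copies the buffer out under the lock, then writes with the lock released.
   Returns write()'s result, which may be a short count or -1. */
ssize_t log_flush(int fd) {
    char local[LOG_CAP];
    pthread_mutex_lock(&log_lock);
    size_t n = log_len;
    memcpy(local, log_buf, n);
    log_len = 0;
    pthread_mutex_unlock(&log_lock);
    return n ? write(fd, local, n) : 0;   /* the slow part runs unlocked */
}
```

The cost is one extra copy. In exchange, a slow or blocked file descriptor delays only the flushing thread, not every thread that wants to append.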

Don't spin forever in userspace. Without preemption guarantees, the holder may not run for milliseconds.
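
If you do spin in userspace, bound it. The sketch below spins in short bursts and calls sched_yield() between them, so a preempted holder gets a chance to run. The names and the burst size of 100 are arbitrary choices for illustration. This is a mitigation rather than a fix, and a futex-based mutex is usually the better tool.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } bounded_spinlock;

/* Tries up to max_spins times, then gives up so the caller can back off. */
bool spin_try_lock_bounded(bounded_spinlock *s, int max_spins) {
    for (int i = 0; i < max_spins; i++) {
        /* Cheap load first; attempt the exchange only when the lock looks free. */
        if (!atomic_load_explicit(&s->locked, memory_order_relaxed) &&
            !atomic_exchange_explicit(&s->locked, 1, memory_order_acquire))
            return true;
    }
    return false;
}

/* Spins in bounded bursts and yields the CPU between bursts. */
void spin_lock_yielding(bounded_spinlock *s) {
    while (!spin_try_lock_bounded(s, 100))
        sched_yield();   /* let the (possibly preempted) holder run */
}

void bounded_spin_unlock(bounded_spinlock *s) {
    atomic_store_explicit(&s->locked, 0, memory_order_release);
}
```

Callers that can do other useful work may prefer spin_try_lock_bounded directly and treat a false return as "come back later".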

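Don't ignore the lock granularity question. Finer locks are not automatically faster: each one adds atomic operations and cache-line traffic, so two coarse locks can beat ten fine ones when critical sections are short. Split a lock only after profiling shows it is the contended one.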

Takeaways

Spinlocks are for when you can't sleep. That's the kernel and a few embedded contexts. Not most user code.
Modern userspace mutexes on Linux are futex-based: nearly free when uncontended, kernel-assisted when contended.
Adaptive locks combine a brief spin with a futex fallback. glibc offers this as an opt-in mutex type.
The right lock depends on critical-section length and contention rate. Measure both before optimizing.