Cache Lines and False Sharing: The Hidden Tax on Multithreaded C
You wrote a multithreaded program. You gave each thread its own counter, no shared state, no locks. You scale it from 1 thread to 16 — and throughput goes down. What happened?
You hit false sharing. Different threads, different variables, but the same cache line.
What's a Cache Line?
The CPU doesn't load memory one byte at a time. It loads cache lines — typically 64 bytes on x86-64. Read one byte, the whole line comes with it.
Memory: [ ............ 64 bytes ............ ]
          ↑                 ↑
          counter_a         counter_b
          (8 bytes)         (8 bytes)
          offset 0          offset 32
If counter_a and counter_b sit within the same 64 bytes, as they do here, they share a cache line, and that's where the trouble starts.
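To make "same line" concrete, here is a minimal sketch that maps an address to its line index by integer division. The names line_index, same_cache_line, and demo_buf are made up for illustration, and the 64-byte line size is an assumption rather than something the code detects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size; 64 bytes is typical for x86-64 but not guaranteed. */
#define CACHE_LINE_SIZE 64

/* Index of the cache line that contains the given address.
   Converting a pointer to an integer is implementation-defined,
   but behaves as expected on mainstream flat-memory platforms. */
static uintptr_t line_index(const void *p) {
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* True when both addresses fall inside the same cache line. */
static bool same_cache_line(const void *a, const void *b) {
    assert(a != NULL && b != NULL);
    return line_index(a) == line_index(b);
}

/* A buffer aligned to a line boundary, so offsets map directly to lines. */
static _Alignas(CACHE_LINE_SIZE) unsigned char demo_buf[2 * CACHE_LINE_SIZE];
```

With this, same_cache_line(&demo_buf[0], &demo_buf[32]) is true (the layout in the diagram), while offsets 0 and 64 land on different lines.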
The Cache Coherence Tax
Multicore CPUs use MESI or a close variant of it (MESIF on Intel, MOESI on AMD) to keep caches consistent. In plain MESI, each cache line is in one of four states: Modified, Exclusive, Shared, Invalid.
When Core 1 writes to its copy of a line, every other core's copy is invalidated. They must re-fetch on next access.
Core 1 writes counter_a → Core 2's line invalidated
Core 2 writes counter_b → Core 1's line invalidated
Core 1 writes counter_a → Core 2's line invalidated
... ping-pong forever ...
The variables don't overlap, but the line does. In the worst case, every write first has to win the line back from the other core, at the cost of a coherence miss.
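The ping-pong can be counted with a toy model. This is not a MESI implementation: it only tracks which core last wrote each line and counts every write that finds the line owned by someone else. The struct write_op, count_line_transfers, and the 16-line table are all hypothetical, and the 64-byte line size is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumed line size */
#define MODEL_LINES 16      /* the toy model tracks only this many lines */

struct write_op {
    int core;     /* which core performs the write */
    size_t addr;  /* byte address being written */
};

/* Counts how often a write finds its line owned by a different core.
   Each such hit stands in for one invalidate-and-refetch round trip. */
static size_t count_line_transfers(const struct write_op *ops, size_t n) {
    assert(ops != NULL || n == 0);

    int owner[MODEL_LINES];
    for (size_t i = 0; i < MODEL_LINES; i++) owner[i] = -1;  /* -1: unowned */

    size_t transfers = 0;
    for (size_t i = 0; i < n; i++) {
        size_t line = (ops[i].addr / CACHE_LINE_SIZE) % MODEL_LINES;
        if (owner[line] != -1 && owner[line] != ops[i].core) transfers++;
        owner[line] = ops[i].core;  /* the writer now holds the line */
    }
    return transfers;
}
```

Feed it two cores alternating writes at offsets 0 and 32 and every write after the first counts as a transfer. Move the second counter to offset 64 and the count drops to zero.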
A Concrete Demo
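#include <pthread.h>

struct counters {
    volatile long a; // used by thread 1
    volatile long b; // used by thread 2
} ctrs;              // both in one 64-byte line

// volatile keeps the optimizer from collapsing the loop into a single add
void *worker(void *arg) {
    volatile long *p = arg;
    for (long i = 0; i < 1000000000; i++) (*p)++;
    return NULL;
}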
Launch two threads on &ctrs.a and &ctrs.b. Measure runtime. Now add padding:
struct counters {
    volatile long a;
    char pad[64 - sizeof(long)]; // fill the rest of a's cache line
    volatile long b;             // starts 64 bytes after a, so on a different line
} ctrs;
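For anyone who wants to reproduce the comparison, here is one possible harness, sketched under assumptions: POSIX threads and clock_gettime are available, the line size is 64 bytes, and the names bump, time_two_writers, shared_pair, and padded_pair are invented for this example. It is not the program behind the numbers below.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

#define CACHE_LINE_SIZE 64  /* assumed line size */

struct shared_pair { volatile long a; volatile long b; };
struct padded_pair {
    volatile long a;
    char pad[CACHE_LINE_SIZE - sizeof(long)];  /* pushes b onto the next line */
    volatile long b;
};

struct job { volatile long *counter; long iters; };

/* Thread body: increment one counter iters times. */
static void *bump(void *arg) {
    struct job *j = arg;
    for (long i = 0; i < j->iters; i++) (*j->counter)++;
    return NULL;
}

/* Runs one thread per counter and returns elapsed wall-clock seconds,
   or -1.0 if a thread could not be started. */
static double time_two_writers(volatile long *a, volatile long *b, long iters) {
    assert(iters >= 0);
    struct job ja = { a, iters }, jb = { b, iters };
    pthread_t ta, tb;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (pthread_create(&ta, NULL, bump, &ja) != 0) return -1.0;
    if (pthread_create(&tb, NULL, bump, &jb) != 0) {
        pthread_join(ta, NULL);
        return -1.0;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Call time_two_writers once with the two fields of a struct shared_pair and once with those of a struct padded_pair, using the same iteration count, and compare the two results. Build with -pthread, and expect the gap to vary with the CPU and with which cores the scheduler picks.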
Numbers
Measured on a typical x86-64 desktop, 2 threads, 1B increments each:
| Layout | Time | Speedup |
|---|---|---|
| Same cache line | 4.2 s | 1.0x |
| Padded to 64 bytes | 1.1 s | 3.8x |
Nearly a four-fold slowdown from a layout decision alone. No locks involved.
Detecting It
perf stat gives a first, coarse signal: a high cache-miss rate in code that touches very little memory.
perf stat -e cache-misses,cache-references ./your_program
A per-cache-line breakdown lives in perf c2c (cache-to-cache):
perf c2c record ./your_program
perf c2c report
It highlights HITM (hit-modified) events — the smoking gun for false sharing.
Fixing It Cleanly
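Option 1: explicit alignment and padding.
#include <stdalign.h> // provides alignas in C11; a keyword in C23

struct counter {
    alignas(64) long value;      // start on a cache-line boundary
    char pad[64 - sizeof(long)]; // fill the rest of the line
};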
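A layout like this is easy to get subtly wrong, so it helps to let the compiler check it. The sketch below assumes C11, a 64-byte line, and a hypothetical limit of 16 threads; padded_counter and counters are illustrative names. It also shows that alignas on the member already rounds the struct size up to 64, which makes a manual pad array optional.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumed line size */
#define MAX_THREADS 16      /* hypothetical upper bound on worker threads */

struct padded_counter {
    /* alignas raises the struct's alignment to 64, and the compiler
       rounds sizeof up to a multiple of that alignment. */
    alignas(CACHE_LINE_SIZE) long value;
};

/* Fail the build if the layout ever stops matching the assumption. */
static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "each counter must occupy exactly one cache line");
static_assert(alignof(struct padded_counter) == CACHE_LINE_SIZE,
              "each counter must start on a cache-line boundary");

/* One slot per thread; neighbouring slots never share a line. */
static struct padded_counter counters[MAX_THREADS];
```

Thread t then increments counters[t].value and nothing else. The cost is memory: 64 bytes per counter instead of 8.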
Option 2: per-thread allocations, each on its own cache line.
long *counter = aligned_alloc(64, 64); // C11: size must be a multiple of the alignment
Option 3: stack-local accumulators, merge once at the end. The cleanest answer when you can use it.
long local = 0;
for (...) local++;
__atomic_fetch_add(&shared, local, __ATOMIC_RELAXED);
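Here is Option 3 spelled out as a fuller sketch. Counting even values in an array is only a stand-in workload, and count_even, count_even_parallel, struct slice, and the 64-worker cap are all hypothetical. The code assumes POSIX threads and the GCC/Clang __atomic builtins already used above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 64  /* hypothetical cap on worker threads */

struct slice {
    const int *data;  /* start of this worker's part of the input */
    size_t len;       /* number of elements in the part */
    long *total;      /* shared result, touched once per worker */
};

static void *count_even(void *arg) {
    const struct slice *s = arg;
    long local = 0;  /* lives in a register or on this thread's stack */
    for (size_t i = 0; i < s->len; i++)
        if (s->data[i] % 2 == 0) local++;
    /* The only write to shared memory: one atomic add per thread. */
    __atomic_fetch_add(s->total, local, __ATOMIC_RELAXED);
    return NULL;
}

/* Splits data across nthreads workers; returns the count, or -1 if a
   thread could not be started. */
static long count_even_parallel(const int *data, size_t len, int nthreads) {
    assert(data != NULL || len == 0);
    pthread_t tid[MAX_WORKERS];
    struct slice part[MAX_WORKERS];
    long total = 0;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_WORKERS) nthreads = MAX_WORKERS;
    size_t chunk = len / (size_t)nthreads;

    int started = 0;
    for (int t = 0; t < nthreads; t++) {
        size_t begin = (size_t)t * chunk;
        size_t end = (t == nthreads - 1) ? len : begin + chunk;  /* last takes the remainder */
        part[t] = (struct slice){ data + begin, end - begin, &total };
        if (pthread_create(&tid[t], NULL, count_even, &part[t]) != 0) break;
        started++;
    }
    for (int t = 0; t < started; t++) pthread_join(tid[t], NULL);
    return (started == nthreads) ? total : -1;
}
```

The part array is packed tightly, and that's fine: workers only read their slice descriptor, and read-only sharing doesn't bounce lines around. The shared total sees one write per thread instead of one per element.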
Where It Hides
- Adjacent fields in a struct touched by different threads.
- Per-CPU stats arrays sized to sizeof(long) per CPU instead of one cache line per CPU.
- Lock-free queues where head and tail share a line: the producer and consumer fight over it on every push/pop.
- Reference counts next to data fields.
The Linux kernel sprinkles ____cacheline_aligned_in_smp all over its hot structs for exactly this reason.
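In userspace C11, the same idea for the queue case might look like the sketch below. It shows layout only, with no push or pop logic. The struct name spsc_ring, the 1024-slot capacity, and the 64-byte line size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumed line size */
#define RING_CAPACITY 1024  /* hypothetical queue capacity */

/* Single-producer, single-consumer ring: each index gets its own line. */
struct spsc_ring {
    alignas(CACHE_LINE_SIZE) atomic_size_t head;  /* written by the consumer */
    alignas(CACHE_LINE_SIZE) atomic_size_t tail;  /* written by the producer */
    alignas(CACHE_LINE_SIZE) void *slots[RING_CAPACITY];
};

/* Fail the build if a later edit puts two hot fields on one line. */
static_assert(offsetof(struct spsc_ring, tail)
              - offsetof(struct spsc_ring, head) >= CACHE_LINE_SIZE,
              "head and tail must not share a cache line");
static_assert(offsetof(struct spsc_ring, slots)
              - offsetof(struct spsc_ring, tail) >= CACHE_LINE_SIZE,
              "tail and the slot array must not share a cache line");
```

The static_assert lines are the useful part: they turn a silent performance regression into a compile error when someone reorders or adds fields.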
Takeaways
- One cache line, two writers, two cores → ping-pong.
- Pad hot per-thread state to 64 bytes (or 128 on some ARM cores).
- Use perf c2c to find it. The cost is real but invisible without measurement.
- In hot loops, keep accumulation thread-local and merge at the end.
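Since the right padding depends on the line size, it can be worth asking the system rather than hard-coding 64. A minimal sketch, assuming glibc on Linux: _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, not POSIX, and may report 0 or be missing elsewhere, so the hypothetical helper cache_line_size falls back to an assumed 64.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define FALLBACK_LINE_SIZE 64  /* assumption used when the query is unavailable */

static_assert((FALLBACK_LINE_SIZE & (FALLBACK_LINE_SIZE - 1)) == 0,
              "fallback line size must be a power of two");

/* Returns the L1 data cache line size in bytes, or the fallback. */
static size_t cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  /* glibc extension */
    if (n > 0) return (size_t)n;
#endif
    return FALLBACK_LINE_SIZE;
}
```

alignas needs a compile-time constant, so a runtime value like this is mostly useful for sizing heap allocations or for a startup check that the compiled-in padding is at least as large as the real line.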