Cache Lines and False Sharing: The Hidden Tax on Multithreaded C

You wrote a multithreaded program. You gave each thread its own counter, no shared state, no locks. You scale it from 1 thread to 16 — and throughput goes down. What happened?

You hit false sharing. Different threads, different variables, but the same cache line.

What's a Cache Line?

The CPU doesn't load memory one byte at a time. It loads cache lines — typically 64 bytes on x86-64. Read one byte, the whole line comes with it.

Memory:  [ ............ 64 bytes ............ ]
           ↑                 ↑
         counter_a         counter_b
         (8 bytes)         (8 bytes)
         offset 0          offset 32

If counter_a and counter_b fall within the same 64-byte line, the hardware treats them as one unit: a write to either one affects every cached copy of both. That's where the trouble starts.
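To check whether two addresses land in the same line, integer-divide each address by the line size and compare. A minimal sketch, assuming a 64-byte line; CACHE_LINE_SIZE, cache_line_index and same_cache_line are illustrative names, not part of any library:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed line size: 64 bytes is typical on x86-64 but not universal. */
#define CACHE_LINE_SIZE 64

/* Index of the cache line that contains address p. */
static uintptr_t cache_line_index(const void *p)
{
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* True when both addresses fall inside the same cache line. */
static bool same_cache_line(const void *a, const void *b)
{
    return cache_line_index(a) == cache_line_index(b);
}
```

Calling same_cache_line(&x, &y) on two hot variables in a debug build is a quick way to confirm a suspicion before reaching for a profiler.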

The Cache Coherence Tax

Multicore CPUs keep their caches consistent with a coherence protocol, usually MESI or a variant of it (MESIF on Intel, MOESI on AMD). In MESI, each cached line is in one of four states: Modified, Exclusive, Shared, Invalid.

When Core 1 writes to its copy of a line, every other core's copy is invalidated. They must re-fetch on next access.

Core 1 writes counter_a  →  Core 2's line invalidated
Core 2 writes counter_b  →  Core 1's line invalidated
Core 1 writes counter_a  →  Core 2's line invalidated
         ...     ping-pong forever     ...

The variables don't overlap, but the line does. In the worst case, every write has to pull the line back from the other core before it can proceed.
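The ping-pong can be mimicked with a toy model. This is a heavily simplified sketch, not how real hardware is implemented: it tracks one line across two cores, models only writes, and counts how often a core has to request ownership and how often a remote copy is invalidated. All names (line_model, model_write and so on) are made up for illustration.

```c
/* The four MESI states of one cached line. */
enum mesi_state { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

#define MODEL_CORES 2

/* One cache line as seen by each core, plus event counters. */
struct line_model {
    enum mesi_state state[MODEL_CORES];
    unsigned long ownership_requests; /* writes that had to acquire the line */
    unsigned long invalidations;      /* remote copies knocked out by a write */
};

static void model_init(struct line_model *m)
{
    for (int i = 0; i < MODEL_CORES; i++)
        m->state[i] = MESI_INVALID;
    m->ownership_requests = 0;
    m->invalidations = 0;
}

/* Core `core` writes to the line. */
static void model_write(struct line_model *m, int core)
{
    /* From Invalid or Shared, the writer must first gain ownership. */
    if (m->state[core] == MESI_INVALID || m->state[core] == MESI_SHARED)
        m->ownership_requests++;

    /* Every other valid copy is invalidated. */
    for (int i = 0; i < MODEL_CORES; i++) {
        if (i != core && m->state[i] != MESI_INVALID) {
            m->state[i] = MESI_INVALID;
            m->invalidations++;
        }
    }
    m->state[core] = MESI_MODIFIED;
}
```

Alternating model_write(&m, 0) and model_write(&m, 1) makes every write an ownership request, while repeated writes from a single core pay that cost only once.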

A Concrete Demo

struct counters {
    long a;   // used by thread 1
    long b;   // used by thread 2
} ctrs;       // 16 bytes total, so a and b almost always share one 64-byte line

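void *worker(void *arg) {
    // volatile forces a real load and store on every iteration; without it
    // an optimizing compiler may collapse the whole loop into a single add
    volatile long *p = arg;
    for (long i = 0; i < 1000000000; i++) (*p)++;
    return NULL;
}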

Launch two threads on &ctrs.a and &ctrs.b. Measure runtime. Now add padding:

struct counters {
    long a;
    char pad[56];   // pad to fill the rest of the cache line
    long b;
} ctrs;
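A possible harness for timing either layout is sketched below. It assumes POSIX threads; struct job, bump and time_pair are hypothetical names, and the iteration count is a parameter so you can start small. The counters are accessed through volatile pointers so the loop survives optimization. timespec_get is wall-clock time, which is good enough for a rough comparison; on POSIX you may prefer clock_gettime with CLOCK_MONOTONIC.

```c
#include <pthread.h>
#include <time.h>

/* Work description for one thread (illustrative). */
struct job {
    volatile long *counter; /* the counter this thread increments */
    long iters;             /* how many increments to perform */
};

static void *bump(void *arg)
{
    struct job *j = arg;
    for (long i = 0; i < j->iters; i++)
        (*j->counter)++;
    return NULL;
}

/* Runs two threads, one per counter, and returns the elapsed wall-clock
   seconds, or -1.0 if a thread could not be created. */
static double time_pair(volatile long *a, volatile long *b, long iters)
{
    struct job ja = { a, iters };
    struct job jb = { b, iters };
    pthread_t ta, tb;
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    if (pthread_create(&ta, NULL, bump, &ja) != 0)
        return -1.0;
    if (pthread_create(&tb, NULL, bump, &jb) != 0) {
        pthread_join(ta, NULL);
        return -1.0;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    timespec_get(&t1, TIME_UTC);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Call time_pair(&ctrs.a, &ctrs.b, n) once with each struct layout and compare the two results. Build with -pthread, and run each variant several times, since scheduling noise can easily swamp a single measurement.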

Numbers

Measured on a typical x86-64 desktop, 2 threads, 1B increments each:

Layout	Time	Speedup
Same cache line	4.2 s	1.0x
Padded to 64 bytes	1.1 s	3.8x

A nearly four-fold slowdown from a layout decision, with no locks involved.

Detecting It

perf stat gives a coarse first hint, though the generic cache-misses counter won't tell you which lines are contended:

perf stat -e cache-misses,cache-references ./your_program

A per-cache-line breakdown lives in perf c2c (cache-to-cache):

perf c2c record ./your_program
perf c2c report

It highlights HITM (hit-modified) events: loads that found the line sitting in Modified state in another core's cache. That's the smoking gun for false sharing.

Fixing It Cleanly

Option 1: explicit padding.

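#include <stdalign.h>   // alignas in C11; a built-in keyword from C23 on

struct counter {
    alignas(64) long value;        // struct starts on a 64-byte boundary
    char pad[64 - sizeof(long)];   // fills out the line: sizeof(struct counter) == 64
};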
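Since a struct's size is always a multiple of its alignment, alignas(64) alone already rounds the struct up to a full line, and the explicit pad array becomes optional. Here is a sketch of a per-thread counter array built that way, with compile-time checks. The names padded_counter, per_thread, MAX_THREADS and sum_counters are illustrative, and 64 is an assumed line size.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumed; some cores use 128 */
#define MAX_THREADS 16       /* arbitrary upper bound for the sketch */

/* One counter per cache line: the alignment forces the size up to 64. */
struct padded_counter {
    alignas(CACHE_LINE_SIZE) long value;
};

/* Fail the build if the layout ever stops matching the assumption. */
_Static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
               "each counter must occupy exactly one cache line");
_Static_assert(alignof(struct padded_counter) == CACHE_LINE_SIZE,
               "each counter must start on a cache-line boundary");

/* Thread i only ever writes per_thread[i].value. */
static struct padded_counter per_thread[MAX_THREADS];

/* Read-side merge: sum all slots, normally after the workers are done. */
static long sum_counters(const struct padded_counter *c, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += c[i].value;
    return total;
}
```

The static assertions are the useful part: if someone later adds a field or changes the line-size constant, the build breaks instead of the benchmark.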

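Option 2: per-thread allocations on separate cache lines.

// aligned_alloc (<stdlib.h>, C11) wants the size to be a multiple of the alignment
long *counter = aligned_alloc(64, 64);   // one full line per counter, so nothing else shares it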

Option 3: stack-local accumulators, merge once at the end. The cleanest answer when you can use it.

long local = 0;
for (...) local++;
__atomic_fetch_add(&shared, local, __ATOMIC_RELAXED);
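Filled out into a complete function, the pattern might look like the sketch below. It assumes POSIX threads and the GCC/Clang __atomic builtins; merge_job, count_locally, count_with_merge and MAX_WORKERS are hypothetical names. Each thread touches the shared total exactly once.

```c
#include <pthread.h>

#define MAX_WORKERS 64   /* arbitrary cap for the sketch */

struct merge_job {
    long *shared;   /* the single shared total */
    long iters;     /* increments each worker performs */
};

static void *count_locally(void *arg)
{
    struct merge_job *j = arg;
    long local = 0;   /* private to this thread: register or stack */

    for (long i = 0; i < j->iters; i++)
        local++;

    /* The only write to shared memory: one atomic add per thread. */
    __atomic_fetch_add(j->shared, local, __ATOMIC_RELAXED);
    return NULL;
}

/* Runs nthreads workers and returns the merged total, or -1 when
   nthreads is out of range or a thread could not be created. */
static long count_with_merge(int nthreads, long iters)
{
    pthread_t tids[MAX_WORKERS];
    long shared = 0;
    struct merge_job job = { &shared, iters };
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_WORKERS)
        return -1;

    for (; started < nthreads; started++)
        if (pthread_create(&tids[started], NULL, count_locally, &job) != 0)
            break;

    /* pthread_join orders each worker's add before the read below. */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return started == nthreads ? shared : -1;
}
```

An optimizer will likely reduce the trivial local++ loop to a single assignment. In real code the loop body does actual work, and the point stands: the hot path never writes to shared memory.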

Where It Hides

Adjacent fields in a struct touched by different threads.
Per-CPU stats arrays sized to sizeof(long) per CPU instead of one cache line per CPU.
Lock-free queues where head and tail share a line — the producer and consumer fight over it on every push/pop.
Reference counts next to data fields.
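For the queue case, the fix is purely a matter of layout. The sketch below shows only the struct, not a working queue (a real one would also need atomic loads and stores on head and tail). spsc_ring, RING_SLOTS and the 64-byte line size are assumptions made for illustration.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumed line size */
#define RING_SLOTS 256       /* arbitrary capacity for the sketch */

/* Single-producer, single-consumer ring: layout only. */
struct spsc_ring {
    alignas(CACHE_LINE_SIZE) size_t head;              /* written by the consumer */
    alignas(CACHE_LINE_SIZE) size_t tail;              /* written by the producer */
    alignas(CACHE_LINE_SIZE) void *slots[RING_SLOTS];  /* payload, on its own lines */
};

/* head, tail and the slot array must each start on a different line. */
_Static_assert(offsetof(struct spsc_ring, tail)
               - offsetof(struct spsc_ring, head) >= CACHE_LINE_SIZE,
               "head and tail share a cache line");
_Static_assert(offsetof(struct spsc_ring, slots)
               - offsetof(struct spsc_ring, tail) >= CACHE_LINE_SIZE,
               "tail and slots share a cache line");
```

The cost is roughly 112 bytes of padding per queue, which is usually a fair price for keeping the producer and consumer from invalidating each other on every operation.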

The Linux kernel sprinkles ____cacheline_aligned_in_smp all over its hot structs for exactly this reason.

Takeaways

One cache line, two writers, two cores → ping-pong.
Pad hot per-thread state to 64 bytes (or 128 on some ARM cores).
Use perf c2c to find it. The cost is real but invisible without measurement.
In hot loops, keep accumulation thread-local and merge at the end.