Cache Lines and False Sharing: The Hidden Tax on Multithreaded C

You wrote a multithreaded program. You gave each thread its own counter, no shared state, no locks. You scale it from 1 thread to 16 — and throughput goes down. What happened?

You hit false sharing. Different threads, different variables, but the same cache line.

What's a Cache Line?

The CPU doesn't load memory one byte at a time. It loads cache lines — typically 64 bytes on x86-64. Read one byte, the whole line comes with it.

Memory:  [ ............ 64 bytes ............ ]
           ↑                ↑
         counter_a        counter_b
         (8 bytes)        (8 bytes)
         offset 0         offset 32

counter_a and counter_b are separate variables, but both fall inside the same 64-byte line. The hardware tracks ownership per line, not per variable, and that's where the trouble starts.
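To make "same line" concrete, here is a minimal sketch that maps an address to a line index by integer division. The 64-byte constant and the helper names (cache_line_index, same_cache_line) are assumptions for illustration, not part of any library.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption: 64-byte lines, as on typical x86-64 */

/* Index of the cache line containing addr. Treating a pointer as an
 * integer is implementation-defined, but behaves as expected on
 * mainstream flat-address platforms. */
static uintptr_t cache_line_index(const void *addr) {
    return (uintptr_t)addr / CACHE_LINE_SIZE;
}

/* True when both addresses fall inside the same cache line. */
static bool same_cache_line(const void *a, const void *b) {
    return cache_line_index(a) == cache_line_index(b);
}
```

With a 64-byte-aligned buffer, offsets 0 and 32 report the same line, while offsets 0 and 64 do not.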

The Cache Coherence Tax

Multicore CPUs keep caches consistent with MESI or a close variant (Intel uses MESIF, AMD uses MOESI). In plain MESI, each cache line is in one of four states: Modified, Exclusive, Shared, Invalid.

When Core 1 writes to its copy of a line, every other core's copy is invalidated. They must re-fetch on next access.

Core 1 writes counter_a  →  Core 2's line invalidated
Core 2 writes counter_b  →  Core 1's line invalidated
Core 1 writes counter_a  →  Core 2's line invalidated
         ...     ping-pong forever     ...

The variables don't overlap, but the line does. Each write can force the whole line to move from one core to the other, at roughly the cost of a cache miss.
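The ping-pong can be counted with a toy model. This is only a sketch of the write-invalidate rule described above, not how real hardware is built: it tracks two states per core per line and models writes only. All names (toy_caches, toy_write, toy_alternate) are hypothetical.

```c
#define TOY_CORES 2
#define TOY_LINES 2

/* Only writes are modelled, so a line is either Modified or Invalid. */
enum toy_state { TOY_INVALID, TOY_MODIFIED };

struct toy_caches {
    enum toy_state state[TOY_CORES][TOY_LINES];
    unsigned long invalidations;
};

/* A write takes ownership of the line and invalidates every other copy. */
static void toy_write(struct toy_caches *c, int core, int line) {
    for (int other = 0; other < TOY_CORES; other++) {
        if (other != core && c->state[other][line] != TOY_INVALID) {
            c->state[other][line] = TOY_INVALID;
            c->invalidations++;
        }
    }
    c->state[core][line] = TOY_MODIFIED;
}

/* Core 0 writes line0 and core 1 writes line1, alternating for `rounds`
 * rounds. Returns how many invalidations that caused. */
static unsigned long toy_alternate(int line0, int line1, int rounds) {
    struct toy_caches c = {0};  /* everything starts Invalid */
    for (int r = 0; r < rounds; r++) {
        toy_write(&c, 0, line0);
        toy_write(&c, 1, line1);
    }
    return c.invalidations;
}
```

In this model, toy_alternate(0, 0, 1000) reports 1999 invalidations (every write after the first one), while toy_alternate(0, 1, 1000) reports none.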

A Concrete Demo

c
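struct counters {
    long a;   // used by thread 1
    long b;   // used by thread 2
} ctrs;       // 16 bytes total, so a and b normally land in one 64-byte line

void *worker(void *arg) {
    // volatile forces a real load and store on every iteration; without it,
    // an optimizing compiler can collapse the loop into a single addition
    // and the effect disappears.
    volatile long *p = arg;
    for (long i = 0; i < 1000000000; i++) (*p)++;
    return NULL;
}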

Launch two threads on &ctrs.a and &ctrs.b. Measure runtime. Now add padding:

c
struct counters {
    long a;
    char pad[56];   // pad to fill the rest of the cache line
    long b;
} ctrs;
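A harness along these lines could run both layouts. It is a sketch that assumes POSIX threads and clock_gettime. The names (same_line_counters, padded_counters, count_job, count_worker, time_pair) are made up for this example, and the first struct is aligned to 64 bytes so its two fields are guaranteed to share a line.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdalign.h>
#include <time.h>

/* Both fields in one line: the struct starts on a 64-byte boundary. */
struct same_line_counters { alignas(64) long a; long b; };

/* Fields 64 bytes apart, so they can never share a 64-byte line. */
struct padded_counters { long a; char pad[64 - sizeof(long)]; long b; };

struct count_job { volatile long *counter; long iterations; };

static void *count_worker(void *arg) {
    struct count_job *job = arg;
    /* volatile keeps one load and one store per iteration */
    for (long i = 0; i < job->iterations; i++) (*job->counter)++;
    return NULL;
}

/* Runs two threads, one per counter, and returns elapsed wall-clock
 * seconds, or -1.0 if a thread could not be created. */
static double time_pair(volatile long *a, volatile long *b, long iterations) {
    struct count_job ja = { a, iterations }, jb = { b, iterations };
    struct timespec t0, t1;
    pthread_t ta, tb;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (pthread_create(&ta, NULL, count_worker, &ja) != 0) return -1.0;
    if (pthread_create(&tb, NULL, count_worker, &jb) != 0) {
        pthread_join(ta, NULL);
        return -1.0;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_pair(&s.a, &s.b, 1000000000L) on each struct in turn and printing the two results gives the comparison. Absolute numbers will vary with the CPU, the compiler flags, and which cores the scheduler picks.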

Numbers

Measured on a typical x86-64 desktop, 2 threads, 1B increments each:

Layout               Time    Speedup
Same cache line      4.2 s   1.0x
Padded to 64 bytes   1.1 s   3.8x

A nearly four-fold slowdown from a layout decision alone. No locks involved.

Detecting It

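perf stat gives a first hint when you compare the two layouts, though the generic cache-misses event may not count every line transfer between cores: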

perf stat -e cache-misses,cache-references ./your_program

A per-cache-line breakdown lives in perf c2c (cache-to-cache):

perf c2c record ./your_program
perf c2c report

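It highlights HITM (hit-modified) events: loads that hit a line another core holds in Modified state. A line with many HITMs, touched at different offsets by different threads, is the smoking gun for false sharing.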

Fixing It Cleanly

Option 1: explicit padding.

c
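#include <stdalign.h>   // alignas (a keyword from C23 onward)

struct counter {
    alignas(64) long value;        // start each counter on a line boundary
    char pad[64 - sizeof(long)];   // fill out the rest of the line
};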
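One way to keep this honest is to let the compiler check the layout. The sketch below relies on the C11 rule that a struct's size is a multiple of its alignment, so alignas on the member already produces the tail padding. MAX_THREADS, padded_counter, per_thread and counter_for are hypothetical names.

```c
#include <assert.h>
#include <stdalign.h>

#define CACHE_LINE 64
#define MAX_THREADS 16  /* hypothetical upper bound on worker threads */

struct padded_counter {
    alignas(CACHE_LINE) long value;  /* compiler adds the tail padding */
};

/* Fail the build if the layout ever stops matching the assumption. */
static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each counter should occupy exactly one cache line");
static_assert(alignof(struct padded_counter) == CACHE_LINE,
              "each counter should start on a cache-line boundary");

/* One slot per thread; neighbouring slots never share a line. */
static struct padded_counter per_thread[MAX_THREADS];

static long *counter_for(int thread_index) {
    return &per_thread[thread_index].value;
}
```

Each thread then increments *counter_for(id) and nothing else. If someone later adds a field that pushes the struct past one line, the build fails.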

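Option 2: per-thread allocations, each on its own cache line. Request a full line per counter: a smaller size could let the allocator hand the rest of the line to another object, and aligned_alloc expects the size to be a multiple of the alignment.

c
long *counter = aligned_alloc(64, 64);  // one whole 64-byte line per counter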
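Wrapped in a helper, a per-thread allocation might look like the sketch below. alloc_line_counter is a hypothetical name, and the size argument is a full line so that nothing else can be placed next to the counter.

```c
#include <stdlib.h>

#define CACHE_LINE 64  /* assumption: 64-byte lines */

/* Allocates one counter that owns an entire cache line.
 * Returns NULL on failure; release the result with free(). */
static long *alloc_line_counter(void) {
    /* size is a multiple of the alignment, as aligned_alloc expects */
    long *counter = aligned_alloc(CACHE_LINE, CACHE_LINE);
    if (counter != NULL) {
        *counter = 0;
    }
    return counter;
}
```

Each thread calls alloc_line_counter() once at startup and frees the pointer when it exits.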

Option 3: stack-local accumulators, merge once at the end. The cleanest answer when you can use it.

c
long local = 0;
for (...) local++;
// GCC/Clang builtin; atomic_fetch_add_explicit from <stdatomic.h> is the portable C11 form
__atomic_fetch_add(&shared, local, __ATOMIC_RELAXED);
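Filled out with threads, the pattern could look like this. It is a sketch assuming POSIX threads and the same GCC/Clang builtins used above. NUM_WORKERS, shared_total, sum_job, accumulate and run_workers are names invented for the example.

```c
#include <pthread.h>

#define NUM_WORKERS 4  /* hypothetical thread count */

static long shared_total;  /* touched once per thread, not once per iteration */

struct sum_job { long iterations; };

static void *accumulate(void *arg) {
    const struct sum_job *job = arg;
    long local = 0;  /* lives in a register or on this thread's stack */
    for (long i = 0; i < job->iterations; i++) local++;
    /* single merge into the shared line */
    __atomic_fetch_add(&shared_total, local, __ATOMIC_RELAXED);
    return NULL;
}

/* Starts the workers, waits for them, and returns the merged total. */
static long run_workers(long iterations) {
    pthread_t threads[NUM_WORKERS];
    struct sum_job job = { iterations };
    int started = 0;

    __atomic_store_n(&shared_total, 0, __ATOMIC_RELAXED);
    for (; started < NUM_WORKERS; started++) {
        if (pthread_create(&threads[started], NULL, accumulate, &job) != 0)
            break;  /* join whatever did start */
    }
    for (int i = 0; i < started; i++) pthread_join(threads[i], NULL);
    return __atomic_load_n(&shared_total, __ATOMIC_RELAXED);
}
```

The shared line is written NUM_WORKERS times in total, so there is almost nothing left to ping-pong. The compiler is also free to keep local in a register or fold the loop, which is the point of this option.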

Where It Hides

  • Adjacent fields in a struct touched by different threads.
  • Per-CPU stats arrays sized to sizeof(long) per CPU instead of one cache line per CPU.
  • Lock-free queues where head and tail share a line — the producer and consumer fight over it on every push/pop.
  • Reference counts next to data fields.

The Linux kernel sprinkles ____cacheline_aligned_in_smp all over its hot structs for exactly this reason.
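For the queue case in the list above, a user-space analogue is to give head and tail a line each. The following is a minimal single-producer, single-consumer ring written with C11 atomics. It is a sketch rather than a production queue, and every name in it (spsc_queue, spsc_init, spsc_push, spsc_pop, QUEUE_CAPACITY) is hypothetical.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64
#define QUEUE_CAPACITY 16  /* power of two, so index wraparound stays consistent */

struct spsc_queue {
    alignas(CACHE_LINE) atomic_size_t head;  /* written only by the consumer */
    alignas(CACHE_LINE) atomic_size_t tail;  /* written only by the producer */
    alignas(CACHE_LINE) int slots[QUEUE_CAPACITY];
};

static void spsc_init(struct spsc_queue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side. Returns false when the queue is full. */
static bool spsc_push(struct spsc_queue *q, int value) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_CAPACITY) return false;
    q->slots[tail % QUEUE_CAPACITY] = value;
    /* release: the slot write becomes visible before the new tail */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the queue is empty. */
static bool spsc_pop(struct spsc_queue *q, int *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return false;
    *out = q->slots[head % QUEUE_CAPACITY];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

The producer still reads head and the consumer still reads tail, so some genuine sharing remains. What the alignment removes is the case where a write to one index invalidates the other side's copy of its own index.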

Takeaways

  • One cache line, two writers, two cores → ping-pong.
  • Pad hot per-thread state to 64 bytes (or 128 on some ARM cores).
  • Use perf c2c to find it. The cost is real but invisible without measurement.
  • In hot loops, keep accumulation thread-local and merge at the end.
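Since the right padding depends on the machine, it can help to check the line size on the target. A minimal sketch, assuming Linux with glibc, where sysconf accepts the _SC_LEVEL1_DCACHE_LINESIZE extension. The helper name and the 64-byte fallback are assumptions.

```c
#include <unistd.h>

#define FALLBACK_LINE_SIZE 64  /* assumption used when the system won't say */

/* Best-effort L1 data cache line size in bytes. */
static long cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; may return 0 or -1 when the value is unknown */
    long size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (size > 0) return size;
#endif
    return FALLBACK_LINE_SIZE;
}
```

Padding in a struct is fixed at compile time, so this value is mostly useful as a startup sanity check that the constant baked into the build matches the hardware.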