How epoll Actually Works: Inside the Kernel's Event Loop

If you've written a high-concurrency server on Linux, you've used epoll. But "I/O multiplexing" is a vague phrase, and most explanations stop at the API. The interesting question is: how does epoll deliver O(1) readiness when select and poll were O(n)?

The answer lives in three kernel data structures.

The Problem with select and poll

Both copy a full file-descriptor set from userspace to kernelspace on every call, then walk every fd asking "is this one ready?".

select()  →  copy 10,000 fds  →  check 10,000 fds  →  return

For a server with 10k idle connections and 5 active, that's 10,000 checks to learn about 5 events. That's the O(n) tax.

The epoll Model

epoll flips the question. Instead of "ask each fd every time", the kernel pushes ready events into a list as they happen.

Three syscalls, three structures:

epoll_create1()  →  creates an eventpoll instance
epoll_ctl(ADD)   →  registers fd into the interest list (RB-tree)
epoll_wait()     →  pulls events from the ready list (linked list)

Inside the Kernel

            ┌──────────────────────────┐
            │   struct eventpoll       │
            ├──────────────────────────┤
            │  rbr     → RB-tree of    │
            │             all watched  │
            │             fds          │
            │  rdllist → ready list    │
            └──────────────────────────┘

When you call epoll_ctl(ADD, fd), the kernel:

  1. Inserts an epitem into a red-black tree keyed by (fd, file*). Lookup is O(log n).
  2. Hooks a callback (ep_poll_callback) into that fd's wait queue.

When the NIC receives data and the socket becomes readable:

  1. The socket's wait queue is woken.
  2. ep_poll_callback runs — it grabs the matching epitem and adds it to rdllist.
  3. Any thread blocked in epoll_wait is woken.

epoll_wait just splices rdllist to userspace. No scanning. Work scales with ready events, not registered events.

Edge-Triggered vs Level-Triggered

Two modes — they differ in when the kernel re-arms the notification.

c
ev.events = EPOLLIN | EPOLLET;   // edge-triggered
ev.events = EPOLLIN;             // level-triggered (default)

Level-triggered (LT): as long as the fd is readable, every epoll_wait reports it.
Edge-triggered (ET): reports only on the transition from not-ready to ready.

ET is faster (one wake per event) but unforgiving: if you don't drain the socket until EAGAIN, you'll never get notified again.

c
while (1) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == -1 && errno == EAGAIN) break;  // drained — done
    if (n <= 0) { close(fd); break; }
    handle(buf, n);
}

The Thundering Herd

Multiple threads on the same epoll fd, one event arrives, all wake up — but only one can handle it. The losers go back to sleep.

EPOLLEXCLUSIVE (Linux 4.5+) fixes this: only one waiter wakes per event.

c
ev.events = EPOLLIN | EPOLLEXCLUSIVE;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

Essential for multi-threaded accept loops on a single listening socket.

When epoll Isn't Enough

epoll only works on fds with an f_op->poll method — sockets, pipes, eventfd, timerfd. It does not work on regular files. That's why io_uring exists: a unified async interface for storage and network alike.

Takeaways

  • epoll is O(1) because the kernel maintains a ready list, not because it's smarter at scanning.
  • Use ET + nonblocking sockets + drain-loops for the highest throughput.
  • Use EPOLLEXCLUSIVE when sharing an epoll fd across threads.
  • Reach for io_uring when async I/O on regular files matters.