How epoll Actually Works: Inside the Kernel's Event Loop
If you've written a high-concurrency server on Linux, you've used epoll. But "I/O multiplexing" is a vague phrase, and most explanations stop at the API. The interesting question is: how does epoll deliver O(1) readiness when select and poll were O(n)?
The answer lives in three kernel data structures.
The Problem with select and poll
Both copy a full file-descriptor set from userspace to kernelspace on every call, then walk every fd asking "is this one ready?".
select() → copy 10,000 fds → check 10,000 fds → return
For a server with 10k idle connections and 5 active, that's 10,000 checks to learn about 5 events. That's the O(n) tax.
The epoll Model
epoll flips the question. Instead of "ask each fd every time", the kernel pushes ready events into a list as they happen.
Three syscalls, three structures:
epoll_create1() → creates an eventpoll instance
epoll_ctl(ADD) → registers fd into the interest list (RB-tree)
epoll_wait() → pulls events from the ready list (linked list)
Inside the Kernel
┌──────────────────────────┐
│ struct eventpoll │
├──────────────────────────┤
│ rbr → RB-tree of │
│ all watched │
│ fds │
│ rdllist → ready list │
└──────────────────────────┘
When you call epoll_ctl(ADD, fd), the kernel:
- Inserts an
epiteminto a red-black tree keyed by(fd, file*). Lookup is O(log n). - Hooks a callback (
ep_poll_callback) into that fd's wait queue.
When the NIC receives data and the socket becomes readable:
- The socket's wait queue is woken.
ep_poll_callbackruns — it grabs the matchingepitemand adds it tordllist.- Any thread blocked in
epoll_waitis woken.
epoll_wait just splices rdllist to userspace. No scanning. Work scales with ready events, not registered events.
Edge-Triggered vs Level-Triggered
Two modes — they differ in when the kernel re-arms the notification.
ev.events = EPOLLIN | EPOLLET; // edge-triggered
ev.events = EPOLLIN; // level-triggered (default)
Level-triggered (LT): as long as the fd is readable, every epoll_wait reports it.
Edge-triggered (ET): reports only on the transition from not-ready to ready.
ET is faster (one wake per event) but unforgiving: if you don't drain the socket until EAGAIN, you'll never get notified again.
while (1) {
ssize_t n = read(fd, buf, sizeof(buf));
if (n == -1 && errno == EAGAIN) break; // drained — done
if (n <= 0) { close(fd); break; }
handle(buf, n);
}
The Thundering Herd
Multiple threads on the same epoll fd, one event arrives, all wake up — but only one can handle it. The losers go back to sleep.
EPOLLEXCLUSIVE (Linux 4.5+) fixes this: only one waiter wakes per event.
ev.events = EPOLLIN | EPOLLEXCLUSIVE;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
Essential for multi-threaded accept loops on a single listening socket.
When epoll Isn't Enough
epoll only works on fds with an f_op->poll method — sockets, pipes, eventfd, timerfd. It does not work on regular files. That's why io_uring exists: a unified async interface for storage and network alike.
Takeaways
epollis O(1) because the kernel maintains a ready list, not because it's smarter at scanning.- Use ET + nonblocking sockets + drain-loops for the highest throughput.
- Use
EPOLLEXCLUSIVEwhen sharing an epoll fd across threads. - Reach for
io_uringwhen async I/O on regular files matters.