The Journey of a Received Packet: From NIC Interrupt to recv()
When recv() returns 1500 bytes, dozens of layers ran behind your back. The Linux network stack on the receive path is a relay race — hardware to driver to softirq to socket queue to your process — and every handoff is engineered to keep the cost per packet low.
Here's what happens between the wire and your buffer.
Stage 1: NIC → RAM (DMA)
The NIC has an RX ring — an array of descriptors, each pointing at a pre-allocated buffer in kernel memory. When a frame arrives:
- The NIC DMA-copies the frame into the next ring buffer.
- It updates the descriptor (length, status flags).
- It fires an MSI-X interrupt.
Wire ——> NIC —DMA—> RX ring buffer in RAM
↓
Interrupt to CPU
The CPU was never involved in copying bytes. That's the entire reason DMA exists.
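The ring mechanics above can be modeled in plain userspace C. Treat this as a sketch rather than driver code: `rx_desc`, `rx_ring`, `nic_dma_write` and `driver_reap` are hypothetical names, `memcpy` stands in for the DMA engine, and a single `done` flag stands in for the status bits that each real NIC defines in its own descriptor layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 4      /* real rings hold hundreds or thousands of slots */
#define BUF_SIZE  2048   /* room for one 1500-byte frame plus headers */

/* Hypothetical descriptor; real NICs define their own layout. */
struct rx_desc {
    uint8_t *buf;    /* pre-allocated buffer the "NIC" writes into */
    uint16_t len;    /* filled in on completion */
    bool done;       /* status flag: frame is ready for the driver */
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    uint8_t bufs[RING_SIZE][BUF_SIZE];
    unsigned head;   /* next slot the NIC fills */
    unsigned tail;   /* next slot the driver reaps */
};

static void rx_ring_init(struct rx_ring *r)
{
    memset(r, 0, sizeof(*r));
    for (unsigned i = 0; i < RING_SIZE; i++)
        r->desc[i].buf = r->bufs[i];
}

/* Stand-in for the NIC: memcpy plays the role of DMA.
 * Returns false (frame dropped) if the next slot is still owned by the driver. */
static bool nic_dma_write(struct rx_ring *r, const void *frame, uint16_t len)
{
    struct rx_desc *d = &r->desc[r->head];
    if (d->done || len > BUF_SIZE)
        return false;
    memcpy(d->buf, frame, len);
    d->len = len;
    d->done = true;
    r->head = (r->head + 1) % RING_SIZE;
    return true;
}

/* Stand-in for the driver: copy out the next completed frame and recycle
 * the slot. Returns the frame length, or -1 if nothing is ready. */
static int driver_reap(struct rx_ring *r, uint8_t *out, size_t out_len)
{
    struct rx_desc *d = &r->desc[r->tail];
    if (!d->done || d->len > out_len)
        return -1;
    int len = d->len;
    memcpy(out, d->buf, d->len);
    d->done = false;             /* hand the slot back to the NIC */
    r->tail = (r->tail + 1) % RING_SIZE;
    return len;
}
```

If the driver falls behind, `nic_dma_write` starts returning false once all slots are full, which is the model's version of an RX ring overrun.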
Stage 2: Hard IRQ → NAPI
The interrupt handler is intentionally tiny. Doing work in hard-IRQ context blocks every other interrupt on that CPU.
static irqreturn_t nic_irq(int irq, void *data)
{
    struct nic_priv *priv = data;   // hypothetical driver-private struct

    napi_schedule(&priv->napi);     // defer to softirq
    return IRQ_HANDLED;
}
This schedules the NAPI poll — a softirq that drains the RX ring with interrupts disabled on that NIC. NAPI lets the kernel switch from interrupt-per-packet to batching when load is high.
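The switch between interrupt mode and polling mode is easier to see in a toy model. The sketch below is userspace C with hypothetical names (`nic_model`, `frame_arrives`, `napi_poll_model`); it only imitates the pattern drivers follow, where a poll that uses less than its budget re-enables interrupts and a poll that exhausts its budget stays scheduled.

```c
#include <stdbool.h>

/* Hypothetical model of one NIC queue; not a kernel structure. */
struct nic_model {
    int pending;          /* frames waiting in the RX ring */
    bool irq_enabled;     /* may the NIC raise an interrupt? */
    int irq_count;        /* hard interrupts taken so far */
    bool poll_scheduled;  /* is a poll pass queued? */
};

/* A frame lands in the ring. Only the first frame after an idle period
 * costs a hard interrupt; the handler masks the NIC and schedules a poll. */
static void frame_arrives(struct nic_model *n)
{
    n->pending++;
    if (n->irq_enabled) {
        n->irq_enabled = false;
        n->irq_count++;
        n->poll_scheduled = true;
    }
}

/* One poll pass: process at most `budget` frames. If the ring was drained
 * before the budget ran out, go back to interrupt mode; otherwise stay
 * scheduled and keep polling. Returns the number of frames processed. */
static int napi_poll_model(struct nic_model *n, int budget)
{
    int done = 0;
    while (done < budget && n->pending > 0) {
        n->pending--;
        done++;
    }
    if (done < budget) {
        n->poll_scheduled = false;
        n->irq_enabled = true;
    }
    return done;
}
```

With 100 frames arriving in a burst and a budget of 64, the model takes one interrupt and two poll passes (64 frames, then 36) before interrupts come back on.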
Stage 3: Driver Poll → sk_buff
The driver's poll() walks the ring and wraps each frame in an sk_buff — the universal packet structure. From here on, every layer manipulates pointers into the same buffer.
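struct sk_buff {                    // simplified: the real struct has many more fields
    unsigned char *head, *data, *tail, *end;   // buffer start, packet start, packet end, buffer end
    struct net_device *dev;         // interface the frame arrived on
    __be16 protocol;                // EtherType, in network byte order
    ...
};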
The head/data/tail/end design lets each layer pull headers off the front and push them on the front (for TX) without ever copying.
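Here's a small userspace model of that pointer arithmetic. `mini_skb` and its helpers are hypothetical names loosely modeled on the kernel's `skb_put()`, `skb_pull()` and `skb_push()`; the point is only that stripping or adding a header moves `data` and never touches the bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of sk_buff: just the four buffer pointers. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
};

/* Wrap a raw buffer, reserving `headroom` bytes in front for later headers. */
static void mini_skb_init(struct mini_skb *skb, unsigned char *buf,
                          size_t size, size_t headroom)
{
    assert(headroom <= size);
    skb->head = buf;
    skb->data = skb->tail = buf + headroom;
    skb->end = buf + size;
}

/* Extend the packet by `len` bytes at the tail; returns where to write them. */
static unsigned char *mini_skb_put(struct mini_skb *skb, size_t len)
{
    unsigned char *old_tail = skb->tail;
    assert(len <= (size_t)(skb->end - skb->tail));
    skb->tail += len;
    return old_tail;
}

/* RX direction: strip a header by advancing data past it. */
static unsigned char *mini_skb_pull(struct mini_skb *skb, size_t len)
{
    assert(len <= (size_t)(skb->tail - skb->data));
    skb->data += len;
    return skb->data;
}

/* TX direction: prepend a header by moving data back into the headroom. */
static unsigned char *mini_skb_push(struct mini_skb *skb, size_t len)
{
    assert(len <= (size_t)(skb->data - skb->head));
    skb->data -= len;
    return skb->data;
}

/* Bytes currently visible to the layer holding the skb. */
static size_t mini_skb_len(const struct mini_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```

For a 54-byte frame, pulling 14 bytes (Ethernet) and then 20 bytes (an IPv4 header without options) leaves `data` pointing at the TCP header, 34 bytes into the same buffer.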
Stage 4: Protocol Demux
The driver hands the skb to netif_receive_skb(). The kernel checks skb->protocol and looks up the registered handler:
0x0800 → ip_rcv() (IPv4)
0x86DD → ipv6_rcv() (IPv6)
0x0806 → arp_rcv() (ARP)
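The lookup amounts to a table from EtherType to handler. The sketch below models it in userspace C with a fixed array; the kernel's own registration mechanism is more elaborate, and `rx_handler_t`, `proto_table`, `demux` and the `handle_*` functions are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type: returns the name of the kernel function
 * that the handler stands in for. */
typedef const char *(*rx_handler_t)(const uint8_t *payload, size_t len);

static const char *handle_ipv4(const uint8_t *p, size_t n) { (void)p; (void)n; return "ip_rcv"; }
static const char *handle_ipv6(const uint8_t *p, size_t n) { (void)p; (void)n; return "ipv6_rcv"; }
static const char *handle_arp(const uint8_t *p, size_t n)  { (void)p; (void)n; return "arp_rcv"; }

struct proto_entry {
    uint16_t ethertype;
    rx_handler_t handler;
};

static const struct proto_entry proto_table[] = {
    { 0x0800, handle_ipv4 },
    { 0x86DD, handle_ipv6 },
    { 0x0806, handle_arp  },
};

#define ETH_HDR_LEN 14   /* 6-byte dst MAC + 6-byte src MAC + 2-byte EtherType */

/* Read the EtherType (big-endian, bytes 12 and 13) and find its handler.
 * Returns NULL for short frames and unregistered protocols. */
static rx_handler_t demux(const uint8_t *frame, size_t len)
{
    if (len < ETH_HDR_LEN)
        return NULL;
    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    for (size_t i = 0; i < sizeof(proto_table) / sizeof(proto_table[0]); i++)
        if (proto_table[i].ethertype == ethertype)
            return proto_table[i].handler;
    return NULL;
}
```

A frame whose bytes 12 and 13 are `0x08 0x00` resolves to `handle_ipv4`; anything with an unknown EtherType gets `NULL`, which corresponds to the frame being dropped.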
Stage 5: IP → TCP
ip_rcv() validates the header (version, checksum, length), then routes the packet:
- Destination is local? → ip_local_deliver()
- Destination is elsewhere? → ip_forward()
For a local TCP packet, control reaches tcp_v4_rcv(), which performs a hash lookup on (saddr, sport, daddr, dport) to find the matching socket.
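A chained hash table keyed on the 4-tuple captures the idea. This is a userspace sketch under stated assumptions: `four_tuple`, `sock_entry`, `sock_insert` and `sock_lookup` are hypothetical names, the hash is a toy FNV-style mix rather than whatever the kernel uses, and there is no locking.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 256

/* Connection identity, in the order the text lists it. */
struct four_tuple {
    uint32_t saddr;
    uint16_t sport;
    uint32_t daddr;
    uint16_t dport;
};

/* Hypothetical stand-in for a socket: just an id plus the chain link. */
struct sock_entry {
    struct four_tuple key;
    int sock_id;
    struct sock_entry *next;
};

static struct sock_entry *buckets[NBUCKETS];

/* Toy hash (FNV-1a style mixing over the tuple fields). */
static uint32_t tuple_hash(const struct four_tuple *t)
{
    uint32_t h = 2166136261u;
    h = (h ^ t->saddr) * 16777619u;
    h = (h ^ t->daddr) * 16777619u;
    h = (h ^ (((uint32_t)t->sport << 16) | t->dport)) * 16777619u;
    return h;
}

/* Compare field by field; memcmp would also compare struct padding. */
static int tuple_eq(const struct four_tuple *a, const struct four_tuple *b)
{
    return a->saddr == b->saddr && a->sport == b->sport &&
           a->daddr == b->daddr && a->dport == b->dport;
}

static void sock_insert(struct sock_entry *e)
{
    uint32_t b = tuple_hash(&e->key) % NBUCKETS;
    e->next = buckets[b];
    buckets[b] = e;
}

/* Walk one bucket's chain; returns NULL if no socket owns this tuple. */
static struct sock_entry *sock_lookup(const struct four_tuple *t)
{
    for (struct sock_entry *e = buckets[tuple_hash(t) % NBUCKETS]; e; e = e->next)
        if (tuple_eq(&e->key, t))
            return e;
    return NULL;
}
```

Two connections from different clients to the same local port differ only in `saddr` (or `sport`), which is enough to land them on different entries. A segment whose tuple matches nothing returns `NULL`.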
Stage 6: Socket Receive Queue
tcp_v4_rcv() takes the socket lock and either:
- Fast path: connection is established, no out-of-order data, segment is in window. Append to sk_receive_queue.
- Slow path: handshake, FIN handling, retransmits, SACK → tcp_rcv_state_process().
Finally, it calls sk->sk_data_ready() — which wakes any thread blocked in recv() or registered with epoll.
Stage 7: recv() Returns
Your process resumes. recv() walks sk_receive_queue, copies bytes into your userspace buffer, and frees consumed skbs.
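You can watch the queue being consumed from userspace. The sketch below uses an `AF_UNIX` stream `socketpair()` as a stand-in for a TCP connection (an assumption made so the example needs no network setup; the byte-stream behaviour at the `recv()` boundary is the same), queues some bytes, and counts how many `recv()` calls it takes to drain them. `drain_in_chunks` is a hypothetical helper name.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Queue `total` bytes on one end of a socketpair, then drain the other end
 * with recv() calls of at most `chunk` bytes each.
 * Returns the number of recv() calls that returned data, or -1 on error. */
static int drain_in_chunks(size_t total, size_t chunk)
{
    int sv[2];
    char payload[4096], buf[4096];

    if (total == 0 || total > sizeof(payload) || chunk == 0 || chunk > sizeof(buf))
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    memset(payload, 'x', total);
    if (send(sv[0], payload, total, 0) != (ssize_t)total) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    close(sv[0]);   /* reader sees EOF once the queued bytes are consumed */

    int calls = 0;
    size_t got = 0;
    for (;;) {
        ssize_t n = recv(sv[1], buf, chunk, 0);
        if (n < 0) {
            close(sv[1]);
            return -1;
        }
        if (n == 0)
            break;   /* queue empty and peer closed */
        got += (size_t)n;
        calls++;
    }
    close(sv[1]);
    return got == total ? calls : -1;
}
```

Calling `drain_in_chunks(1500, 512)` takes three `recv()` calls: the data sits in the receive queue until the reader asks for it, and each call copies out only as much as the buffer allows.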
NIC → DMA → RX ring → IRQ → NAPI poll → ip_rcv → tcp_v4_rcv
→ sk_receive_queue → wake → recv() → your buffer
Where Latency Lives
A breakdown of where microseconds go on the receive path:
| Stage | Typical cost |
|---|---|
| DMA + IRQ | ~1µs |
| NAPI poll dispatch | ~0.5µs |
| IP + TCP processing | 1–3µs |
| Wakeup + context switch | 2–10µs |
| Userspace copy | ~0.5µs/KB |
The context switch dominates. That's why kernel-bypass approaches (DPDK, AF_XDP sockets) skip stages 4–7 entirely: user code reads packets straight from rings it shares with the NIC or driver.
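Adding the rows up makes the point concrete. The function below just sums the table's rough figures for a given payload size; the numbers are the table's, not measurements, and treating 1 KB as 1024 bytes is an assumption. `rx_cost` and `rx_path_cost` are hypothetical names.

```c
/* Estimated receive-path cost in microseconds, as a low/high range. */
struct rx_cost {
    double low_us;
    double high_us;
};

/* Sum the per-stage figures from the table for one packet. */
static struct rx_cost rx_path_cost(double payload_bytes)
{
    const double dma_irq  = 1.0;                 /* DMA + IRQ */
    const double napi     = 0.5;                 /* NAPI poll dispatch */
    const double proto_lo = 1.0, proto_hi = 3.0; /* IP + TCP processing */
    const double wake_lo  = 2.0, wake_hi = 10.0; /* wakeup + context switch */
    const double copy     = 0.5 * payload_bytes / 1024.0; /* ~0.5 us per KB */

    struct rx_cost c;
    c.low_us  = dma_irq + napi + proto_lo + wake_lo + copy;
    c.high_us = dma_irq + napi + proto_hi + wake_hi + copy;
    return c;
}
```

For a 1500-byte payload this gives roughly 5.2 to 15.2 µs, and at the high end the wakeup alone accounts for about two thirds of the total.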
RPS, RFS, and XDP
A few features change this picture:
- RPS (Receive Packet Steering): hash flows across CPUs in software, parallelizing protocol processing (stage 4 onward).
- RFS (Receive Flow Steering): steer each flow to the CPU running the consumer, keeping cache hot.
- XDP: a hook between stages 2 and 3. An eBPF program can drop, redirect, or modify packets before any skb is allocated. The cheapest possible processing.
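The core of RPS is a pure function from flow hash to CPU, which is why every packet of a flow lands on the same core. Here's a minimal sketch of that mapping. The multiply-and-shift trick mirrors the kernel's `reciprocal_scale()` helper as I understand it; `scale_hash` and `rps_pick_cpu` are hypothetical names, and the CPU list is whatever the administrator has allowed for that RX queue.

```c
#include <stdint.h>

/* Map a 32-bit hash onto [0, n) without a division:
 * treat hash / 2^32 as a fraction in [0, 1) and multiply by n. */
static uint32_t scale_hash(uint32_t hash, uint32_t n)
{
    return (uint32_t)(((uint64_t)hash * n) >> 32);
}

/* Pick the CPU that should process this flow.
 * `cpus` lists the CPUs RPS may use; `ncpus` must be at least 1. */
static int rps_pick_cpu(uint32_t flow_hash, const int *cpus, uint32_t ncpus)
{
    return cpus[scale_hash(flow_hash, ncpus)];
}
```

With the CPU list `{2, 3, 6, 7}`, a flow hash of `0x80000000` maps to index 2 and therefore CPU 6, every time. RFS goes one step further and overrides this choice with the CPU where the consuming thread last ran.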
Takeaways
- Every received packet crosses hardware, two interrupt contexts, and a wakeup before recv() returns.
- NAPI exists because interrupt-per-packet doesn't scale past a few hundred kpps.
- The skb is the universal currency — zero-copy across layers, by design.
- If you want lower latency than this gives you, you skip the kernel.