The Journey of a Received Packet: From NIC Interrupt to recv()

When recv() returns 1500 bytes, several kernel layers have already run on your behalf. The Linux receive path is a relay race (hardware to driver to softirq to socket queue to your process), and every handoff is engineered to keep the per-packet cost low.

Here's what happens between the wire and your buffer.

Stage 1: NIC → RAM (DMA)

The NIC has an RX ring — an array of descriptors, each pointing at a pre-allocated buffer in kernel memory. When a frame arrives:

  1. The NIC DMA-copies the frame into the next ring buffer.
  2. It updates the descriptor (length, status flags).
  3. It fires an MSI-X interrupt.

  Wire ——> NIC —DMA—> RX ring buffer in RAM
                              ↓
                       Interrupt to CPU

The CPU was never involved in copying bytes. That's the entire reason DMA exists.
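A minimal userspace sketch of the ring described above may make the handoff concrete. Everything here is an assumption made for illustration: the descriptor layout, the names rx_desc, rx_ring, nic_receive and driver_next_frame, and the sizes. Real NICs define their own descriptor formats, and memcpy only stands in for the DMA write.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 4      /* illustrative; real rings are much larger */
#define BUF_SIZE  2048

/* Hypothetical descriptor layout; every NIC defines its own. */
struct rx_desc {
    uint8_t *buf;        /* pre-allocated buffer the NIC may write into */
    uint16_t len;        /* filled in by the NIC: frame length */
    uint8_t  done;       /* filled in by the NIC: 1 = frame is ready */
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    uint8_t bufs[RING_SIZE][BUF_SIZE];
    unsigned nic_next;   /* next slot the NIC will fill */
    unsigned drv_next;   /* next slot the driver will read */
};

static void ring_init(struct rx_ring *r) {
    memset(r, 0, sizeof(*r));
    for (unsigned i = 0; i < RING_SIZE; i++)
        r->desc[i].buf = r->bufs[i];
}

/* Stand-in for the hardware: copy the frame in and mark the descriptor.
 * Returns 0 on success, -1 if the slot is still in use or the frame is
 * too large (the frame is dropped). */
static int nic_receive(struct rx_ring *r, const void *frame, uint16_t len) {
    struct rx_desc *d = &r->desc[r->nic_next];
    if (d->done || len > BUF_SIZE)
        return -1;
    memcpy(d->buf, frame, len);           /* the "DMA" write */
    d->len = len;
    d->done = 1;
    r->nic_next = (r->nic_next + 1) % RING_SIZE;
    return 0;
}

/* Driver side: take the next completed frame, or return NULL if none.
 * The sketch hands the slot back immediately; a real driver would first
 * attach a fresh buffer to the descriptor. */
static const uint8_t *driver_next_frame(struct rx_ring *r, uint16_t *len) {
    struct rx_desc *d = &r->desc[r->drv_next];
    assert(len != NULL);
    if (!d->done)
        return NULL;
    *len = d->len;
    d->done = 0;
    r->drv_next = (r->drv_next + 1) % RING_SIZE;
    return d->buf;
}
```

When the driver falls behind and every slot is still marked done, nic_receive() fails. That models a NIC dropping frames once the ring is full.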

Stage 2: Hard IRQ → NAPI

The interrupt handler is intentionally tiny. Doing work in hard-IRQ context blocks every other interrupt on that CPU.

static irqreturn_t nic_irq(int irq, void *data) {
    struct nic_priv *priv = data;     // hypothetical per-device state passed to request_irq()
    napi_schedule(&priv->napi);       // defer the real work to softirq
    return IRQ_HANDLED;
}

This schedules the NAPI poll, which runs in softirq context and drains the RX ring while that NIC's RX interrupt stays masked. Under high load, NAPI lets the kernel switch from one interrupt per packet to polling in batches.
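The budgeted polling that NAPI relies on can be modelled in a few lines of ordinary C. This is a simulation under stated assumptions, not driver code: fake_nic, fake_irq and fake_poll are hypothetical names, and a counter stands in for the RX ring. In a real driver the corresponding calls are napi_schedule() and napi_complete_done().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state for the sketch. */
struct fake_nic {
    int  pending;         /* frames waiting in the RX ring */
    bool irq_enabled;     /* is the RX interrupt unmasked? */
    bool poll_scheduled;  /* is a poll pass queued? */
};

/* Hard-IRQ side: mask the interrupt and defer the work. */
static void fake_irq(struct fake_nic *nic) {
    nic->irq_enabled = false;
    nic->poll_scheduled = true;
}

/* Softirq side: process at most `budget` frames per pass and
 * return how many were handled. */
static int fake_poll(struct fake_nic *nic, int budget) {
    int done = 0;
    assert(budget > 0);
    while (done < budget && nic->pending > 0) {
        nic->pending--;   /* stand-in for "build an skb and pass it up" */
        done++;
    }
    if (done < budget) {
        /* Ring drained: stop polling and return to interrupt mode. */
        nic->poll_scheduled = false;
        nic->irq_enabled = true;
    }
    /* Otherwise stay scheduled so the next pass continues the batch. */
    return done;
}
```

With 100 pending frames and a budget of 64, the first pass handles 64 frames and stays in polling mode. The second handles the remaining 36 and re-enables the interrupt.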

Stage 3: Driver Poll → sk_buff

The driver's poll() walks the ring and wraps each frame in an sk_buff — the universal packet structure. From here on, every layer manipulates pointers into the same buffer.

struct sk_buff {
    unsigned char *head, *data, *tail, *end;
    struct net_device *dev;
    __be16 protocol;
    ...
};

The head/data/tail/end design lets each layer pull headers off the front (RX) or push them onto the front (TX) by moving the data pointer, without copying the payload.
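The pointer arithmetic is easier to see in a stripped-down model. The sketch below keeps only the four pointers; mini_skb, mini_pull and mini_push are hypothetical names loosely modelled on the kernel's skb_pull() and skb_push(), and the error handling is simplified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A stand-in for sk_buff that keeps only the four pointers. */
struct mini_skb {
    uint8_t *head, *data, *tail, *end;
};

/* Point the skb at `buf`, leaving `headroom` free bytes before `len`
 * bytes of packet data. */
static void mini_init(struct mini_skb *skb, uint8_t *buf, size_t size,
                      size_t headroom, size_t len) {
    assert(headroom + len <= size);
    skb->head = buf;
    skb->data = buf + headroom;
    skb->tail = skb->data + len;
    skb->end  = buf + size;
}

static size_t mini_len(const struct mini_skb *skb) {
    return (size_t)(skb->tail - skb->data);
}

/* RX: strip `n` header bytes from the front. Only the pointer moves. */
static uint8_t *mini_pull(struct mini_skb *skb, size_t n) {
    if (n > mini_len(skb))
        return NULL;
    skb->data += n;
    return skb->data;
}

/* TX: claim `n` bytes of headroom for a new header. */
static uint8_t *mini_push(struct mini_skb *skb, size_t n) {
    if (n > (size_t)(skb->data - skb->head))
        return NULL;
    skb->data -= n;
    return skb->data;
}
```

Pulling a 14-byte Ethernet header and then a 20-byte IPv4 header from a 100-byte frame leaves data at offset 34 and 66 bytes of payload. No byte in the buffer is moved.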

Stage 4: Protocol Demux

The driver hands the skb to netif_receive_skb(). The kernel checks skb->protocol and looks up the registered handler:

0x0800 → ip_rcv()        (IPv4)
0x86DD → ipv6_rcv()      (IPv6)
0x0806 → arp_rcv()       (ARP)
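The lookup above amounts to a table keyed by EtherType. A minimal sketch follows, assuming host byte order for readability (skb->protocol is stored in network byte order) and using placeholder handlers that only return the name of the kernel function they stand for. The names ptype_entry, ptype_table and demux are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef const char *(*proto_handler)(void);

/* Placeholder handlers: each returns the name of the kernel entry point. */
static const char *handle_ipv4(void) { return "ip_rcv"; }
static const char *handle_ipv6(void) { return "ipv6_rcv"; }
static const char *handle_arp(void)  { return "arp_rcv"; }

struct ptype_entry {
    uint16_t      ethertype;
    proto_handler fn;
};

static const struct ptype_entry ptype_table[] = {
    { 0x0800, handle_ipv4 },
    { 0x86DD, handle_ipv6 },
    { 0x0806, handle_arp  },
};

/* Return the handler registered for `ethertype`, or NULL if there is
 * none, in which case the frame would be dropped. */
static proto_handler demux(uint16_t ethertype) {
    size_t n = sizeof(ptype_table) / sizeof(ptype_table[0]);
    for (size_t i = 0; i < n; i++)
        if (ptype_table[i].ethertype == ethertype)
            return ptype_table[i].fn;
    return NULL;
}
```

The kernel keeps its handlers in a hash table rather than scanning a linear array, but the principle is the same: the protocol field selects the next function to call.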

Stage 5: IP → TCP

ip_rcv() validates the header (version, checksum, length), then routes the packet:

  • Destination is local? → ip_local_deliver()
  • Destination is elsewhere? → ip_forward()
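The header checks mentioned above (version, length, checksum) can be sketched in portable C. This is a simplified illustration that follows the IPv4 header layout and the one's-complement checksum of RFC 1071, not the kernel's optimized code. The function names ip_checksum and ipv4_header_ok are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over the header, read as 16-bit big-endian words.
 * `len` must be even, which holds for any IPv4 header length. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len) {
    uint32_t sum = 0;
    assert(len % 2 == 0);
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                       /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns 1 if the header passes the basic checks, 0 if the packet
 * should be dropped. */
static int ipv4_header_ok(const uint8_t *pkt, size_t pkt_len) {
    if (pkt_len < 20)
        return 0;
    unsigned version = pkt[0] >> 4;
    size_t ihl = (size_t)(pkt[0] & 0x0F) * 4;   /* header length in bytes */
    if (version != 4 || ihl < 20 || ihl > pkt_len)
        return 0;
    size_t tot_len = (size_t)((pkt[2] << 8) | pkt[3]);
    if (tot_len < ihl || tot_len > pkt_len)
        return 0;
    /* A valid header, checksum field included, sums to zero. */
    return ip_checksum(pkt, ihl) == 0;
}
```

Because the checksum covers only the header, corrupting a header byte such as the TTL makes ipv4_header_ok() return 0, while payload corruption is left for TCP or UDP to catch.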

For a local TCP packet, control reaches tcp_v4_rcv(), which performs a hash lookup on (saddr, sport, daddr, dport) to find the matching socket.
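A toy version of that 4-tuple lookup is shown below. The hash function, the bucket count, and the names four_tuple, mini_sock, sock_table, table_insert and table_lookup are all assumptions made for the sketch. The kernel uses a keyed hash and a much larger table, and also has to handle listening sockets and namespaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 16

struct four_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct mini_sock {
    struct four_tuple key;
    int id;                     /* illustrative payload */
    struct mini_sock *next;     /* chain within one bucket */
};

struct sock_table {
    struct mini_sock *bucket[NBUCKETS];
};

/* Toy mixing function, chosen for the sketch only. */
static unsigned tuple_hash(const struct four_tuple *t) {
    uint32_t h = t->saddr ^ t->daddr ^ (((uint32_t)t->sport << 16) | t->dport);
    h ^= h >> 16;
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return h % NBUCKETS;
}

static int tuple_eq(const struct four_tuple *a, const struct four_tuple *b) {
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

static void table_insert(struct sock_table *tb, struct mini_sock *sk) {
    unsigned b = tuple_hash(&sk->key);
    assert(b < NBUCKETS);
    sk->next = tb->bucket[b];
    tb->bucket[b] = sk;
}

/* Walk one bucket's chain; return the matching socket or NULL. */
static struct mini_sock *table_lookup(const struct sock_table *tb,
                                      const struct four_tuple *t) {
    for (struct mini_sock *sk = tb->bucket[tuple_hash(t)]; sk; sk = sk->next)
        if (tuple_eq(&sk->key, t))
            return sk;
    return NULL;
}
```

Two connections that differ only in source port are distinct keys, so each segment reaches its own socket even when both flows come from the same client address.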

Stage 6: Socket Receive Queue

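tcp_v4_rcv() takes the socket lock and hands the segment to one of two paths:

  • Fast path: connection is established, no out-of-order data, segment is in window. tcp_rcv_established() appends it to sk_receive_queue.
  • Slow path: out-of-order data, retransmits, and SACK take the slower branch of tcp_rcv_established(); handshake and FIN handling in other connection states go through tcp_rcv_state_process().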

Finally, it calls sk->sk_data_ready() — which wakes any thread blocked in recv() or registered with epoll.

Stage 7: recv() Returns

Your process resumes. recv() walks sk_receive_queue, copies bytes into your userspace buffer, and frees consumed skbs.
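The copy-and-free loop can be sketched in userspace with a linked list of segments. This is an illustration under simplifying assumptions: segment, rx_queue, queue_append and queue_recv are hypothetical names, malloc and free stand in for skb allocation, and there is no locking, blocking, or flag handling such as the real tcp_recvmsg() has to deal with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct segment {
    struct segment *next;
    size_t len;             /* bytes of payload in this segment */
    size_t off;             /* bytes already handed to the reader */
    unsigned char data[];
};

struct rx_queue {
    struct segment *head, *tail;
};

/* Kernel side: queue one received segment. Returns 0, or -1 on
 * allocation failure. */
static int queue_append(struct rx_queue *q, const void *data, size_t len) {
    struct segment *s = malloc(sizeof(*s) + len);
    if (!s)
        return -1;
    s->next = NULL;
    s->len = len;
    s->off = 0;
    memcpy(s->data, data, len);
    if (q->tail)
        q->tail->next = s;
    else
        q->head = s;
    q->tail = s;
    return 0;
}

/* Reader side: copy up to `want` bytes into `buf`, freeing every
 * segment that has been fully consumed. Returns the bytes copied. */
static size_t queue_recv(struct rx_queue *q, void *buf, size_t want) {
    size_t copied = 0;
    assert(buf != NULL);
    while (copied < want && q->head) {
        struct segment *s = q->head;
        size_t n = s->len - s->off;
        if (n > want - copied)
            n = want - copied;
        memcpy((unsigned char *)buf + copied, s->data + s->off, n);
        s->off += n;
        copied += n;
        if (s->off == s->len) {         /* segment fully consumed */
            q->head = s->next;
            if (!q->head)
                q->tail = NULL;
            free(s);
        }
    }
    return copied;
}
```

Segment boundaries are invisible to the caller. With "hello" and "world" queued, a 3-byte read returns "hel", and a following 16-byte read returns the remaining 7 bytes in one call.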

NIC → DMA → RX ring → IRQ → NAPI poll → ip_rcv → tcp_v4_rcv
  → sk_receive_queue → wake → recv() → your buffer

Where Latency Lives

A breakdown of where microseconds go on the receive path:

  Stage                      Typical cost
  DMA + IRQ                  ~1µs
  NAPI poll dispatch         ~0.5µs
  IP + TCP processing        1–3µs
  Wakeup + context switch    2–10µs
  Userspace copy             ~0.5µs/KB

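The context switch dominates. That's why kernel-bypass frameworks such as DPDK skip stages 2–7 entirely: a userspace poll-mode driver reads directly from the RX ring. XDP is not a full bypass, since it runs inside the kernel driver's poll loop, but AF_XDP sockets built on it can deliver frames to user space without the protocol stages.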
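Summing the table for one segment shows how large the wakeup's share is. The helper below only adds up the figures quoted above. It assumes 1 KB = 1000 bytes and a copy cost that scales linearly with size, and the names range, rx_latency_us and wakeup_share are made up for the example.

```c
#include <assert.h>

struct range {
    double lo, hi;
};

/* Total receive-path cost in microseconds for a payload of `bytes`
 * bytes, using the low and high ends of each row in the table. */
static struct range rx_latency_us(double bytes) {
    assert(bytes >= 0.0);
    double copy = 0.5 * bytes / 1000.0;     /* ~0.5 us per KB */
    struct range r;
    r.lo = 1.0 + 0.5 + 1.0 + 2.0 + copy;    /* DMA+IRQ, NAPI, IP+TCP, wakeup */
    r.hi = 1.0 + 0.5 + 3.0 + 10.0 + copy;
    return r;
}

/* Fraction of the total spent in wakeup + context switch. */
static struct range wakeup_share(double bytes) {
    struct range t = rx_latency_us(bytes);
    struct range s = { 2.0 / t.lo, 10.0 / t.hi };
    return s;
}
```

For a 1500-byte segment this gives a total of about 5.25 to 15.25 µs, of which the wakeup accounts for roughly 38% at the low end and 66% at the high end.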

RPS, RFS, and XDP

A few features change this picture:

  • RPS (Receive Packet Steering): hash flows across CPUs in software, parallelizing protocol processing (stage 4 onward).
  • RFS (Receive Flow Steering): steer each flow to the CPU running the consumer, keeping cache hot.
  • XDP: a hook between stages 2 and 3. An eBPF program can drop, redirect, or modify packets before any skb is allocated, which makes it the cheapest place to process a packet.
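The steering decision in RPS comes down to mapping a flow hash onto a list of CPUs. The sketch below shows one way to do that. The flow hash is a toy (the kernel uses a keyed hash or the hash supplied by the NIC), scale_hash is modelled on the kernel's reciprocal_scale() helper, and rps_pick_cpu and its arguments are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Map a 32-bit hash onto [0, n) with a multiply and shift instead of
 * a modulo. */
static uint32_t scale_hash(uint32_t hash, uint32_t n) {
    return (uint32_t)(((uint64_t)hash * n) >> 32);
}

/* Toy flow hash over the 4-tuple, for illustration only. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport) {
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    h *= 0x7feb352du;
    h ^= h >> 15;
    h *= 0x846ca68bu;
    h ^= h >> 16;
    return h;
}

/* Pick the CPU that will run protocol processing for this flow.
 * `cpus` lists the CPUs allowed to receive steered packets. */
static unsigned rps_pick_cpu(const unsigned *cpus, uint32_t ncpus,
                             uint32_t hash) {
    assert(ncpus > 0);
    return cpus[scale_hash(hash, ncpus)];
}
```

Because the choice depends only on the flow's hash, every packet of a given connection lands on the same CPU. That keeps segments in order while still spreading different flows across cores.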

Takeaways

  • Every received packet crosses hardware, two interrupt contexts, and a wakeup before recv() returns.
  • NAPI exists because interrupt-per-packet doesn't scale past a few hundred kpps.
  • The skb is the universal currency — zero-copy across layers, by design.
  • If you want lower latency than this gives you, you skip the kernel.