Bypassing the Kernel: How DPDK Hits 10 Million Packets per Second

A modern Linux network stack tops out around 1–2 million packets per second per core. Past that, you're paying for things you don't need: interrupts, context switches, sk_buff allocations, layer demux. DPDK — the Data Plane Development Kit — throws all of that away.

The result is 10–20 Mpps per core. Here's why.

What Kernel-Bypass Means

Three things change:

The NIC is unbound from the kernel driver and bound to a userspace driver (vfio-pci or igb_uio).
The NIC DMAs packets directly into a userspace memory pool, with no kernel involvement on the data path.
The application polls the RX ring instead of waiting for interrupts.

Kernel path:    NIC → IRQ → NAPI → sk_buff → IP → TCP → socket → recv()
DPDK path:      NIC → RX ring (mmap'd) → your code

Every hop the kernel does is a hop you skip.
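
To make the polling model concrete, here is a minimal sketch of a receive ring in plain C. It is a toy, single-threaded model, not DPDK's or any NIC's actual descriptor format: the names rx_ring, ring_enqueue, and ring_rx_burst are made up for illustration, and a real ring shared with hardware or another thread would also need memory barriers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u   /* power of two, so masking wraps the index */

/* Toy stand-in for an RX ring: the "NIC" advances head, the app advances tail. */
struct rx_ring {
    void    *slots[RING_SIZE];
    uint32_t head;   /* next slot the producer will fill */
    uint32_t tail;   /* next slot the consumer will read */
};

/* Producer side (done by DMA hardware in real life). Returns 0 if the ring is full. */
int ring_enqueue(struct rx_ring *r, void *pkt) {
    assert(r != NULL);
    if (r->head - r->tail == RING_SIZE)
        return 0;
    r->slots[r->head & (RING_SIZE - 1)] = pkt;
    r->head++;
    return 1;
}

/* Consumer side: take up to max packets that are already there, never block. */
uint16_t ring_rx_burst(struct rx_ring *r, void **pkts, uint16_t max) {
    assert(r != NULL && pkts != NULL);
    uint16_t n = 0;
    while (n < max && r->tail != r->head) {
        pkts[n++] = r->slots[r->tail & (RING_SIZE - 1)];
        r->tail++;
    }
    return n;   /* 0 means "nothing yet": the caller simply polls again */
}
```

The consumer never sleeps and never makes a system call; an empty ring returns 0 and the caller loops, which is the whole trade: burned CPU cycles in exchange for no interrupt or wakeup latency.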

The Four Things DPDK Avoids

1. System calls. Polling a memory-mapped ring means no read/recv/epoll_wait on the data path. Each syscall costs on the order of 100 ns for the mode switch alone.

2. Per-packet allocations. Every kernel skb costs an allocation and a free. DPDK uses rte_mempool: a pool of pre-allocated rte_mbuf structures with a per-core cache in front, so the common-case get and put touch no shared state.

3. Cache and TLB misses. Hugepages (2MB or 1GB) cut TLB misses dramatically, because far fewer page-table entries cover the same packet memory. NUMA awareness keeps packets, descriptors, and code on the same socket.

4. Locks. Each NIC queue is owned by exactly one core. No contention, no atomics on the hot path.
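
The allocation and locking points can be illustrated with a small fixed-size buffer pool. This is a sketch of the idea only, assuming a pool that is touched by a single core; it is not how rte_mempool is implemented, and pkt_pool, pool_init, pool_get, pool_put, and both size constants are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_OBJS 4      /* buffers carved out once, at startup (illustrative) */
#define OBJ_SIZE  2048   /* bytes per buffer (illustrative, not a DPDK constant) */

/* Hypothetical single-core pool: a LIFO stack of free buffers, no locks, no malloc. */
struct pkt_pool {
    unsigned char storage[POOL_OBJS][OBJ_SIZE];
    void         *free_list[POOL_OBJS];
    size_t        free_count;
};

/* Push every buffer onto the free stack once; after this, nothing allocates. */
void pool_init(struct pkt_pool *p) {
    assert(p != NULL);
    for (size_t i = 0; i < POOL_OBJS; i++)
        p->free_list[i] = p->storage[i];
    p->free_count = POOL_OBJS;
}

/* O(1) pop. Returns NULL when exhausted instead of falling back to an allocator. */
void *pool_get(struct pkt_pool *p) {
    assert(p != NULL);
    return p->free_count ? p->free_list[--p->free_count] : NULL;
}

/* O(1) push. The most recently freed buffer is handed out first, while still cache-hot. */
void pool_put(struct pkt_pool *p, void *obj) {
    assert(p != NULL && p->free_count < POOL_OBJS);
    p->free_list[p->free_count++] = obj;
}
```

Because only one core ever touches the pool, get and put are a pointer move and a counter update. A pool shared between cores would need a synchronized backing store behind this per-core fast path.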

Minimum DPDK App

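#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define NUM_MBUFS 8191
#define MBUF_CACHE_SIZE 250
#define BURST_SIZE 32

int main(int argc, char **argv) {
    // EAL parses its own arguments (cores, hugepages, devices) and maps memory.
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    // Pre-allocate every mbuf up front, on the NUMA node of the calling core.
    struct rte_mempool *mp = rte_pktmbuf_pool_create("MBUF_POOL",
        NUM_MBUFS, MBUF_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (mp == NULL)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

    // Port 0, one RX queue and one TX queue. A zeroed config selects the
    // defaults; a NULL config pointer is not accepted.
    struct rte_eth_conf port_conf = {0};
    if (rte_eth_dev_configure(0, 1, 1, &port_conf) < 0 ||
        rte_eth_rx_queue_setup(0, 0, 1024, rte_socket_id(), NULL, mp) < 0 ||
        rte_eth_tx_queue_setup(0, 0, 1024, rte_socket_id(), NULL) < 0 ||
        rte_eth_dev_start(0) < 0)
        rte_exit(EXIT_FAILURE, "port 0 setup failed\n");

    struct rte_mbuf *pkts[BURST_SIZE];
    while (1) {
        // Non-blocking: returns 0..BURST_SIZE packets that are ready right now.
        uint16_t n = rte_eth_rx_burst(0, 0, pkts, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++) {
            // Packet bytes start at rte_pktmbuf_mtod(pkts[i], void *);
            // buf_addr is the start of the buffer, before the headroom.
            rte_pktmbuf_free(pkts[i]);
        }
    }
}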

rte_eth_rx_burst never blocks: it returns however many packets are ready, from 0 up to BURST_SIZE (32 here). Batching is the secret: the fixed per-call cost is amortized across every packet in the burst.
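
What "process the packet" might look like, as a minimal sketch: a classifier that reads the EtherType out of a raw Ethernet frame. It is written against a plain byte buffer so it stands alone; in the receive loop above, the pointer and length would come from rte_pktmbuf_mtod and rte_pktmbuf_data_len. The function names frame_ethertype and is_ipv4 are hypothetical, and VLAN tags are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN    14       /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define ETHERTYPE_IPV4 0x0800
#define ETHERTYPE_ARP  0x0806

/* Returns the EtherType in host byte order, or 0 if the frame is too short. */
uint16_t frame_ethertype(const uint8_t *frame, size_t len) {
    assert(frame != NULL);
    if (len < ETH_HDR_LEN)
        return 0;
    /* Network byte order is big-endian: high byte first. */
    return (uint16_t)((frame[12] << 8) | frame[13]);
}

/* Hypothetical classifier: 1 for untagged IPv4 frames, 0 for everything else. */
int is_ipv4(const uint8_t *frame, size_t len) {
    return frame_ethertype(frame, len) == ETHERTYPE_IPV4;
}
```

Reading the two bytes individually avoids both unaligned loads and an explicit byte swap, which keeps the sketch portable across architectures.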

Why Batching Wins

A burst of 32 packets pays the same fixed per-call cost as a burst of one, so the overhead per packet shrinks as the burst grows. Polling, ring index updates, prefetches: they all amortize.

  per-call overhead: ~50ns
  per-packet work:   ~10ns (just copy + classify)

  1 packet:   60ns total → 16.7 Mpps
  32 packets: 50 + 320 = 370ns → 86 Mpps theoretical (per core)

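Real numbers depend on the NIC, driver, and workload, but the shape of the curve holds: throughput climbs steeply with burst size, then flattens as per-packet work starts to dominate.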
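
The arithmetic above is a linear cost model, and it is easy to play with in code. The sketch below assumes exactly that model (a fixed per-call cost plus a per-packet cost times the burst size); burst_ns and burst_mpps are illustrative names, and the inputs are the article's example figures, not measurements.

```c
#include <assert.h>

/* Time for one burst call under the linear cost model from the text. */
double burst_ns(double per_call_ns, double per_pkt_ns, unsigned burst) {
    return per_call_ns + per_pkt_ns * burst;
}

/* Throughput in millions of packets per second for a given burst size. */
double burst_mpps(double per_call_ns, double per_pkt_ns, unsigned burst) {
    assert(burst > 0);
    /* packets per ns, times 1e9 ns/s, divided by 1e6 = packets per ns * 1000 */
    return burst / burst_ns(per_call_ns, per_pkt_ns, burst) * 1000.0;
}
```

With 50 ns per call and 10 ns per packet, burst_mpps gives about 16.7 for a burst of 1 and about 86.5 for a burst of 32. Under this model the ceiling is 1000 / per_pkt_ns = 100 Mpps, which no burst size can exceed, so most of the gain is already captured by the time the burst reaches 32.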

CPU Pinning and Isolation

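A polling core spins at 100% CPU by design, whether or not traffic is arriving, so anything else scheduled on it steals cycles from the RX loop. To keep the scheduler off those cores, add this to the kernel command line: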

isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7

Combined with taskset -c 2 ./app (or DPDK's own -l core-list option), this gives a core that runs almost nothing but DPDK: no other tasks migrated onto it, and far fewer timer ticks and RCU callbacks.
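
For a thread you create yourself, the same pinning can be done from inside the program. The sketch below is Linux- and glibc-specific and uses sched_setaffinity, the system call that taskset relies on; the helper names pin_to_cpu, first_allowed_cpu, and allowed_cpu_count are hypothetical. DPDK's EAL already pins the lcore threads it creates, so treat this as an illustration of the mechanism.

```c
#define _GNU_SOURCE   /* exposes cpu_set_t and the CPU_* macros in glibc */
#include <assert.h>
#include <sched.h>

/* Returns the lowest CPU id the calling thread may run on, or -1 on error. */
int first_allowed_cpu(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Number of CPUs in the calling thread's affinity mask, or -1 on error. */
int allowed_cpu_count(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Pins the calling thread to a single CPU. Returns 0 on success, -1 on failure. */
int pin_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = calling thread */
}
```

Calling pin_to_cpu(2) on a machine booted with the parameters above would place the thread on an isolated core. Pinning alone only restricts where this thread runs; it is isolcpus that keeps other tasks off the core.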

DPDK vs XDP

XDP is the kernel's answer to "give me kernel-bypass speed without leaving the kernel."

Property           DPDK                        XDP
Throughput         Highest (10–20 Mpps/core)   Very high (5–10 Mpps/core)
Programming model  C, full control             eBPF, restricted
Operational cost   Dedicated cores             Shares with kernel stack
TCP stack          Bring your own              Falls back to kernel for non-XDP traffic
Hot reload         Restart                     Replace eBPF program live

XDP wins for filtering, DDoS mitigation, simple L4 redirection. DPDK wins when you're building a switch, router, or full userspace TCP/IP stack.

What You Give Up

No tcpdump (use dpdk-pdump). No iptables. No kernel TCP: you bring a userspace stack such as F-Stack or VPP. The NIC is invisible to ip and ifconfig while bound to vfio-pci. Operationally, a DPDK box needs its own tooling for capture, filtering, and monitoring.

When to Reach for It

You're building a load balancer at 40 Gbps and up (DPVS is a DPDK example; Katran takes the XDP route instead, and HAProxy runs on the kernel stack).
You're an NFV element (firewall, NAT, proxy) on commodity x86.
You're at line-rate on 100GbE and the kernel is the bottleneck.
Microsecond-level latency matters: HFT, cell-tower telemetry, packet capture.

If you're building a regular web service, the kernel stack is fine. DPDK exists for when it isn't.

Takeaways

The kernel network stack is a generalist. DPDK is a specialist.
Speedups come from removing work, not doing it faster: no syscalls, no allocations, no demux, no locks.
Polling cores must be isolated and dedicated; no half-measures.
The tradeoff is ecosystem: every kernel network feature is something you reimplement.