The Kernel is the Wall: A Guide to Solarflare Onload, EF_VI, and Kernel Bypass
Introduction: The Microsecond Imperative
The Linux kernel is a marvel of engineering. It runs the internet. It runs supercomputers. It runs your toaster.
But for High-Frequency Trading (HFT)? It’s a bottleneck.
In the rarefied atmosphere of ultra-low latency systems, the speed of light is a hard constraint, but the OS kernel is a soft constraint that has hardened into a wall. If you are routing your market data through the standard Linux networking stack, you are already losing.
This guide is about Kernel Bypass. Specifically, the ecosystem built by Solarflare (now AMD): OpenOnload, TCPDirect, and the EtherFabric Virtual Interface (EF_VI).
The Linux Networking Stack: An Autopsy of Latency
To understand why we bypass the kernel, you have to understand why the kernel is slow.
The standard stack is designed for throughput and fairness. It wants to make sure your YouTube video streams smoothly while you download a Steam game. It does not care that you need to send an order in 3 microseconds.
The Interrupt-Driven Nightmare
In a standard architecture, a packet arrival triggers an interrupt (IRQ). The CPU pauses what it’s doing, saves state, switches to Kernel Mode (Ring 0), and runs the interrupt handler.
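This transition is expensive. Strictly speaking it is a mode switch rather than a full process context switch, but the damage is similar: it pollutes your L1/L2 cache and it ruins your branch prediction.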
Then comes the SoftIRQ, the protocol processing (IP, TCP), the socket layer queuing, and finally the scheduler waking up your application.
By the time your app actually sees the data, microseconds have evaporated. And worse? The jitter is unpredictable.
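To get a rough feel for that path, here is a minimal sketch that times a UDP datagram through ordinary kernel sockets over loopback. It is in Python purely for brevity (a real feed handler would be C or C++), and the function name is invented for this example. Loopback never touches the NIC or its hardware interrupt, so treat the numbers as an indication of syscall and scheduler cost, not as a wire-latency benchmark.

```python
import socket
import time


def loopback_udp_latency_ns(samples: int = 1000) -> list:
    """Time send->recv of small UDP datagrams over 127.0.0.1 (kernel path)."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))      # port 0: let the kernel pick a free port
    rx.settimeout(1.0)             # do not hang forever if something is wrong
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    addr = rx.getsockname()
    latencies = []
    try:
        for i in range(samples):
            payload = i.to_bytes(8, "big")
            t0 = time.perf_counter_ns()
            tx.sendto(payload, addr)    # syscall: user mode -> kernel mode
            data, _ = rx.recvfrom(64)   # syscall: waits for the kernel to deliver
            t1 = time.perf_counter_ns()
            if data != payload:
                raise RuntimeError("unexpected datagram on loopback")
            latencies.append(t1 - t0)
    finally:
        rx.close()
        tx.close()
    return latencies


if __name__ == "__main__":
    lat = sorted(loopback_udp_latency_ns())
    # Compare the median against the tail: the gap is the jitter.
    print("p50 ns:", lat[len(lat) // 2])
    print("p99 ns:", lat[int(len(lat) * 0.99)])
    print("max ns:", lat[-1])
```

Run it a few times on a busy machine and watch the maximum move around far more than the median does.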
Kernel Bypass Architecture: A Paradigm Shift
Kernel bypass is exactly what it sounds like: we fire the OS.
The OS keeps the “Control Plane” (setting up connections), but the “Data Plane” (sending/receiving packets) becomes a direct conversation between your application and the NIC.
The Result:
- Zero Copy: Data goes from NIC to App memory. No kernel buffers.
- Zero Context Switches: No Ring 0/Ring 3 transitions.
- Zero Interrupts: We poll (spin) on the hardware ring.
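The polling idea can be sketched with a toy receive ring. This is only a model under loud assumptions: `RxRing`, `fake_nic`, and `spin_receive` are hypothetical names, a Python thread stands in for the NIC, and the code leans on CPython's global interpreter lock where a real ring needs memory barriers and DMA. What it does show is the shape of the data plane: the consumer never sleeps and never asks anyone for data, it just keeps comparing two indices.

```python
import threading


class RxRing:
    """Toy single-producer/single-consumer ring standing in for a NIC RX ring."""

    def __init__(self, size: int):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self._slots = [None] * size
        self._mask = size - 1
        self._head = 0  # next slot the producer (the "NIC") will write
        self._tail = 0  # next slot the consumer (the app) will read

    def produce(self, frame: bytes) -> bool:
        if self._head - self._tail == len(self._slots):
            return False  # ring full: real hardware would drop the packet here
        self._slots[self._head & self._mask] = frame
        self._head += 1   # publish only after the slot has been written
        return True

    def poll(self):
        """Non-blocking check: returns a frame, or None if the ring is empty."""
        if self._tail == self._head:
            return None
        frame = self._slots[self._tail & self._mask]
        self._tail += 1
        return frame


def fake_nic(ring: RxRing, count: int) -> None:
    """Hypothetical producer; retries instead of dropping to keep the demo simple."""
    for i in range(count):
        while not ring.produce(i.to_bytes(4, "big")):
            pass


def spin_receive(ring: RxRing, count: int) -> list:
    """Busy-wait on the ring until `count` frames have been consumed."""
    frames = []
    while len(frames) < count:
        frame = ring.poll()
        if frame is not None:
            frames.append(frame)
    return frames


if __name__ == "__main__":
    ring = RxRing(64)
    nic = threading.Thread(target=fake_nic, args=(ring, 256))
    nic.start()
    got = spin_receive(ring, 256)
    nic.join()
    print("received", len(got), "frames")
```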
Solarflare OpenOnload: The “Cheat Code”
OpenOnload is the most commercially successful bypass technology for one reason: Laziness.
Porting a million lines of legacy C++ code to a new API like DPDK is a nightmare. Onload lets you cheat. You just run your standard POSIX socket application with onload ./my_app, and magic happens.
How it works
Onload uses LD_PRELOAD to intercept socket calls (recv, send, epoll). It maps the NIC hardware directly into your process address space.
When you call recv(), instead of asking the kernel, Onload checks the hardware ring directly. If it’s empty, it spins (busy-waits).
The “Spinning” Strategy:
In HFT, we don’t sleep. Sleeping means the OS scheduler takes your CPU away. Waking up takes forever. So we tell Onload to spin:
export EF_POLL_USEC=100000
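The value is in microseconds, so this spins for up to 100 ms before falling back to a blocking wait. While it spins, the core sits at 100% utilization, but a packet that arrives inside that window is processed immediately, with no wakeup in the way.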
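The spin-then-block policy itself is easy to model on an ordinary socket. The sketch below is not how Onload implements it, and `recv_spin_then_block` is a made-up name: every poll here is still a non-blocking syscall into the kernel, whereas Onload polls the ring from user space. It only illustrates the trade being made, burning CPU for a bounded time before giving up and sleeping.

```python
import select
import socket
import time


def recv_spin_then_block(sock, bufsize, spin_usec, block_timeout_s=None):
    """Busy-poll for up to spin_usec microseconds, then block.

    Returns (data, how) where how is "spin", "block", or "timeout".
    """
    sock.setblocking(False)
    deadline = time.perf_counter_ns() + spin_usec * 1000
    while time.perf_counter_ns() < deadline:
        try:
            return sock.recv(bufsize), "spin"   # data was ready: no sleep, no wakeup
        except BlockingIOError:
            continue                            # nothing yet: keep spinning
    # Spin budget exhausted: hand the wait to the kernel and accept the wakeup cost.
    ready, _, _ = select.select([sock], [], [], block_timeout_s)
    if not ready:
        return None, "timeout"
    return sock.recv(bufsize), "block"


if __name__ == "__main__":
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.sendto(b"tick", rx.getsockname())
    print(recv_spin_then_block(rx, 64, spin_usec=100000, block_timeout_s=1.0))
    rx.close()
    tx.close()
```

A short spin budget saves power on quiet feeds but brings the scheduler back into the picture; a long one keeps latency flat at the cost of a dedicated core.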
EF_VI: The “Metal” Interface
Onload is great, but it still has overhead. It still pretends to be a socket.
For the absolute lowest latency—specifically for Market Data Feed Handlers—we use EF_VI (EtherFabric Virtual Interface).
EF_VI is raw. It’s Layer 2. You don’t get sockets. You don’t get TCP. You get raw Ethernet frames dumped into a memory buffer.
Capabilities:
- True Zero-Copy: The NIC DMAs packets directly into your memory.
- Raw Access: You parse the Ethernet/IP/UDP headers yourself.
- Hardware Filtering: You tell the NIC “send multicast group X to this queue.”
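"Parse the headers yourself" looks roughly like the sketch below, which walks an untagged Ethernet II frame carrying IPv4 and UDP. It is Python for readability only, the function names are hypothetical, and it deliberately ignores VLAN tags, IP fragments, and checksums, all of which a production handler has to decide how to treat. The frame builder exists just to produce a test input; it leaves both checksums at zero.

```python
import socket
import struct

ETH_HDR_LEN = 14
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17


def parse_udp_frame(frame: bytes):
    """Return (src_ip, dst_ip, src_port, dst_port, payload), or None if not IPv4/UDP."""
    if len(frame) < ETH_HDR_LEN + 20 + 8:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:        # e.g. ARP, or a VLAN tag (0x8100)
        return None
    ip = ETH_HDR_LEN
    version = frame[ip] >> 4
    ihl = (frame[ip] & 0x0F) * 4           # IP header length in bytes
    if version != 4 or ihl < 20 or frame[ip + 9] != IPPROTO_UDP:
        return None
    udp = ip + ihl
    if len(frame) < udp + 8:
        return None
    src_port, dst_port, udp_len = struct.unpack_from("!HHH", frame, udp)
    if udp_len < 8 or udp + udp_len > len(frame):
        return None
    # Trust the UDP length, not the frame length: short frames carry Ethernet padding.
    payload = frame[udp + 8 : udp + udp_len]
    src_ip = socket.inet_ntoa(frame[ip + 12 : ip + 16])
    dst_ip = socket.inet_ntoa(frame[ip + 16 : ip + 20])
    return src_ip, dst_ip, src_port, dst_port, payload


def build_test_frame(payload, src_ip="10.0.0.1", dst_ip="239.1.1.1",
                     src_port=30001, dst_port=30001):
    """Hypothetical helper: assemble a minimal multicast UDP frame for testing."""
    eth = (bytes.fromhex("01005e010101") + bytes.fromhex("020000000001")
           + struct.pack("!H", ETHERTYPE_IPV4))
    ip_hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + 8 + len(payload), 0, 0,
                         64, IPPROTO_UDP, 0,
                         socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    udp_hdr = struct.pack("!HHHH", src_port, dst_port, 8 + len(payload), 0)
    return eth + ip_hdr + udp_hdr + payload


if __name__ == "__main__":
    print(parse_udp_frame(build_test_frame(b"tick")))
```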
The Trade-off: EF_VI is hard. You are managing ring pointers, memory barriers, and doorbells manually. If you screw up, you crash. But it saves you about 1 microsecond compared to Onload. In HFT, that 1 microsecond is worth the pain.
Troubleshooting: The Art of the “Buffer Full”
The most common failure mode in bypass is Packet Loss.
If your app is too slow, the ring buffer fills up. The NIC has nowhere to put new packets, so it drops them.
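The arithmetic behind that is worth seeing once. Here is a small tick-based simulation, a sketch with invented numbers and a hypothetical function name, in which packets arrive in a burst, the application drains a fixed number per tick, and anything that finds the ring full is dropped.

```python
def simulate_ring_drops(ring_size, arrivals, drained_per_tick):
    """Tick-based model of an RX ring.

    arrivals: packets arriving in each tick.
    Returns (delivered, dropped, peak_occupancy).
    """
    occupancy = delivered = dropped = peak = 0
    for n in arrivals:
        accepted = min(n, ring_size - occupancy)  # only free slots can take packets
        dropped += n - accepted                   # the rest have nowhere to go
        occupancy += accepted
        peak = max(peak, occupancy)
        drained = min(occupancy, drained_per_tick)
        occupancy -= drained
        delivered += drained
    return delivered, dropped, peak


if __name__ == "__main__":
    # A burst of 100 packets per tick for 10 ticks, then silence.
    burst = [100] * 10 + [0] * 10
    for size in (512, 1024):
        print(size, simulate_ring_drops(size, burst, drained_per_tick=50))
```

With these made-up numbers the 512-entry ring loses 38 of the 1000 packets, while the 1024-entry ring peaks at 550 and loses none. A bigger ring absorbs the burst, but if the drain rate stays below the sustained arrival rate, no ring size saves you.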
Diagnosing with onload_stackdump:
- rx_nodesc_drop_cnt: The “No Descriptor” drop. This is the killer. It means the ring was full. Your thread was either sleeping (bad config) or too slow (bad code).
- drop: Socket buffer full. Increase EF_UDP_RCVBUF.
Conclusion: The Engineer’s Choice
The Kernel is for management interfaces and SSH. OpenOnload is the workhorse. It’s fast enough for 90% of use cases and requires zero code changes. EF_VI is the Formula 1 car. It’s dangerous, uncomfortable, and requires a team of engineers to keep running. But if you need to win, it’s the only choice.
Choose your poison.