The Silent Performance Killer: Understanding NUMA in HFT
The Nanosecond Frontier
Is speed just about raw clock cycles? Or are we all just chasing ghosts?
In the world of High-Frequency Trading (HFT), “fast” is a relative term. To a web server, 100 milliseconds is fast. To a database, 1 millisecond is fast. But in HFT, we measure life in nanoseconds.
But here’s the kicker: when your trading algorithm is racing against light itself, the physical distance between your CPU core and its memory becomes the defining factor of profitability. It’s not just about buying the fastest hardware anymore. It’s about navigating the complex, non-uniform terrain of modern server architecture.
This post dives deep into Non-Uniform Memory Access (NUMA): the architectural reality that turns ordinary memory access into a minefield of latency penalties. If you aren’t engineering around it, you’re leaving money on the table.
The Physics of “Here” vs. “There”
Modern servers are no longer monolithic. They are distributed systems in a box. When your CPU needs to fetch data, the cost of that fetch depends entirely on where that data lives.
- UMA (Uniform Memory Access): Old school. Any core accesses any RAM with the same latency. Rare these days.
- NUMA (Non-Uniform Memory Access): The reality. Memory is attached to specific sockets or “tiles.” Accessing your local RAM is fast; accessing your neighbor’s RAM requires traversing an interconnect bridge.
The Cost of a Hop
Traversing the interconnect (Intel UPI or AMD Infinity Fabric) isn’t free. It costs time—specifically, about 40-60 nanoseconds.
In a winner-takes-all race, a 60ns penalty on a memory fetch is effectively a “Did Not Finish.”
```mermaid
graph TD
    subgraph Socket_0
        Core_0[Core 0]
        RAM_0
        L3_0[L3 Cache]
        Core_0 -->|~80ns| RAM_0
        Core_0 -->|~15ns| L3_0
    end
    subgraph Socket_1
        RAM_1
    end
    Core_0 -.->|UPI / Infinity Fabric +60ns| RAM_1
    style Core_0 fill:#f9f,stroke:#333,stroke-width:2px
    style RAM_0 fill:#bfb,stroke:#333,stroke-width:2px
    style RAM_1 fill:#fbb,stroke:#333,stroke-width:2px
```
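The exact numbers depend on the part, the BIOS, and the load, so measure them on your own box. Below is a rough sketch (not production code, and the file name, buffer size, and iteration count are arbitrary): it uses libnuma to place a buffer first on the local node and then on the farthest node, and times one dependent load via a pointer chase. It assumes libnuma is installed and at least two NUMA nodes are visible; build with `g++ -O2 numa_hop.cpp -lnuma`.

```cpp
// numa_hop.cpp - rough local-vs-remote latency probe (illustrative sketch).
#include <numa.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Chase a randomized pointer chain so every load is a dependent DRAM access.
static double ns_per_load(void* buf, size_t slots, size_t iters) {
    auto** chain = static_cast<void**>(buf);
    std::vector<size_t> order(slots);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (size_t i = 0; i < slots; ++i)
        chain[order[i]] = &chain[order[(i + 1) % slots]];

    void* p = &chain[order[0]];
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iters; ++i) p = *static_cast<void**>(p);
    auto t1 = std::chrono::steady_clock::now();
    if (p == buf) std::puts("");  // keep the loop from being optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    if (numa_available() < 0 || numa_max_node() < 1) {
        std::fprintf(stderr, "need libnuma and at least two NUMA nodes\n");
        return 1;
    }
    numa_run_on_node(0);                       // keep this thread on node 0
    const size_t bytes = 256UL << 20;          // 256 MiB: far bigger than L3
    const size_t slots = bytes / sizeof(void*);
    for (int node : {0, numa_max_node()}) {    // local node, then farthest node
        void* buf = numa_alloc_onnode(bytes, node);
        if (!buf) { std::perror("numa_alloc_onnode"); return 1; }
        std::printf("node %d: ~%.1f ns per dependent load\n",
                    node, ns_per_load(buf, slots, 10000000));
        numa_free(buf, bytes);
    }
    return 0;
}
```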
Hardware Landscape: Know Your Terrain
Intel and AMD solved the core-count scaling problem differently, but both solutions impose strict locality requirements.
| Feature | Intel Sapphire/Emerald Rapids | AMD EPYC Genoa/Turin |
|---|---|---|
| Topology | Tile-based Mesh. Uses 4 compute tiles connected by EMIB bridges. | Chiplet (MCM). Central I/O Die surrounded by Compute Dies (CCDs). |
| Key BIOS Setting | SNC (Sub-NUMA Clustering). Splits one socket into 2 or 4 logical NUMA nodes. | NPS (Nodes Per Socket). Splits one socket into 2 or 4 logical NUMA nodes. |
| Interconnect | UPI (Ultra Path Interconnect). | Infinity Fabric (IF). |
| Local Latency | ~82 ns (Idle) | ~103 ns (Idle) |
| Remote Latency | ~143 ns (Cross-socket) | ~220 ns (Cross-socket) |
| Latency Penalty | ~1.7x slower | ~2.1x slower |
The Takeaway: AMD offers massive core counts, but the penalty for crossing the Infinity Fabric to a remote socket is severe (>200ns). Intel’s penalty is lower, but still catastrophic for jitter-sensitive strategies.
The “Silent Killers” of Latency
You can buy the fastest 5.0 GHz processor, but if you ignore NUMA topology, you are just burning cash.
1. The Interconnect Tax
If you run your application on Socket 0 but allocate your Order Book on Socket 1, every single price update must cross the interconnect. This saturates the link and spikes latency. Simple as that.
2. False Sharing (Coherence Storms)
When two threads on different sockets write to variables that sit on the same cache line (even if the variables are different), the cache line must physically bounce between sockets.
- Result: A simple atomic increment that should take 5ns takes 300ns.
- Fix: Use `alignas(64)` in C++ to pad data structures to cache-line boundaries, as sketched below.
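As a minimal sketch of that fix (the struct and field names are illustrative, not from any real trading codebase), giving each hot counter its own 64-byte line stops the ping-pong:

```cpp
#include <atomic>
#include <cstdint>

// BAD: both counters sit in the same 64-byte cache line. Writers on
// different sockets force the line to bounce across the interconnect.
struct SharedCounters {
    std::atomic<std::uint64_t> orders_sent{0};
    std::atomic<std::uint64_t> acks_received{0};
};

// GOOD: alignas(64) starts each counter on its own cache line, so each
// writer keeps its line exclusively in its local cache.
struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> orders_sent{0};
    alignas(64) std::atomic<std::uint64_t> acks_received{0};
};

static_assert(sizeof(PaddedCounters) == 128,
              "each counter should occupy its own 64-byte cache line");
```

On C++17 and later you can replace the hard-coded 64 with `std::hardware_destructive_interference_size` from `<new>`.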
3. The “Housekeeper” Problem
The Linux kernel is a noisy neighbor. It runs timers, scheduler ticks, and RCU callbacks. If these land on your trading core, you experience “jitter”—random latency spikes of 5-50 microseconds.
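A crude way to make that jitter visible is to spin on a monotonic clock and record the largest gap between consecutive reads: on an isolated, tickless core the worst gap stays small, while on a housekeeping core you will see the multi-microsecond spikes described above. The sketch below is illustrative; the 5-second run time and the pinning command are arbitrary choices.

```cpp
// jitter_probe.cpp - crude OS-jitter detector (illustrative sketch).
// Run it pinned to the core under test, e.g.: taskset -c 5 ./jitter_probe
#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;
    auto worst = clock::duration::zero();
    auto prev  = clock::now();
    const auto deadline = prev + std::chrono::seconds(5);
    // Any large gap between consecutive timestamps means the kernel
    // (timer tick, IRQ, or another task) stole this core for a while.
    while (prev < deadline) {
        const auto now = clock::now();
        if (now - prev > worst) worst = now - prev;
        prev = now;
    }
    std::printf("worst gap: %lld ns\n", static_cast<long long>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(worst).count()));
    return 0;
}
```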
Optimization Strategy: The Iron Laws of Locality
To win, you must strictly enforce where code runs and where memory lives.
Phase 1: BIOS Tuning
The default BIOS settings are designed for throughput (web servers), not latency.
- Node Interleaving: DISABLE. (Force the system to expose actual NUMA topology).
- Power Profile: Performance.
- C-States: DISABLE. (Prevent cores from sleeping; waking up takes microseconds).
- Sub-NUMA Clustering (SNC/NPS): ENABLE. (Break the massive socket into smaller, faster local domains).
Phase 2: OS Isolation
Evict the OS from your trading cores.
Kernel Boot Parameters:
```
# Isolate cores 4-63 for trading
isolcpus=4-63          # Stop the scheduler from putting processes here
nohz_full=4-63         # Stop the kernel timer tick (critical!)
rcu_nocbs=4-63         # Offload RCU callbacks to housekeeping cores
default_hugepagesz=1G  # Use 1 GB pages to eliminate TLB misses
hugepages=32
```
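After rebooting, confirm the kernel actually honored these parameters. The sketch below simply echoes the kernel's own view back; it assumes a reasonably recent kernel that exposes `isolated` and `nohz_full` under `/sys/devices/system/cpu/` (most modern ones do).

```cpp
// check_isolation.cpp - print which cores the kernel treats as isolated/tickless.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (const char* path : {"/sys/devices/system/cpu/isolated",
                             "/sys/devices/system/cpu/nohz_full"}) {
        std::ifstream f(path);
        std::string cpus;
        std::getline(f, cpus);
        std::cout << path << ": " << (cpus.empty() ? "(none)" : cpus) << "\n";
    }
    return 0;
}
```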
Phase 3: Code-Level Enforcement
Don’t hope the OS puts memory in the right place. Demand it.
C++: Allocating on a Specific NUMA Node
Use libnuma to bypass the default “first-touch” policy and strictly bind memory.
```cpp
#include <numa.h>
#include <numaif.h>
#include <sys/mman.h>
#include <cstdlib>

int main() {
    const size_t size = 1024UL * 1024 * 1024;     // 1 GB

    // 1. Allocate a 1 GB hugepage (requires default_hugepagesz=1G at boot)
    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) exit(1);              // no hugepage available - fail fast

    // 2. Strict NUMA binding (fail if Node 0 is full)
    unsigned long nodemask = 1UL << 0;            // bitmask for Node 0
    long status = mbind(addr, size, MPOL_BIND, &nodemask,
                        sizeof(nodemask) * 8, 0);
    if (status != 0) {
        // Handle error - do not fall back to remote memory!
        // In HFT, it is better to crash than to trade slowly.
        exit(1);
    }
    // ... build the order book inside 'addr' ...
    return 0;
}
```
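Binding the memory is only half of the contract: the thread that touches it must stay on a core of the same node. A minimal sketch (core 5 is an assumed isolated core on Node 0, not something taken from the code above) pins the calling thread with `pthread_setaffinity_np`; build with `-pthread`.

```cpp
// pin_thread.cpp - pin the calling thread to one core on NUMA node 0.
// Assumption: core 5 is an isolated core on node 0 (check with lstopo/lscpu).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstdlib>

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // Bind the calling thread. After this, its first-touch allocations and
    // the mbind()-ed region above are both local to this core's node.
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        std::exit(1);   // as with mbind: better to die than run remote
    }
}

int main() {
    pin_to_core(5);
    std::printf("trading thread running on CPU %d\n", sched_getcpu());
    return 0;
}
```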
Command Line: Running the Application
If you can’t modify the source, use numactl:
```bash
# Run on cores 4-7 (Node 0), allocate ONLY from Node 0
numactl --physcpubind=4-7 --membind=0 ./strategy_engine
```
Benchmarking: Trust, But Verify
Never assume your topology is what you think it is. Use Intel Memory Latency Checker (MLC) to generate a truth table for your specific server.
Example MLC Output (Sapphire Rapids):
| From \ To | Node 0 | Node 1 (SNC Neighbor) | Node 2 (Remote Socket) |
|---|---|---|---|
| Node 0 | 82.0 ns | 115.4 ns | 143.1 ns |
| Node 1 | 115.4 ns | 82.0 ns | 143.1 ns |
| Node 2 | 143.1 ns | 143.1 ns | 82.0 ns |
If you see >90ns for Local (Node 0 to Node 0), check if C-States are actually disabled.
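The same trust-but-verify rule applies to individual allocations: you can ask the kernel which node a page actually landed on. The sketch below uses the `MPOL_F_NODE | MPOL_F_ADDR` query of `get_mempolicy` on a throwaway buffer (the 2 MB anonymous mapping is illustrative, not the hugepage mapping from earlier); link with `-lnuma`.

```cpp
// where_is_my_page.cpp - ask the kernel which NUMA node a page lives on.
#include <numaif.h>
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t len = 2UL * 1024 * 1024;
    void* buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { std::perror("mmap"); return 1; }

    std::memset(buf, 0, len);   // first touch: pages get physically allocated

    // With MPOL_F_NODE | MPOL_F_ADDR, get_mempolicy() writes the node that
    // holds the page containing 'buf' into 'node'.
    int node = -1;
    if (get_mempolicy(&node, nullptr, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) != 0) {
        std::perror("get_mempolicy");
        return 1;
    }
    std::printf("page at %p is on NUMA node %d\n", buf, node);
    return 0;
}
```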
Summary Checklist
- Map Hardware: Use `lstopo` to see which PCI root your NIC is attached to.
- Align NIC & CPU: Ensure your trading thread runs on the same NUMA node as your NIC.
- Isolate Cores: Use `isolcpus` and `nohz_full`.
- Bind Memory: Use `mbind` or `numactl --membind` to prevent remote allocations.
- Verify: Run `mlc` to confirm latencies are within spec.
Overall, optimizing for NUMA is painful but necessary. In the nanosecond frontier, you don’t optimize for the average case. You optimize for the worst case. By respecting the physics of NUMA, you ensure that when the market moves, you actually move first.