The Silent Performance Killer: Understanding NUMA in HFT
The Nanosecond Frontier
Is speed just about raw clock cycles? Or are we all just chasing ghosts?
In the world of High-Frequency Trading (HFT), “fast” is a relative term. To a web server, 100 milliseconds is fast. To a database, 1 millisecond is fast. But in HFT, we measure life in nanoseconds.
But here’s the kicker: when your trading algorithm is racing against light itself, the physical distance between your CPU core and its memory becomes the defining factor of profitability. It’s not just about buying the fastest hardware anymore. It’s about navigating the complex, non-uniform terrain of modern server architecture.
This post dives deep into Non-Uniform Memory Access (NUMA): the architectural reality that turns ordinary memory access into a minefield of latency penalties. If you aren’t engineering around it, you’re leaving money on the table.
The Physics of “Here” vs. “There”
Modern servers are no longer monolithic. They are distributed systems in a box. When your CPU needs to fetch data, the cost of that fetch depends entirely on where that data lives.
- UMA (Uniform Memory Access): Old school. Any core accesses any RAM with the same latency. Rare these days.
- NUMA (Non-Uniform Memory Access): The reality. Memory is attached to specific sockets or “tiles.” Accessing your local RAM is fast; accessing your neighbor’s RAM requires traversing an interconnect bridge.
The Cost of a Hop
Traversing the interconnect (Intel UPI or AMD Infinity Fabric) isn’t free. It costs time—specifically, about 40-60 nanoseconds.
In a winner-takes-all race, a 60ns penalty on a memory fetch is effectively a “Did Not Finish.”
```mermaid
graph TD
    subgraph Socket_0
        Core_0[Core 0]
        RAM_0
        L3_0[L3 Cache]
        Core_0 -->|~80ns| RAM_0
        Core_0 -->|~15ns| L3_0
    end
    subgraph Socket_1
        RAM_1
    end
    Core_0 -.->|UPI / Infinity Fabric +60ns| RAM_1
    style Core_0 fill:#f9f,stroke:#333,stroke-width:2px
    style RAM_0 fill:#bfb,stroke:#333,stroke-width:2px
    style RAM_1 fill:#fbb,stroke:#333,stroke-width:2px
```
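The exact numbers depend on the part, the BIOS, and the load, so measure them on your own box. Below is a rough sketch (not production code, and the file name, buffer size, and iteration count are arbitrary): it uses libnuma to place a buffer first on the local node and then on the farthest node, and times one dependent load via a pointer chase. It assumes libnuma is installed and at least two NUMA nodes are visible; build with `g++ -O2 numa_hop.cpp -lnuma`.

```cpp
// numa_hop.cpp - rough local-vs-remote latency probe (illustrative sketch).
#include <numa.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Chase a randomized pointer chain so every load is a dependent DRAM access.
static double ns_per_load(void* buf, size_t slots, size_t iters) {
    auto** chain = static_cast<void**>(buf);
    std::vector<size_t> order(slots);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (size_t i = 0; i < slots; ++i)
        chain[order[i]] = &chain[order[(i + 1) % slots]];

    void* p = &chain[order[0]];
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iters; ++i) p = *static_cast<void**>(p);
    auto t1 = std::chrono::steady_clock::now();
    if (p == buf) std::puts("");  // keep the loop from being optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    if (numa_available() < 0 || numa_max_node() < 1) {
        std::fprintf(stderr, "need libnuma and at least two NUMA nodes\n");
        return 1;
    }
    numa_run_on_node(0);                       // keep this thread on node 0
    const size_t bytes = 256UL << 20;          // 256 MiB: far bigger than L3
    const size_t slots = bytes / sizeof(void*);
    for (int node : {0, numa_max_node()}) {    // local node, then farthest node
        void* buf = numa_alloc_onnode(bytes, node);
        if (!buf) { std::perror("numa_alloc_onnode"); return 1; }
        std::printf("node %d: ~%.1f ns per dependent load\n",
                    node, ns_per_load(buf, slots, 10000000));
        numa_free(buf, bytes);
    }
    return 0;
}
```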
Hardware Landscape: Know Your Terrain
Intel and AMD solved the core-count scaling problem differently, but both solutions impose strict locality requirements.
| Feature | Intel Sapphire/Emerald Rapids | AMD EPYC Genoa/Turin |
|---|---|---|
| Topology | Tile-based Mesh. Uses 4 compute tiles connected by EMIB bridges. | Chiplet (MCM). Central I/O Die surrounded by Compute Dies (CCDs). |
| Key BIOS Setting | SNC (Sub-NUMA Clustering). Splits one socket into 2 or 4 logical NUMA nodes. | NPS (Nodes Per Socket). Splits one socket into 2 or 4 logical NUMA nodes. |
| Interconnect | UPI (Ultra Path Interconnect). | Infinity Fabric (IF). |
| Local Latency | ~82 ns (Idle) | ~103 ns (Idle) |
| Remote Latency | ~143 ns (Cross-socket) | ~220 ns (Cross-socket) |
| Latency Penalty | ~1.7x slower | ~2.1x slower |
The Takeaway: AMD offers massive core counts, but the penalty for crossing the Infinity Fabric to a remote socket is severe (>200ns). Intel’s penalty is lower, but still catastrophic for jitter-sensitive strategies.
The “Silent Killers” of Latency
You can buy the fastest 5.0 GHz processor, but if you ignore NUMA topology, you are just burning cash.
1. The Interconnect Tax
If you run your application on Socket 0 but allocate your Order Book on Socket 1, every single price update must cross the interconnect. This saturates the link and spikes latency. Simple as that.
2. False Sharing (Coherence Storms)
When two threads on different sockets write to variables that sit on the same cache line (even if the variables are different), the cache line must physically bounce between sockets.
- Result: A simple atomic increment that should take 5ns takes 300ns.
- Fix: Use `alignas(64)` in C++ to pad data structures to cache-line boundaries, as sketched below.
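As a minimal sketch of that fix (the struct and field names are illustrative, not from any real trading codebase), giving each hot counter its own 64-byte line stops the ping-pong:

```cpp
#include <atomic>
#include <cstdint>

// BAD: both counters sit in the same 64-byte cache line. Writers on
// different sockets force the line to bounce across the interconnect.
struct SharedCounters {
    std::atomic<std::uint64_t> orders_sent{0};
    std::atomic<std::uint64_t> acks_received{0};
};

// GOOD: alignas(64) starts each counter on its own cache line, so each
// writer keeps its line exclusively in its local cache.
struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> orders_sent{0};
    alignas(64) std::atomic<std::uint64_t> acks_received{0};
};

static_assert(sizeof(PaddedCounters) == 128,
              "each counter should occupy its own 64-byte cache line");
```

On C++17 and later you can replace the hard-coded 64 with `std::hardware_destructive_interference_size` from `<new>`.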
3. The “Housekeeper” Problem
The Linux kernel is a noisy neighbor. It runs timers, scheduler ticks, and RCU callbacks. If these land on your trading core, you experience “jitter”—random latency spikes of 5-50 microseconds.
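A crude way to make that jitter visible is to spin on a monotonic clock and record the largest gap between consecutive reads: on an isolated, tickless core the worst gap stays small, while on a housekeeping core you will see the multi-microsecond spikes described above. The sketch below is illustrative; the 5-second run time and the pinning command are arbitrary choices.

```cpp
// jitter_probe.cpp - crude OS-jitter detector (illustrative sketch).
// Run it pinned to the core under test, e.g.: taskset -c 5 ./jitter_probe
#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;
    auto worst = clock::duration::zero();
    auto prev  = clock::now();
    const auto deadline = prev + std::chrono::seconds(5);
    // Any large gap between consecutive timestamps means the kernel
    // (timer tick, IRQ, or another task) stole this core for a while.
    while (prev < deadline) {
        const auto now = clock::now();
        if (now - prev > worst) worst = now - prev;
        prev = now;
    }
    std::printf("worst gap: %lld ns\n", static_cast<long long>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(worst).count()));
    return 0;
}
```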
Optimization Strategy: The Iron Laws of Locality
To win, you must strictly enforce where code runs and where memory lives.
Phase 1: BIOS Tuning
The default BIOS settings are designed for throughput (web servers), not latency.
- Node Interleaving: DISABLE. (Force the system to expose actual NUMA topology).
- Power Profile: Performance.
- C-States: DISABLE. (Prevent cores from sleeping; waking up takes microseconds).
- Sub-NUMA Clustering (SNC/NPS): ENABLE. (Break the massive socket into smaller, faster local domains).
Phase 2: OS Isolation
Evict the OS from your trading cores.
Kernel Boot Parameters:
```
# Isolate cores 4-63 for trading
isolcpus=4-63          # Stop the scheduler from putting processes here
nohz_full=4-63         # Stop the kernel timer tick (critical!)
rcu_nocbs=4-63         # Offload RCU callbacks to housekeeping cores
default_hugepagesz=1G  # Use 1 GB pages to eliminate TLB misses
hugepages=32
```
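After rebooting, confirm the kernel actually honored these parameters. The sketch below simply echoes the kernel's own view back; it assumes a reasonably recent kernel that exposes `isolated` and `nohz_full` under `/sys/devices/system/cpu/` (most modern ones do).

```cpp
// check_isolation.cpp - print which cores the kernel treats as isolated/tickless.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (const char* path : {"/sys/devices/system/cpu/isolated",
                             "/sys/devices/system/cpu/nohz_full"}) {
        std::ifstream f(path);
        std::string cpus;
        std::getline(f, cpus);
        std::cout << path << ": " << (cpus.empty() ? "(none)" : cpus) << "\n";
    }
    return 0;
}
```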
Phase 3: Code-Level Enforcement
Don’t hope the OS puts memory in the right place. Demand it.
C++: Allocating on a Specific NUMA Node
Use libnuma to bypass the default “first-touch” policy and strictly bind memory.
```cpp
#include <numa.h>
#include <numaif.h>
#include <sys/mman.h>
#include <cstdlib>

int main() {
    const size_t size = 1024UL * 1024 * 1024;     // 1 GB

    // 1. Allocate a 1 GB hugepage (requires default_hugepagesz=1G at boot)
    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) exit(1);              // no hugepage available - fail fast

    // 2. Strict NUMA binding (fail if Node 0 is full)
    unsigned long nodemask = 1UL << 0;            // bitmask for Node 0
    long status = mbind(addr, size, MPOL_BIND, &nodemask,
                        sizeof(nodemask) * 8, 0);
    if (status != 0) {
        // Handle error - do not fall back to remote memory!
        // In HFT, it is better to crash than to trade slowly.
        exit(1);
    }
    // ... build the order book inside 'addr' ...
    return 0;
}
```
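Binding the memory is only half of the contract: the thread that touches it must stay on a core of the same node. A minimal sketch (core 5 is an assumed isolated core on Node 0, not something taken from the code above) pins the calling thread with `pthread_setaffinity_np`; build with `-pthread`.

```cpp
// pin_thread.cpp - pin the calling thread to one core on NUMA node 0.
// Assumption: core 5 is an isolated core on node 0 (check with lstopo/lscpu).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstdlib>

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // Bind the calling thread. After this, its first-touch allocations and
    // the mbind()-ed region above are both local to this core's node.
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        std::exit(1);   // as with mbind: better to die than run remote
    }
}

int main() {
    pin_to_core(5);
    std::printf("trading thread running on CPU %d\n", sched_getcpu());
    return 0;
}
```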
Command Line: Running the Application
If you can’t modify the source, use numactl:
```bash
# Run on cores 4-7 (Node 0), allocate ONLY from Node 0
numactl --physcpubind=4-7 --membind=0 ./strategy_engine
```
Benchmarking: Trust, But Verify
Never assume your topology is what you think it is. Use Intel Memory Latency Checker (MLC) to generate a truth table for your specific server.
Example MLC Output (Sapphire Rapids):
| From \ To | Node 0 | Node 1 (SNC Neighbor) | Node 2 (Remote Socket) |
|---|---|---|---|
| Node 0 | 82.0 ns | 115.4 ns | 143.1 ns |
| Node 1 | 115.4 ns | 82.0 ns | 143.1 ns |
| Node 2 | 143.1 ns | 143.1 ns | 82.0 ns |
If you see >90ns for Local (Node 0 to Node 0), check if C-States are actually disabled.
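The same trust-but-verify rule applies to individual allocations: you can ask the kernel which node a page actually landed on. The sketch below uses the `MPOL_F_NODE | MPOL_F_ADDR` query of `get_mempolicy` on a throwaway buffer (the 2 MB anonymous mapping is illustrative, not the hugepage mapping from earlier); link with `-lnuma`.

```cpp
// where_is_my_page.cpp - ask the kernel which NUMA node a page lives on.
#include <numaif.h>
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t len = 2UL * 1024 * 1024;
    void* buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { std::perror("mmap"); return 1; }

    std::memset(buf, 0, len);   // first touch: pages get physically allocated

    // With MPOL_F_NODE | MPOL_F_ADDR, get_mempolicy() writes the node that
    // holds the page containing 'buf' into 'node'.
    int node = -1;
    if (get_mempolicy(&node, nullptr, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) != 0) {
        std::perror("get_mempolicy");
        return 1;
    }
    std::printf("page at %p is on NUMA node %d\n", buf, node);
    return 0;
}
```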
Summary Checklist
- Map Hardware: Use `lstopo` to see which PCI root your NIC is attached to.
- Align NIC & CPU: Ensure your trading thread runs on the same NUMA node as your NIC.
- Isolate Cores: Use `isolcpus` and `nohz_full`.
- Bind Memory: Use `mbind` or `numactl --membind` to prevent remote allocations.
- Verify: Run `mlc` to confirm latencies are within spec.
Overall, optimizing for NUMA is painful but necessary. In the nanosecond frontier, you don’t optimize for the average case. You optimize for the worst case. By respecting the physics of NUMA, you ensure that when the market moves, you actually move first.