
NUMA & Topology

When marketing materials say “8-GPU server,” they imply 8 equivalent GPUs sitting in a row. The reality is a topology: a graph of links with very different bandwidths. Two GPUs might be connected by NVLink at 600 GB/s. Two others might only talk via PCIe at 64 GB/s. Two more might have to route through the CPU and the cross-socket interconnect, dropping to 10 GB/s.

The same is true on the CPU side. A dual-socket server is two computers in a box, each with its own attached RAM. The other socket’s memory is technically reachable, but it costs ~2× the latency of your local memory. This is NUMA (Non-Uniform Memory Access).

This lesson is about the layered hierarchy — caches → DRAM → NUMA → PCIe → NVLink → InfiniBand → CXL — and why every parallelism decision in modern training is really a topology decision in disguise. Place tensor parallelism on a slow link instead of a fast one, and your throughput can drop by ~60×.

TL;DR

  • A modern server is multiple sockets (CPUs), each with its own memory controller. Memory attached to one socket is local to it; memory on the other socket is remote — accessing it costs ~2× the latency and delivers roughly half the bandwidth.
  • This is NUMA (Non-Uniform Memory Access). Linux exposes it via /sys/devices/system/node/. Tools: numactl, lstopo, numastat. Pinning processes to NUMA nodes can be a 1.5–3× win on memory-bound workloads.
  • The same idea scales to GPUs: a multi-GPU node has a topology (NVLink mesh, NVSwitch, PCIe). GPU 0 and GPU 1 might be 600 GB/s apart (NVLink); GPU 0 and GPU 7 might be 200 GB/s (one NVSwitch hop).
  • Across nodes, InfiniBand / Ethernet RDMA add another tier: 25–400 Gb/s between nodes. The topology of a frontier training cluster matters as much as raw GPU count.
  • CXL (Compute Express Link, 2019+) is the emerging fabric for cache-coherent shared memory across CPUs and accelerators. Production rollout 2024–2026; reshaping how multi-host AI systems are built.

Why this matters

A 70B-model serving deployment that ignores topology can be 2× slower than the same hardware properly configured. Frontier training stacks (TorchTitan, Megatron-Core, NeMo) all encode topology into their parallelism mesh — TP within NVLink, PP across nodes, etc. Knowing how to read lstopo output, what numactl --cpunodebind does, and how NCCL discovers GPU topology is the foundation for any serious multi-GPU work.

Mental model

Bandwidth varies by more than an order of magnitude across the topology hierarchy (up to ~60× from the best path to the worst). Where data lives determines what speed you actually run at.
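
To make the hierarchy concrete, here are the approximate per-link bandwidths quoted in this lesson, written as Python constants; the exact figures vary by hardware generation and are only illustrative.

# Approximate bandwidths from this lesson (GB/s); real values depend on the platform.
BANDWIDTH_GBPS = {
    "local DRAM (per socket)":   200,
    "cross-socket UPI":           40,
    "NVLink / NVSwitch":         600,
    "PCIe x16 (PIX)":             64,
    "GPU-to-GPU via CPU (SYS)":   10,
    "InfiniBand 400 Gb/s NIC":    50,   # 400 Gb/s ≈ 50 GB/s
}

for link, bw in sorted(BANDWIDTH_GBPS.items(), key=lambda kv: -kv[1]):
    print(f"{link:28s} ~{bw:4d} GB/s  ({bw // 10}x the slowest path)")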

Concrete walkthrough

CPU NUMA basics

A dual-socket Xeon server: each socket has 4–8 memory channels feeding ~200 GB/s of local DRAM bandwidth. Memory across the socket boundary travels via UPI (Ultra Path Interconnect) at ~40 GB/s. 5× bandwidth gap; 2× latency gap.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-23 48-71
node 0 size: 192 GB
node 1 cpus: 24-47 72-95
node 1 size: 192 GB
node distances:
node   0   1
  0:  10  21
  1:  21  10

The “distance” matrix: 10 = local (1.0×), 21 = remote (2.1× the latency).
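
The same distances are exposed under /sys/devices/system/node/ (the path mentioned in the TL;DR), so they can be read without shelling out to numactl. A minimal sketch, assuming a standard Linux sysfs layout:

import glob

# Each /sys/devices/system/node/nodeN/distance holds row N of the distance matrix.
for path in sorted(glob.glob("/sys/devices/system/node/node*/distance")):
    node = path.split("/")[-2]      # e.g. "node0"
    with open(path) as f:
        row = f.read().split()      # e.g. ["10", "21"]
    print(f"{node}: {row}")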

A process spawned on node 1 but allocating from node 0’s memory pays the cross-socket cost on every memory access. Pinning fixes this:

numactl --cpunodebind=0 --membind=0 ./my_process

Or in code:

#include <numa.h>
numa_run_on_node(0);
numa_set_preferred(0);

For ML workloads, the typical setup is one process per NUMA node, GPU bindings matching CPU bindings (“GPU 0–3 on socket 0” type configs).
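
One way to wire that up in a launcher is to derive the NUMA node from the local rank and pin the worker's CPUs to that node's cores. A minimal sketch for the 2-socket, 8-GPU machine above, assuming a torchrun-style LOCAL_RANK variable (this handles CPU affinity only; memory binding still needs numactl --membind or libnuma):

import os

LOCAL_RANK = int(os.environ.get("LOCAL_RANK", "0"))   # set by torchrun / mpirun
GPUS_PER_SOCKET = 4                                    # assumption: 8 GPUs, 2 sockets

# CPU ranges per NUMA node, as reported by `numactl --hardware` above.
NUMA_CPUS = {
    0: list(range(0, 24)) + list(range(48, 72)),
    1: list(range(24, 48)) + list(range(72, 96)),
}

numa_node = LOCAL_RANK // GPUS_PER_SOCKET
os.sched_setaffinity(0, NUMA_CPUS[numa_node])          # pin this process's CPUs
print(f"rank {LOCAL_RANK}: GPU {LOCAL_RANK} -> NUMA node {numa_node}")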

lstopo — see your machine

$ lstopo --of png > topology.png

lstopo (from hwloc) generates a diagram showing CPU cores, caches, memory channels, PCIe links, GPUs. The single best tool for understanding “what is connected to what” on a server. Reading it once is enough for most decisions.

GPU topology

nvidia-smi topo --matrix:

      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0   X    NV4   NV4   NV4   NV4   NV4   NV4   NV4
GPU1  NV4    X    NV4   NV4   NV4   NV4   NV4   NV4
...

NV4 = a direct connection over 4 NVLinks (the intra-node NVSwitch fabric). On older topologies you might see:

  • NV2 — 2 NVLink connections.
  • PIX — same PCIe switch.
  • PHB — PCIe Host Bridge (traffic routed through the CPU, same socket).
  • SYS — across NUMA nodes via CPU.

A PIX-connected pair has ~64 GB/s; a SYS-connected pair drops to ~10 GB/s. NCCL discovers the topology and picks the best path automatically, but you can inspect what it chose via NCCL_TOPO_DUMP_FILE and tune GPU-Direct RDMA behavior via NCCL_NET_GDR_*.
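
To catch slow pairs in code rather than by eye, a rough sketch like this can scan the matrix for slow link types (assumes nvidia-smi is on the PATH; the exact columns of the topo output vary by driver version):

import subprocess

# Warn about GPU pairs whose traffic would route through the CPU / host bridge.
SLOW_LINKS = {"PHB", "SYS", "NODE"}
out = subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout

for line in out.splitlines():
    cols = line.split()
    # Data rows start with "GPUn" and contain the "X" diagonal marker.
    if not cols or not cols[0].startswith("GPU") or "X" not in cols:
        continue
    for dst, link in enumerate(cols[1:]):
        if link in SLOW_LINKS:
            print(f"{cols[0]} <-> GPU{dst}: {link} (expect ~10-64 GB/s, not NVLink)")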

NVSwitch

NVIDIA’s H100 / H200 / B200 nodes use NVSwitch — a chip that gives every-GPU-to-every-GPU full-bandwidth connectivity. 600 GB/s any-to-any within a node. This is what makes TP=8 work efficiently across 8 H100s.

Pre-NVSwitch (e.g., DGX-1 V100): GPUs were connected pairwise via NVLinks; some pairs had 4× links, some had 2×. AllReduce algorithms had to be topology-aware.

For Blackwell GB200 (NVL72), the NVSwitch fabric extends across 72 GPUs with full NVLink bandwidth between any pair. Effectively turns 72 GPUs into “one big GPU” from the comm perspective. Why GB200 NVL72 racks dominate frontier training conversations in 2025–2026.

Cross-node networking

Two big options for connecting nodes:

  • InfiniBand (IB): 200–400 Gb/s. Lower latency, RDMA-native. Standard in AI clusters.
  • RoCEv2 (RDMA over Ethernet): 200–400 Gb/s on Ethernet hardware. Cheaper; Meta uses this at scale.

Both support GPU-Direct RDMA: GPU memory on one node can be read directly by another node’s GPU without crossing the host CPU. Well below NVLink bandwidth (~50–100 GB/s vs ~600 GB/s), but vastly better than PCIe-only paths.

Topology-aware comm (NCCL on H100 + InfiniBand): DP and PP traffic tunneled over IB; TP stays within NVSwitch. This is exactly the parallelism-mesh decision from Tensor Parallel — pinned by topology.
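
Frameworks encode this as a device mesh whose inner dimension stays inside the NVSwitch domain. A minimal PyTorch sketch, assuming 2 nodes of 8 GPUs launched with torchrun (the mesh dimension names are illustrative, not required by the library):

import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))   # one process per GPU
dist.init_process_group(backend="nccl")                 # NCCL: NVLink intra-node, IB inter-node

# Outer dim (dp) spans the 2 nodes over InfiniBand; inner dim (tp) stays on NVSwitch.
mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))
tp_group = mesh.get_group("tp")   # 8 ranks on one node  -> NVLink/NVSwitch traffic
dp_group = mesh.get_group("dp")   # 2 ranks across nodes -> InfiniBand traffic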

CXL — the next layer

CXL (Compute Express Link, 1.x in 2019, 3.x in 2024) is a cache-coherent fabric that lets CPUs and accelerators share memory at PCIe-link speeds. Three classes:

  • CXL.io: PCIe-equivalent device I/O.
  • CXL.cache: device caches host memory (or vice versa) coherently.
  • CXL.mem: extend host memory with attached CXL memory devices.

Production CXL (PCIe Gen 5 / 6, ~64 / 128 GB/s) lets you build systems with shared memory pools spanning hundreds of GBs across many hosts. Big AI training stacks are starting to evaluate CXL as an alternative to (or complement of) RDMA. As of April 2026, deployment is still early — but it’s the architecture conversation worth tracking.

Production checklist

For any new multi-GPU deployment:

  1. nvidia-smi topo --matrix — read the GPU connectivity.
  2. numactl --hardware + lstopo — read the CPU/memory topology.
  3. Pin processes: one per NUMA node, GPU affinity matching socket affinity.
  4. Verify NCCL is using the right transport (NCCL_DEBUG=INFO).
  5. Match parallelism dims to topology: TP within NVSwitch, PP / DP across.
  6. For >1 node: confirm IB / RoCE is in use (not falling back to TCP).

Do these once during system setup; they’re foundational to every later perf tuning.
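
To keep those readings with your deployment configs, a small snapshot helper along these lines works; it is a sketch and assumes nvidia-smi, numactl, and hwloc's lstopo-no-graphics are installed on the node:

import pathlib
import subprocess

# Capture checklist items 1-2 once per node and store them next to the job configs.
CMDS = {
    "gpu_topology.txt":  ["nvidia-smi", "topo", "-m"],
    "numa_hardware.txt": ["numactl", "--hardware"],
    "cpu_topology.txt":  ["lstopo-no-graphics"],
}

outdir = pathlib.Path("topology_snapshot")
outdir.mkdir(exist_ok=True)
for fname, cmd in CMDS.items():
    result = subprocess.run(cmd, capture_output=True, text=True)
    (outdir / fname).write_text(result.stdout or result.stderr)
    print(f"wrote {outdir / fname}")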

Run it in your browser — topology cost simulator

The interactive Python example computes the time to AllReduce 1 GB across 8 GPUs given different topologies.
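
The widget itself is not reproduced here, but the calculation is simple to sketch. A minimal stand-in, assuming a ring AllReduce (total traffic per GPU is 2·(N−1)/N of the buffer) and the approximate per-link bandwidths quoted in this lesson:

# Time for a ring AllReduce of SIZE_GB across N_GPUS, limited by the slowest link.
SIZE_GB = 1.0
N_GPUS = 8

TOPOLOGIES_GBPS = {            # slowest link along the ring
    "NVSwitch (NVLink)": 600,
    "PCIe switch (PIX)":  64,
    "via CPU (SYS)":      10,
}

for name, bw in TOPOLOGIES_GBPS.items():
    traffic_gb = 2 * (N_GPUS - 1) / N_GPUS * SIZE_GB
    t_ms = traffic_gb / bw * 1e3
    print(f"{name:20s} ~{t_ms:7.1f} ms to AllReduce {SIZE_GB:.0f} GB across {N_GPUS} GPUs")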

The shape — orders-of-magnitude bandwidth differences across the topology — is what every training-systems decision encodes. Get the parallelism-to-topology mapping wrong and you’re paying the cross-NUMA / Ethernet cost on every layer.

Quick check

Fill in the blank
The Linux command-line tool for binding a process to a specific NUMA node:
Eight letters; combines numa + ctl.
Quick check
A team trains a 70B model on 16 H100s split across 2 nodes (NVLink intra-node, 200 Gb InfiniBand inter-node). Throughput is 40% of expected. Topology check that matters first:

Key takeaways

  1. NUMA on CPUs: local memory is fast, remote is ~2× slower. Pin processes with numactl.
  2. GPU topology: NVLink/NVSwitch within node; PCIe/IB across. ~60× bandwidth gap from best to worst path.
  3. Parallelism-mesh decisions follow topology: TP within NVLink, PP across nodes, DP everywhere.
  4. lstopo, numactl --hardware, nvidia-smi topo --matrix are the diagnostic commands. Run them on every new system.
  5. CXL is the emerging fabric for cache-coherent shared memory across hosts; watch it through 2026–2027.
