Networking

Distributed training has two mental layers. The upper layer is recipes: data-parallel, tensor-parallel, pipeline-parallel, FSDP, expert-parallel. The lower layer — the one that actually moves bytes between GPUs — is collectives, topology, and tracing. Most engineers can recite the recipes; far fewer can debug a hang at 3 a.m. when one rank goes silent and the rest stall on AllReduce. This module is the lower layer.

The discipline is the same as the kernels half of Mosaic: predict, then verify. Predict the bandwidth from the topology and the message size; verify with NCCL’s debug log; reconcile every gap. A senior distributed-training engineer can look at a 16-rank job’s NCCL_DEBUG=INFO output and tell you in 30 seconds whether the ring topology is rational, whether PXN is engaged, whether the hang is a shape mismatch or a cgroup OOM. This module builds that fluency.
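As a rough illustration of the "predict" step, here is a minimal sketch of the textbook alpha-beta cost model for a ring AllReduce. It is a simplification: it assumes a single ring with one uniform link speed, and it ignores tree algorithms, protocol overhead, and multi-channel effects, so NCCL's actual choices may differ. The function names, the 100 GB/s link speed, and the 5 µs hop latency are hypothetical values picked for the example. The bus-bandwidth correction factor 2(n-1)/n follows the convention used by nccl-tests.

```python
def ring_allreduce_seconds(message_bytes, n_ranks, link_bytes_per_s, hop_latency_s=0.0):
    """Alpha-beta estimate of ring AllReduce time (an idealized model, not NCCL's)."""
    if n_ranks < 2:
        return 0.0  # nothing to exchange with a single rank
    steps = 2 * (n_ranks - 1)        # (n-1) reduce-scatter steps + (n-1) all-gather steps
    chunk = message_bytes / n_ranks  # each step sends one 1/n slice per rank
    return steps * (hop_latency_s + chunk / link_bytes_per_s)


def algorithm_bandwidth(message_bytes, seconds):
    """Bytes of payload reduced per second, as seen by the caller."""
    return message_bytes / seconds


def bus_bandwidth(message_bytes, n_ranks, seconds):
    """Algorithm bandwidth scaled by 2(n-1)/n so it is comparable to link speed."""
    return algorithm_bandwidth(message_bytes, seconds) * 2 * (n_ranks - 1) / n_ranks


# Hypothetical 16-rank job: 1 GiB message, 100 GB/s per link, 5 us per hop.
size = 1 << 30
ranks = 16
link = 100e9
predicted = ring_allreduce_seconds(size, ranks, link, hop_latency_s=5e-6)
print(f"predicted time:   {predicted * 1e3:.3f} ms")
print(f"predicted bus BW: {bus_bandwidth(size, ranks, predicted) / 1e9:.1f} GB/s")
```

With zero hop latency, the predicted bus bandwidth in this model equals the link bandwidth exactly, which is why it is a convenient number to compare against measured results. A measured value well below the prediction is the gap to reconcile against the debug log.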

1 lesson, ~18 min total