Data Parallel & DDP

Prereq: Backprop as a Graph. DDP is what wraps backprop when you have more than one GPU.

In a managed cloud, “scale out” means another instance behind the load balancer. The instances don’t know about each other. The load balancer hands one request per worker, the workers respond, the system scales horizontally. The trick that makes it look simple is that there’s no shared state — workers handle independent requests.

Multi-GPU training is not that. Every GPU runs the same step on a different slice of the same batch, and once they’re done with backward, they have to agree on the gradient before the optimizer can step — otherwise the model copies drift apart and the run diverges.

The agreement primitive is AllReduce: every GPU contributes its local gradient tensor, and every GPU receives the average across the whole pool. NCCL implements it as Ring-AllReduce — the bandwidth-optimal algorithm that moves gradients around the GPU ring in two passes (reduce-scatter, all-gather). PyTorch’s DistributedDataParallel (DDP) wraps your model in one line and injects async AllReduce calls into backward.

Two engineering tricks turn DDP from “parallel-but-slow” into “70%+ efficient up to ~32 GPUs”: bucketing (group thousands of small per-parameter AllReduces into a few large ones) and comm-compute overlap (start AllReduce on layer N’s gradients while still computing backward on layer N−1). Both are invisible from the user’s point of view — model = DDP(model) and the framework does the rest. But every distributed-training failure mode at scale traces back to one of these going wrong, so you need to know what they are.

TL;DR

  • Data parallel (DP) = each GPU holds a full copy of the model weights, processes a different slice of the batch, then averages gradients across GPUs at backward time.
  • PyTorch DDP is the canonical implementation. Wraps your model in one line; the framework adds AllReduce calls during backward.
  • The communication primitive is Ring-AllReduce — gradients flow around the GPU ring in two passes (reduce-scatter, all-gather). Cost: 2 × (N-1)/N × params × bytes_per_grad per GPU per step.
  • Gradient bucketing groups small parameter tensors into fixed-size buckets so the AllReduce traffic happens in a few large transfers instead of many tiny ones. ~10× higher achieved bandwidth.
  • Comm-compute overlap — start AllReduce on layer N’s gradients while still computing backward on layer N-1. This is what gets DDP from “parallel but not scaling” to “70%+ efficient up to ~32 GPUs”.

Why this matters

Every distributed training paradigm — TP, PP, FSDP, EP — composes with DDP at the outermost level. DDP is the foundation. Knowing what AllReduce costs, when bucketing helps, and how comm-compute overlap works is what lets you reason about why a multi-GPU training run is or isn’t scaling. The numbers — bytes-per-step, scaling efficiency at N GPUs — are the conversation language for any production-training engineer.

Mental model

Each GPU runs forward and backward independently on its own slice of the batch; the gradient AllReduce is the only point where they synchronize.

Concrete walkthrough

Wrap your model

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; torchrun sets LOCAL_RANK for each process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop is the same as single-GPU
for batch, target in dataloader:
    out = model(batch)
    loss = criterion(out, target)
    loss.backward()        # DDP injects AllReduce here
    optimizer.step()
    optimizer.zero_grad()

That’s it. The framework intercepts the autograd backward, accumulates gradients per parameter, and at appropriate moments launches an AllReduce that averages the gradients across all participating GPUs.
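
To make “intercepts the autograd backward” concrete, here is a minimal hand-rolled version of the hook mechanism — a sketch, not DDP’s actual implementation. It has no bucketing and no overlap management; install_naive_grad_sync is a hypothetical helper name, and it assumes the process group from the snippet above is already initialized.

# Conceptual sketch only: average each parameter's gradient as soon as
# autograd finishes accumulating it. Real DDP buckets these AllReduces
# and overlaps them with the rest of backward.
def install_naive_grad_sync(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if not param.requires_grad:
            continue
        def make_hook(p):
            def hook(*_):
                # Fires right after p.grad is fully accumulated (PyTorch >= 2.1)
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
            return hook
        param.register_post_accumulate_grad_hook(make_hook(param))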

Ring-AllReduce, in pictures

For N GPUs in a logical ring, an AllReduce of P parameters proceeds in two phases:

Reduce-Scatter: each GPU starts with the full P parameters’ worth of local gradients, split into N chunks of P/N each. Then for N-1 rounds:

  • Each GPU sends one chunk to the next GPU in the ring.
  • Each GPU receives one chunk from the previous GPU and adds it to its local copy.

After N-1 rounds, each GPU has the fully-summed gradient for one of the N chunks.

All-Gather: each GPU now circulates its fully-summed chunk around the ring, so that after another N-1 rounds every GPU holds every summed chunk.

Total bytes moved per GPU: 2 × (N-1)/N × P × bytes_per_param. For 8 GPUs, that’s about 1.75 × P × bytes_per_param.

This is bandwidth-optimal — no AllReduce algorithm can move fewer bytes per GPU in the large-message limit — and is what NCCL implements under the hood.
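
To see the schedule end to end, here is a toy simulation on plain Python lists. It is a sketch for checking the chunk bookkeeping and the traffic count, not a model of how NCCL actually moves data: the “GPUs” are list entries and a “send” is a list copy.

# Toy Ring-AllReduce, to sanity-check the two-phase schedule and the
# 2 x (N-1)/N traffic count. Illustrative only; NCCL pipelines the real
# sends/receives over hardware links.
N, P = 4, 8                                  # ring size, parameter count (P % N == 0)
grads = [[float(g + 1)] * P for g in range(N)]        # GPU g's local gradient
chunks = [[grads[g][c * P // N:(c + 1) * P // N] for c in range(N)]
          for g in range(N)]
sent = [0] * N                               # elements sent by each GPU

# Phase 1, reduce-scatter: after N-1 rounds GPU g owns the full sum of chunk (g+1) % N
for r in range(N - 1):
    for g in range(N):
        c, nxt = (g - r) % N, (g + 1) % N    # chunk GPU g sends, and its ring neighbor
        chunks[nxt][c] = [a + b for a, b in zip(chunks[nxt][c], chunks[g][c])]
        sent[g] += P // N

# Phase 2, all-gather: each GPU forwards its freshest complete chunk for N-1 rounds
for r in range(N - 1):
    for g in range(N):
        c, nxt = (g + 1 - r) % N, (g + 1) % N
        chunks[nxt][c] = list(chunks[g][c])
        sent[g] += P // N

expected = sum(range(1, N + 1))              # elementwise sum across all GPUs
assert all(v == expected for g in range(N) for ch in chunks[g] for v in ch)
print("elements sent per GPU:", sent[0], "== 2*(N-1)/N * P =", 2 * (N - 1) * P // N)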

Bucket the gradients

Naive DDP would call AllReduce once per parameter tensor. A large transformer has hundreds to thousands of named parameter tensors, many of them tiny (biases, norm scales), so that’s hundreds to thousands of small AllReduces per step — disastrous bandwidth utilization (~10% of peak NVLink), because each launch pays a fixed latency before any bytes flow.

DDP buckets parameters into fixed-size groups (default: 25 MB):

model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

Now the same gradients move in a few hundred bucket-sized AllReduces per step (a 7B model’s ~14 GB of BF16 gradients fill ~560 buckets of 25 MB) instead of one launch per tensor, and achieved bandwidth jumps to ~80–95% of peak. For most models, the default bucket size is fine; only tune it if profiling shows comm time dominating.
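
A back-of-envelope model shows why the call count matters. The 30 µs launch overhead and 150 GB/s effective bandwidth below are assumed, illustrative numbers, not measurements:

# Why fewer, larger AllReduces win: each collective pays a fixed launch
# overhead before any bytes move. All numbers here are assumptions.
LATENCY_S = 30e-6        # assumed per-AllReduce launch/sync overhead
BW = 150e9               # assumed effective bandwidth, bytes/s
GRAD_BYTES = 14e9        # 7B params x 2 bytes (BF16)

def total_ms(n_calls):
    return (n_calls * LATENCY_S + GRAD_BYTES / BW) * 1e3

for n_calls in (5_000, 560, 50):   # many tiny calls, 25 MB buckets, huge buckets
    print(f"{n_calls:>5} AllReduces: {total_ms(n_calls):6.1f} ms")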

Overlap comm with compute

A naive backward pass runs all backward ops, then does the AllReduce. Comm and compute serialize → wall time = compute + comm.

Real DDP overlaps them: as soon as a layer’s gradients are computed, its AllReduce launches asynchronously. While the AllReduce is in flight on the network, the GPU keeps computing the next layer’s backward.

# Conceptual; DDP does this internally
def backward_with_overlap():
    handles = []
    for layer in reversed(layers):
        grad = layer.backward()
        # Launch an async AllReduce once this layer fills its grad bucket
        if bucket_full(layer):
            handle = dist.all_reduce(bucket, async_op=True)
            handles.append(handle)
    # Backward keeps running while AllReduces are in flight; wait at the end
    for h in handles:
        h.wait()

Wall time becomes max(compute, comm) instead of compute + comm. For most models with reasonable network bandwidth, this hides most of the AllReduce — getting you from 50% scaling efficiency to 80%+.

Cost model

Per-step bytes per GPU for AllReduce:

bytes_per_step = 2 × (N-1)/N × params × bytes_per_grad

For a 7B model in BF16 across 8 GPUs:

bytes = 2 × 7/8 × 7e9 × 2 ≈ 24.5 GB / GPU / step

At 50 GB/s of effective inter-GPU bandwidth (roughly what a 400 Gbps cross-node link delivers), that’s ~500 ms of pure communication per step. For a step with 200 ms of compute, comm dominates and DDP scales poorly. For larger models or tight networks, you graduate to FSDP, TP, or PP — covered in the next lessons.

The crossover for typical configs: DDP works well up to ~32 GPUs; beyond, communication starts to dominate and you need to shard.

When DDP wins

  • Model fits on a single GPU. If the model + activations + optimizer state don’t fit, you can’t replicate; you need FSDP/ZeRO.
  • Batch size scales well. Some training tasks (small models, small dataset) hit batch-size diminishing returns long before DDP saturates.
  • Network is good. NVLink within a node is fine; cross-node Ethernet at 25 Gbps is where DDP starts to break.

Practical knobs

DDP(
    model,
    device_ids=[local_rank],
    bucket_cap_mb=25,                # default 25; rarely needs tuning
    find_unused_parameters=False,    # set True only if the graph has conditional branches
    gradient_as_bucket_view=True,    # saves memory: grads become views into comm buckets
    static_graph=True,               # only if the graph never changes; enables better overlap
)

find_unused_parameters=True is a footgun — it adds bookkeeping overhead even when not needed. Default to False; only enable if you actually have conditional graph branches.
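
One more knob worth knowing in the same spirit: when you accumulate gradients over several micro-batches, DDP’s no_sync() context manager skips the AllReduce on all but the last micro-batch, so you pay for one sync per optimizer step instead of one per backward. A sketch, reusing the dataloader / criterion / optimizer placeholders from the training loop above:

from contextlib import nullcontext

ACCUM_STEPS = 4                      # micro-batches per optimizer step (example value)
for i, (batch, target) in enumerate(dataloader):
    sync_now = (i + 1) % ACCUM_STEPS == 0
    # no_sync() suppresses DDP's gradient AllReduce for this backward
    with nullcontext() if sync_now else model.no_sync():
        loss = criterion(model(batch), target) / ACCUM_STEPS
        loss.backward()              # AllReduce fires only when sync_now
    if sync_now:
        optimizer.step()
        optimizer.zero_grad()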

Run it in your browser — DDP cost model simulator

Python — plug in your model and network; see expected per-step communication time and scaling efficiency.
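
As a standalone stand-in for the widget, here is a minimal version of the cost model. The bandwidth schedule and the 200 ms compute time are illustrative assumptions (effective bandwidth drops as the ring spans more nodes); swap in your own numbers.

# Standalone sketch of the DDP cost model; all numbers are assumptions.
PARAMS = 7e9              # 7B model
BYTES_PER_GRAD = 2        # BF16 gradients
COMPUTE_MS = 200.0        # assumed per-step compute time

def comm_ms(n_gpus, bw_bytes_per_s):
    # Ring-AllReduce traffic per GPU: 2 * (N-1)/N * params * bytes_per_grad
    comm_bytes = 2 * (n_gpus - 1) / n_gpus * PARAMS * BYTES_PER_GRAD
    return comm_bytes / bw_bytes_per_s * 1e3

# Assumed effective bandwidth per ring size: NVLink inside one node,
# progressively slower as the ring spans more nodes.
for n_gpus, bw in [(8, 200e9), (16, 100e9), (32, 50e9), (64, 25e9)]:
    c = comm_ms(n_gpus, bw)
    step = max(COMPUTE_MS, c)        # perfect comm-compute overlap
    eff = COMPUTE_MS / step
    print(f"{n_gpus:>3} GPUs @ {bw/1e9:>4.0f} GB/s: comm {c:6.0f} ms, "
          f"step {step:6.0f} ms, efficiency {eff:5.1%}")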

You’ll see efficiency hold above 80% up to ~16 GPUs for the 7B, then collapse beyond ~32 GPUs as comm time exceeds compute. That’s exactly the regime where you stop using pure DDP and graduate to FSDP / TP.

Quick check

Fill in the blank
The collective operation that averages gradients across all GPUs in DDP:
Two words; reduce + broadcast in one round.
Quick check
A team's 7B model trains at 50% scaling efficiency on 16 GPUs (NVLink, intra-node). They expected 80%+. Most likely first thing to check:

Key takeaways

  1. DDP replicates weights, splits the batch, AllReduces gradients. That’s the entire algorithm.
  2. Ring-AllReduce moves 2(N-1)/N × params × bytes_per_grad per GPU per step. Memorize this; it’s the bandwidth cost.
  3. Gradient bucketing → ~10× higher achieved bandwidth. Default 25 MB; rarely needs tuning.
  4. Comm-compute overlap turns serial into parallel. Wall time becomes max(compute, comm) instead of compute + comm.
  5. DDP works well up to ~32 GPUs for typical models. Beyond, graduate to FSDP / TP / PP.

Go deeper