01 · Stand up a 2-GPU box on Modal or Lambda (45 min)
Spin up an instance with 2× A100-40GB (Lambda on-demand or a Modal sandbox). SSH in. Verify `nvidia-smi` shows two GPUs and `torch.cuda.device_count() == 2`. Save a `Makefile` so you can rebuild the env in one command.
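A minimal launch-sanity sketch, assuming a stock PyTorch build with NCCL; the file name and the `NCCL_DEBUG=INFO` hint are illustrative:

```python
# verify_gpus.py -- run with: NCCL_DEBUG=INFO torchrun --nproc_per_node=2 verify_gpus.py
import os
import torch
import torch.distributed as dist

def main():
    assert torch.cuda.device_count() == 2, "expected 2 visible GPUs"
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # LOCAL_RANK is set by torchrun
    dist.init_process_group("nccl")        # NCCL init banner shows up with NCCL_DEBUG=INFO
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)                     # exercise the interconnect once
    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()}, all_reduce result={t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```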
checkpoint: `nvidia-smi` shows two GPUs at <5% util at idle. The NCCL banner appears when you run a minimal `init_process_group('nccl')` script under torchrun (e.g. the sketch above).
watch out: On Lambda the two GPUs may not have NVLink (some on-demand instances are PCIe-only). NVLink ≈ 600 GB/s, PCIe-4 ≈ 32 GB/s — the FSDP collective hot path moves at PCIe speed and your ceiling drops. Check `nvidia-smi topo -m`. If you see PIX/SYS instead of NV2/NV4, expect 30–40% efficiency, not 70%; pick a Modal SXM4 instance instead.
02 · Single-GPU baseline — measure tokens/sec/GPU (60 min)
Take nanoGPT-1.3B (or any 1.3B GPT config — 24 layers, 16 heads, d=2048). Train on 1 GPU with bf16 and a sequence length of 2048. Record steady-state tokens/sec. This is the denominator for efficiency.
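A sketch of the measurement loop; `train_step()` and `loader` stand in for the train-step helper and data loader you already have:

```python
import time
import torch

def tokens_per_sec(train_step, loader, warmup: int = 20, active: int = 50) -> float:
    """Time `active` steady-state steps after `warmup` steps and return tokens/sec."""
    it = iter(loader)
    for _ in range(warmup):
        train_step(next(it))
    torch.cuda.synchronize()               # drain queued GPU work before starting the clock
    tokens = 0
    start = time.perf_counter()
    for _ in range(active):
        batch = next(it)                   # assumed shape: (batch_size, seq_len) token ids
        train_step(batch)
        tokens += batch.numel()
    torch.cuda.synchronize()
    return tokens / (time.perf_counter() - start)
```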
checkpoint: Stable tok/s number, e.g. ~3.5K tok/s on A100-40GB. No OOM at batch 4.
watch out: Running with `torch.compile` on the baseline but not on FSDP (or vice versa) makes the comparison meaningless. Decide upfront — both compiled or both eager — and stick to it.
03 · Wrap the model in FSDP2 — the right way (120 min)
Instead of wrapping the model in a parent class, call `fully_shard()` from `torch.distributed._composable.fsdp`. Shard each transformer block individually (per-block sharding, the FSDP2 idiom) and the embedding layer separately. Use `MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)`.
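A sketch of the wrapping, assuming a nanoGPT-style module tree with `blocks` and `embed` attributes (adjust the names to your model); the final root-level call is the usual FSDP2 pattern for grouping the parameters no block owns:

```python
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy

def shard_model(model: nn.Module) -> nn.Module:
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,    # all-gather params and run compute in bf16
        reduce_dtype=torch.float32,    # reduce-scatter gradients in fp32
    )
    for block in model.blocks:         # one shard group (and one AllGather) per block
        fully_shard(block, mp_policy=mp_policy)
    fully_shard(model.embed, mp_policy=mp_policy)   # embedding sharded on its own
    fully_shard(model, mp_policy=mp_policy)         # root call picks up the leftovers
    return model
```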
checkpoint: `torchrun --nproc_per_node=2 train.py` runs without error; both GPUs show ~100% util during forward+backward; loss curve matches single-GPU baseline within noise for the first 100 steps.
watch out: FSDP2 is *not* `FullyShardedDataParallel` (FSDP1 — the old class). The new API is `fully_shard()`, called per-module, no class wrap. If you find yourself writing `auto_wrap_policy`, you're on the legacy API. The 2024 PyTorch docs still mostly show FSDP1; use the FSDP2 examples in TorchTitan as the reference.
04 · Capture a profiler trace — find the bandwidth bar (90 min)
Capture `torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], schedule=...)` over 5 steady-state iterations and `export_chrome_trace("trace.json")`. Open the trace in Perfetto. Find the AllGather (forward/backward param fetch) and ReduceScatter (gradient sync) bars on the NCCL stream. Note their wall-clock cost vs. the compute kernels.
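A sketch of the capture; `train_step()` is a placeholder for one forward/backward/optimizer iteration:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def save_trace(prof):
    prof.export_chrome_trace("trace.json")          # open in Perfetto or chrome://tracing

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=2, warmup=2, active=5),  # 5 measured steady-state steps
    on_trace_ready=save_trace,
    record_shapes=True,    # lets you map each AllGather back to a layer
    with_stack=True,       # ditto -- both inflate the trace size considerably
) as prof:
    for _ in range(2 + 2 + 5):
        train_step()       # placeholder train step
        prof.step()        # advance the profiler schedule
```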
checkpoint: Trace screenshot annotated with: (a) the AllGather before each block's forward, (b) the ReduceScatter after each block's backward, (c) the % of step time that is collectives vs. compute.
watch out: Without `record_shapes=True` and `with_stack=True`, you can't tell which AllGather is for which layer. With them, the trace file balloons to 100 MB+. Profile *one* step in detail; profile 5 for the timing numbers.
05 · Tune for the 70% efficiency target (120 min)
Turn the dials that matter for 2-GPU FSDP2: keep `reshard_after_forward=True` only on the big blocks, enable explicit forward prefetch via `set_modules_to_forward_prefetch()` (the FSDP2 counterpart of FSDP1's `forward_prefetch=True`), turn on activation checkpointing, and raise the per-GPU batch size (the kingpin of compute/comm overlap). Re-run and re-measure tok/s/GPU.
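A sketch of the tuning pass, building on the step-03 wrapper; which blocks get activation checkpointing (`ckpt_every`) and which get explicit prefetch is illustrative, not a recipe:

```python
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import checkpoint_wrapper

def shard_and_tune(model: nn.Module, ckpt_every: int = 2) -> nn.Module:
    mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
    sharded_blocks = []
    for i, block in enumerate(model.blocks):
        if i % ckpt_every == 0:
            block = checkpoint_wrapper(block)  # AC on a subset; coverage is the knob to plot
            model.blocks[i] = block
        fully_shard(block, mp_policy=mp_policy, reshard_after_forward=True)
        sharded_blocks.append(block)
    fully_shard(model.embed, mp_policy=mp_policy)
    fully_shard(model, mp_policy=mp_policy)
    # explicit forward prefetch: start block i+1's AllGather during block i's compute
    for cur, nxt in zip(sharded_blocks[:-1], sharded_blocks[1:]):
        cur.set_modules_to_forward_prefetch([nxt])
    return model
```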
checkpoint: tok/s with 2 GPUs / (tok/s with 1 GPU × 2) ≥ 0.70. Document which knob bought you each percentage point.
watch out: Activation checkpointing trades compute for memory; if you over-apply it, comm overlap improves but compute time grows and net throughput drops. The plot showing efficiency vs. activation-checkpoint coverage is the most useful artifact in this whole capstone.
06 · Repo + write-up (60 min)
GitHub repo with: `train.py`, `bench.py`, the annotated Perfetto screenshot, and a 1-page `README.md` showing the efficiency table (1 GPU, 2 GPU eager, 2 GPU + each tuning step). Push.
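A sketch of the table helper `bench.py` could print; run names and numbers come from your own measurements, nothing here is a reference result:

```python
def efficiency_table(baseline_1gpu_tps: float, runs: dict[str, float]) -> str:
    """Format (config, tok/s, scaling efficiency) rows using the step-05 formula."""
    rows = [f"{'config':<40}{'tok/s':>10}{'efficiency':>12}"]
    rows.append(f"{'1 GPU baseline':<40}{baseline_1gpu_tps:>10.0f}{'':>12}")
    for name, tps in runs.items():
        eff = tps / (baseline_1gpu_tps * 2)    # tok/s with 2 GPUs / (1-GPU tok/s x 2)
        rows.append(f"{name:<40}{tps:>10.0f}{eff:>12.2f}")
    return "\n".join(rows)
```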
checkpoint: A reader can clone, run `make`, get the same numbers within 5%, and explain each row of your efficiency table.