Debug slow DDP training and rank stragglers¶
Use this guide when PyTorch DDP training is slow, uneven, or controlled by one slow rank.
TraceML compares worst-rank and median-rank timing in distributed runs. It can surface input stragglers, compute stragglers, mixed stragglers, and wait-heavy windows.
Run DDP with TraceML¶
For single-node DDP:
traceml run train.py --nproc-per-node=4
For multi-node DDP, use the same --run-name, --nnodes,
--nproc-per-node, and --master-addr on every node. See
Distributed Training.
On Slurm, use the Slurm guide instead of writing the node flags by hand.
What to look for¶
Start with the Step Time diagnosis.
| Diagnosis | Meaning |
|---|---|
INPUT STRAGGLER |
one rank has meaningfully more dataloader burden than a typical rank |
COMPUTE STRAGGLER |
one rank has meaningfully more forward, backward, or optimizer burden than a typical rank |
STRAGGLER |
input and compute are both materially uneven |
INPUT-BOUND |
input work is broad, not just one bad rank |
WAIT-HEAVY |
meaningful residual time is not attributed to dataloader, H2D, forward, backward, or optimizer work |
Then inspect:
Worst Rank- worst vs median timing
Dataloader FetchForwardBackwardOptimizer StepWait
How to triage the worst rank¶
If the diagnosis is INPUT STRAGGLER, inspect input loading on the worst rank:
- uneven shards or batch construction
- rank-local preprocessing jitter
- slow storage path on one host
- custom collation or tokenization on one rank
If the diagnosis is COMPUTE STRAGGLER, inspect model-side work on the worst
rank:
- uneven input shapes
- rank-local branching
- extra forward, backward, or optimizer work
- framework hooks or callbacks that run on one rank
If the diagnosis is STRAGGLER, reduce the problem one phase at a time. Start
with the phase that has the largest worst-vs-median gap.
If the diagnosis is WAIT-HEAVY, remember that TraceML reports wait as
residual time. It is not direct NCCL or all-reduce timing. Inspect logging,
checkpointing, validation, CPU stalls, framework orchestration, and unobserved
transfer paths before assuming the cause.
FSDP note¶
TraceML supports single-node multi-GPU DDP and FSDP in the current public docs. Multi-node DDP summary reports are supported. Multi-node FSDP uses the same distributed launch path as DDP, but should be validated in your environment.
Use this guide for rank-skew symptoms in FSDP runs, but do not treat it as FSDP-specific collective timing.
Compare a fix¶
After changing data loading, batching, model logic, or host placement, compare the old and new summaries:
traceml compare old_run/final_summary.json new_run/final_summary.json
Look for changes in total step time, worst-rank skew, input time, compute time, wait time, and diagnosis.