Find why PyTorch training is slow
Use this guide when PyTorch training is slow and you do not yet know whether the bottleneck is input loading, GPU utilization, distributed rank skew, memory growth, wait time, or a run-to-run regression.
This is a triage guide. It points to focused guides instead of repeating every TraceML diagnosis.
Run one summary
If your script is not instrumented yet, start with the Quickstart.
Run your training script:
traceml run train.py
TraceML writes:
logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
Start with the final summary, then follow the branch that matches your diagnosis or symptom.
Choose the right path
| What you see | Start here |
|---|---|
| Step Time says INPUT-BOUND | Find DataLoader Bottlenecks |
| System says LOW_GPU_UTILIZATION or MODERATE_GPU_UTILIZATION | Debug Low GPU Utilization |
| Step Time says INPUT STRAGGLER, COMPUTE STRAGGLER, or STRAGGLER | Debug DDP Rank Stragglers |
| Step Memory says MEMORY CREEP or MEMORY RISING | Find PyTorch Memory Creep |
| A recent change made the run slower | Compare Runs |
| Step Time says COMPUTE-BOUND or WAIT-HEAVY | How to Read TraceML Output |
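The routing in the table can also be applied mechanically. The sketch below assumes nothing about the layout of final_summary.json: it walks every string value in the parsed file and looks for the labels listed in the table. The function names (`find_labels`, `suggest_guides`, `suggest_guides_from_file`) are hypothetical helpers and are not part of TraceML.

```python
import json
import re
from pathlib import Path

# Labels and guide names copied from the table in this guide.
GUIDES = {
    "INPUT-BOUND": "Find DataLoader Bottlenecks",
    "LOW_GPU_UTILIZATION": "Debug Low GPU Utilization",
    "MODERATE_GPU_UTILIZATION": "Debug Low GPU Utilization",
    "INPUT STRAGGLER": "Debug DDP Rank Stragglers",
    "COMPUTE STRAGGLER": "Debug DDP Rank Stragglers",
    "STRAGGLER": "Debug DDP Rank Stragglers",
    "MEMORY CREEP": "Find PyTorch Memory Creep",
    "MEMORY RISING": "Find PyTorch Memory Creep",
    "COMPUTE-BOUND": "How to Read TraceML Output",
    "WAIT-HEAVY": "How to Read TraceML Output",
}

# Longest labels first so "INPUT STRAGGLER" wins over bare "STRAGGLER".
_LABELS = sorted(GUIDES, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<![A-Z_])("
    + "|".join(re.escape(label) for label in _LABELS)
    + r")(?![A-Z_])"
)


def find_labels(node):
    """Yield every known label found in any string value of a parsed JSON tree."""
    if isinstance(node, str):
        yield from _PATTERN.findall(node)
    elif isinstance(node, dict):
        for value in node.values():
            yield from find_labels(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_labels(item)


def suggest_guides(summary):
    """Map each label found in the summary to its guide, in first-seen order."""
    suggestions = {}
    for label in find_labels(summary):
        suggestions.setdefault(label, GUIDES[label])
    return suggestions


def suggest_guides_from_file(path):
    """Load a summary file (schema not assumed) and suggest guides for it."""
    return suggest_guides(json.loads(Path(path).read_text()))
```

Calling `suggest_guides_from_file("logs/<run_name>/final_summary.json")` with your own run name returns a label-to-guide mapping. An empty result means none of the labels above appeared in any string field.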
Quick interpretation¶
INPUT-BOUND means dataloader or input work is taking a large share of the
typical step. Confirm the input path before tuning model compute.
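One way to confirm the input path independently is to time how long the loop waits for the next batch compared with how long the step itself takes. This is a minimal sketch, not how TraceML measures it: `input_share` and `step_fn` are hypothetical names, and `batches` can be any iterable, including a `torch.utils.data.DataLoader`.

```python
import time


def input_share(batches, step_fn, max_steps=50):
    """Return the fraction of wall time spent waiting for the next batch.

    If step_fn launches CUDA work, call torch.cuda.synchronize() inside it;
    otherwise asynchronous kernel time is mis-attributed to later fetches.
    """
    wait_s = 0.0
    step_s = 0.0
    iterator = iter(batches)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is input wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # forward, backward, optimizer, and so on
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
    total = wait_s + step_s
    return wait_s / total if total > 0 else 0.0
```

A share that stays high after the first few warm-up steps is consistent with an input-bound run. A share near zero suggests looking elsewhere before touching the loader.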
LOW_GPU_UTILIZATION and MODERATE_GPU_UTILIZATION are system-level symptoms.
They say the GPU was not fully busy, not why it was not fully busy. Pair them
with Step Time.
INPUT STRAGGLER, COMPUTE STRAGGLER, and STRAGGLER are distributed
signals. Inspect the called-out worst rank and compare it with the median rank.
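The worst-versus-median comparison is simple arithmetic once you have one typical step time per rank. The sketch below assumes a hypothetical mapping of rank to step time in milliseconds; how you extract those numbers from the summary depends on its layout, which is not assumed here.

```python
from statistics import median


def rank_skew(step_ms_by_rank):
    """Compare the slowest rank with the median rank.

    step_ms_by_rank is a hypothetical {rank: typical step time in ms} mapping.
    """
    if not step_ms_by_rank:
        raise ValueError("need at least one rank")
    worst_rank = max(step_ms_by_rank, key=step_ms_by_rank.get)
    worst_ms = step_ms_by_rank[worst_rank]
    median_ms = median(step_ms_by_rank.values())
    # Relative skew: how much slower the worst rank is than the median rank.
    skew = (worst_ms - median_ms) / median_ms if median_ms > 0 else 0.0
    return {
        "worst_rank": worst_rank,
        "worst_ms": worst_ms,
        "median_ms": median_ms,
        "skew": skew,
    }
```

The median is used instead of the mean so that the straggler itself does not pull the baseline upward.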
MEMORY CREEP and MEMORY RISING are step-memory trend signals. Inspect
retained tensors, caches, and per-step state that may stay alive.
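To check a suspected trend yourself, you can fit a straight line to per-step memory samples and look at the slope. This is an illustrative sketch rather than TraceML's own detection logic. The samples could come from, for example, `torch.cuda.memory_allocated()` recorded at the end of each step; `memory_slope` is a hypothetical helper.

```python
def memory_slope(samples_bytes):
    """Least-squares slope of memory versus step index, in bytes per step."""
    n = len(samples_bytes)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2  # mean of step indices 0..n-1
    mean_y = sum(samples_bytes) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_bytes))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

A slope that stays positive across a long window, after warm-up and allocator caching have settled, is worth investigating. A slope near zero with occasional spikes usually points to transient peaks instead of retained state.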
COMPUTE-BOUND means forward, backward, or optimizer time dominates the
observed step. Use TraceML to choose the hot phase, then use an operator-level
profiler if you need kernel or operator detail.
WAIT-HEAVY is residual time not attributed to dataloader, H2D, forward,
backward, or optimizer work. Inspect logging, checkpointing, validation,
CPU-side stalls, framework orchestration, or unobserved transfers.
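The residual definition above can be written out as a subtraction. The sketch assumes hypothetical phase keys (`dataloader`, `h2d`, `forward`, `backward`, `optimizer`); the actual field names in the summary may differ.

```python
# Hypothetical phase keys; adjust them to the fields in your summary.
PHASES = ("dataloader", "h2d", "forward", "backward", "optimizer")


def residual_wait_ms(step_ms, phase_ms):
    """Step time not attributed to any measured phase, clamped at zero."""
    attributed = sum(phase_ms.get(name, 0.0) for name in PHASES)
    return max(step_ms - attributed, 0.0)
```

For example, a 100 ms step with 90 ms attributed to the five phases leaves 10 ms of residual wait.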
What not to assume
- Low GPU utilization alone does not prove a DataLoader bottleneck.
- A DDP slowdown is not always NCCL. TraceML currently reports rank skew and residual wait, not explicit collective timing.
- BALANCED means no clear bottleneck in the observed window. It does not prove the run is globally optimal.
- A single slow run is easier to trust after comparing it with a known good final_summary.json.
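For a quick comparison against a known good summary before opening the Compare Runs guide, a generic diff of numeric fields is often enough. The sketch below does not assume any particular schema: it flattens both files into dotted paths and reports numbers that moved by at least a threshold. `flatten_numbers` and `numeric_changes` are hypothetical helpers, not TraceML functions.

```python
import json
from pathlib import Path


def flatten_numbers(node, prefix=""):
    """Flatten nested JSON into {dotted.path: number}, skipping other leaves."""
    out = {}
    if isinstance(node, dict):
        for key, value in node.items():
            out.update(flatten_numbers(value, f"{prefix}{key}."))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            out.update(flatten_numbers(value, f"{prefix}{index}."))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        out[prefix.rstrip(".")] = node
    return out


def numeric_changes(baseline, candidate, min_ratio=0.10):
    """Return {path: (old, new)} for shared numeric fields that moved by at least min_ratio."""
    old, new = flatten_numbers(baseline), flatten_numbers(candidate)
    changes = {}
    for path in sorted(old.keys() & new.keys()):
        a, b = old[path], new[path]
        if a != 0 and abs(b - a) / abs(a) >= min_ratio:
            changes[path] = (a, b)
    return changes


def compare_files(baseline_path, candidate_path, min_ratio=0.10):
    """Load two summary files and diff their numeric fields."""
    baseline = json.loads(Path(baseline_path).read_text())
    candidate = json.loads(Path(candidate_path).read_text())
    return numeric_changes(baseline, candidate, min_ratio)
```

Fields that exist in only one file are ignored, so a change in the summary layout between versions shows up as fewer compared paths instead of an error.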