Find why PyTorch training is slow

Use this guide when PyTorch training is slow and you do not yet know whether the bottleneck is input loading, GPU utilization, distributed rank skew, memory growth, residual time, or a run-to-run regression.

This is a triage guide. It points to focused guides instead of repeating every TraceML diagnosis.

Run one summary

If your script is not instrumented yet, start with the Quickstart.

Run your training script:

traceml run train.py

TraceML writes:

logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt

Start with the final summary, then follow the branch that matches your diagnosis or symptom.
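To pull the diagnosis labels out of the JSON summary programmatically, a minimal sketch like the one below may help. It deliberately assumes nothing about the layout of final_summary.json: it walks the whole document and reports every string value that equals one of the labels named in this guide. The function names `find_labels` and `labels_in_summary` are hypothetical helpers, not part of TraceML.

```python
import json

# Labels named in this guide. The JSON layout itself is not assumed.
KNOWN_LABELS = {
    "INPUT-BOUND", "COMPUTE-BOUND", "RESIDUAL-HEAVY", "BALANCED",
    "LOW_GPU_UTILIZATION", "MODERATE_GPU_UTILIZATION",
    "INPUT STRAGGLER", "COMPUTE STRAGGLER", "H2D STRAGGLER",
    "RESIDUAL STRAGGLER", "STRAGGLER",
    "MEMORY CREEP", "MEMORY RISING",
}


def find_labels(node, path=""):
    """Yield (path, label) for every string leaf equal to a known label."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_labels(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_labels(value, f"{path}/{index}")
    elif isinstance(node, str) and node.strip() in KNOWN_LABELS:
        yield path, node.strip()


def labels_in_summary(summary_path):
    """Load a summary file and list the known labels found anywhere in it."""
    with open(summary_path, encoding="utf-8") as handle:
        return list(find_labels(json.load(handle)))
```

Call it as `labels_in_summary("logs/<run_name>/final_summary.json")` with your own run name substituted. If the summary embeds labels inside longer strings, this exact-match search will miss them, so treat an empty result as "check the text summary", not as "no diagnosis".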

Choose the right path

| What you see | Start here |
| --- | --- |
| Step Time says `INPUT-BOUND` | Find Input Pipeline Bottlenecks |
| System says `LOW_GPU_UTILIZATION` or `MODERATE_GPU_UTILIZATION` | Debug Low GPU Utilization |
| Step Time says `INPUT STRAGGLER`, `COMPUTE STRAGGLER`, `H2D STRAGGLER`, `RESIDUAL STRAGGLER`, or `STRAGGLER` | Debug DDP Rank Stragglers |
| Step Memory says `MEMORY CREEP` or `MEMORY RISING` | Find PyTorch Memory Creep |
| A recent change made the run slower | Compare Runs |
| Step Time says `COMPUTE-BOUND` or `RESIDUAL-HEAVY` | How to Read TraceML Output |
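The table above can also be expressed as a lookup, which is handy if you triage many runs in a script. This is only a sketch: `GUIDE_FOR_LABEL` and `next_guides` are hypothetical names, and the "recent change made the run slower" row is left out because it is a symptom you observe, not a label in the summary.

```python
# Label-to-guide routing copied from the table in this section.
GUIDE_FOR_LABEL = {
    "INPUT-BOUND": "Find Input Pipeline Bottlenecks",
    "LOW_GPU_UTILIZATION": "Debug Low GPU Utilization",
    "MODERATE_GPU_UTILIZATION": "Debug Low GPU Utilization",
    "INPUT STRAGGLER": "Debug DDP Rank Stragglers",
    "COMPUTE STRAGGLER": "Debug DDP Rank Stragglers",
    "H2D STRAGGLER": "Debug DDP Rank Stragglers",
    "RESIDUAL STRAGGLER": "Debug DDP Rank Stragglers",
    "STRAGGLER": "Debug DDP Rank Stragglers",
    "MEMORY CREEP": "Find PyTorch Memory Creep",
    "MEMORY RISING": "Find PyTorch Memory Creep",
    "COMPUTE-BOUND": "How to Read TraceML Output",
    "RESIDUAL-HEAVY": "How to Read TraceML Output",
}


def next_guides(labels):
    """Return the guides to read, in first-seen order and without duplicates."""
    guides = []
    for label in labels:
        guide = GUIDE_FOR_LABEL.get(label)
        if guide is not None and guide not in guides:
            guides.append(guide)
    return guides
```

Labels that the table does not route, such as BALANCED, are skipped silently.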

Quick interpretation

INPUT-BOUND means input wait is taking a large share of the typical step. Confirm the input path before tuning model compute.
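One way to confirm the input path independently of TraceML is to time how long the loop waits for each batch versus how long it spends in the step. The sketch below is not how TraceML measures input wait; `input_wait_share` is a hypothetical helper that accepts any iterable (a `DataLoader` works) and any callable for the step.

```python
import time


def input_wait_share(batches, train_step, max_steps=50):
    """Return the fraction of wall time spent waiting for the next batch.

    Caveat: CUDA kernels launch asynchronously, so on a GPU the step should
    end with torch.cuda.synchronize() or the step time will be understated.
    """
    wait = 0.0
    work = 0.0
    iterator = iter(batches)
    for _ in range(max_steps):
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            break
        fetched = time.perf_counter()
        train_step(batch)
        done = time.perf_counter()
        wait += fetched - start
        work += done - fetched
    total = wait + work
    return wait / total if total > 0 else 0.0
```

A share close to 1.0 suggests the loop mostly waits on data. A share near 0.0 suggests the input path keeps up and the time is going elsewhere.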

LOW_GPU_UTILIZATION and MODERATE_GPU_UTILIZATION are system-level symptoms. They say the GPU was not fully busy, not why it was not fully busy. Pair them with Step Time.

INPUT STRAGGLER, COMPUTE STRAGGLER, H2D STRAGGLER, RESIDUAL STRAGGLER, and STRAGGLER are distributed rank-skew signals. Inspect the called-out worst rank and compare it with the median rank.
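Comparing the worst rank with the median rank comes down to a small calculation. The sketch below assumes you already have a per-rank step time in milliseconds, for example copied by hand from the summary. `rank_skew` and its input format are hypothetical, and the result is not a TraceML threshold.

```python
from statistics import median


def rank_skew(step_ms_by_rank):
    """Return (worst_rank, worst_ms, median_ms, worst_over_median)."""
    if not step_ms_by_rank:
        raise ValueError("need at least one rank")
    median_ms = median(step_ms_by_rank.values())
    # The worst rank is the one with the largest step time.
    worst_rank = max(step_ms_by_rank, key=step_ms_by_rank.get)
    worst_ms = step_ms_by_rank[worst_rank]
    ratio = worst_ms / median_ms if median_ms > 0 else float("inf")
    return worst_rank, worst_ms, median_ms, ratio
```

For the made-up input `{0: 100.0, 1: 102.0, 2: 98.0, 3: 150.0}`, the worst rank is 3 and the median is 101.0 ms, so rank 3 runs at roughly 1.49 times the median.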

MEMORY CREEP and MEMORY RISING are step-memory trend signals. Inspect retained tensors, caches, and per-step state that may stay alive.
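To sanity-check a memory trend yourself, you can fit a straight line through per-step memory samples and look at the slope. This is a rough sketch, not TraceML's detection logic. `memory_slope` is a hypothetical helper, and the samples could come from something like `torch.cuda.memory_allocated()` recorded once per step.

```python
def memory_slope(samples_mb):
    """Least-squares slope of memory (MB) against step index, in MB per step."""
    n = len(samples_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator
```

A slope that stays clearly positive after warmup is a hint that something is being retained each step. A slope near zero with a high absolute level points to a large but stable footprint instead.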

COMPUTE-BOUND means forward, backward, or optimizer time dominates the observed step. Use TraceML to identify the hot phase, then use an operator-level profiler if you need kernel or operator detail.

RESIDUAL-HEAVY means a large share of the step is residual time: time not attributed to input wait, H2D, forward, backward, or optimizer work. Inspect logging, checkpointing, validation, CPU-side stalls, framework orchestration, or unobserved transfers.
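The relationship between the attributed phases and residual time can be written down as simple arithmetic. In this sketch the phase names, the `phase_shares` helper, and the input format are assumptions for illustration, not TraceML's schema.

```python
# Phases this guide lists as attributed. Everything else counts as residual.
PHASES = ("input_wait", "h2d", "forward", "backward", "optimizer")


def phase_shares(step_ms, phase_ms):
    """Return each phase's share of the step, plus the residual share."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    attributed = sum(phase_ms.get(name, 0.0) for name in PHASES)
    # Clamp at zero in case measured phases slightly exceed the step time.
    residual = max(step_ms - attributed, 0.0)
    shares = {name: phase_ms.get(name, 0.0) / step_ms for name in PHASES}
    shares["residual"] = residual / step_ms
    return shares
```

With made-up numbers, a 200 ms step with 10 ms input wait, 5 ms H2D, 40 ms forward, 60 ms backward, and 15 ms optimizer leaves 70 ms unattributed, so residual is the largest share at 0.35.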

What not to assume

- Low GPU utilization alone does not prove an input pipeline bottleneck.
- A distributed slowdown is not always caused by NCCL. TraceML currently reports rank skew and residual time, not explicit collective timing. In FSDP, forward time may include parameter all-gather wait.
- BALANCED means no clear bottleneck in the observed window. It does not prove the run is globally optimal.
- A single slow run is not conclusive on its own. Compare it with a known good final_summary.json before drawing conclusions.
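If you want a quick look at what changed between a known good summary and a slow one before reading the Compare Runs guide, a generic numeric diff is one option. The sketch below makes no assumption about the layout of final_summary.json: it compares every numeric value that appears at the same path in both documents. `numeric_leaves` and `compare_summaries` are hypothetical helpers that take already-parsed JSON.

```python
def numeric_leaves(node, path=""):
    """Yield (path, value) for every numeric leaf in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from numeric_leaves(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from numeric_leaves(value, f"{path}/{index}")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield path, float(node)


def compare_summaries(baseline, candidate):
    """Map each shared path whose value changed to (old, new, delta)."""
    old = dict(numeric_leaves(baseline))
    new = dict(numeric_leaves(candidate))
    return {
        path: (old[path], new[path], new[path] - old[path])
        for path in sorted(old.keys() & new.keys())
        if old[path] != new[path]
    }
```

Paths that exist in only one of the two summaries are ignored here, so a changed schema between versions would need separate handling.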
