Debug low GPU utilization in PyTorch training¶

Use this guide when PyTorch training is slow and GPU utilization is low, moderate, uneven, or bursty.

Low GPU utilization is a symptom. TraceML helps decide whether the likely cause is input loading, host-to-device transfer, residual time, rank skew, memory pressure, or model-side compute behavior.

Confirm the symptom¶

Run your training script:

traceml run train.py

In the System section, TraceML can report GPU utilization symptoms as:

Structured kind	Text status	Meaning
`LOW_GPU_UTILIZATION`	`LOW GPU UTIL`	average GPU utilization was below 30%
`MODERATE_GPU_UTILIZATION`	`MODERATE GPU UTIL`	average GPU utilization was from 30% through 70%

Above 70%, TraceML does not emit a GPU-utilization issue unless another System rule fires.

Explain it with Step Time¶

After confirming low or moderate GPU utilization, read the Step Time diagnosis.

Step Time diagnosis	What to do
`INPUT-BOUND`	Inspect input loading, preprocessing, collation, and storage.
`INPUT STRAGGLER`	Inspect the slow input rank.
`H2D STRAGGLER`	Inspect host-to-device transfer on the worst rank.
`RESIDUAL STRAGGLER`	Inspect rank-local host-side work on the worst rank.
`RESIDUAL-HEAVY`	Inspect work outside traced phases, such as logging, checkpointing, validation, CPU stalls, framework orchestration, or unobserved transfers.
`COMPUTE-BOUND`	Inspect forward, backward, and optimizer time before changing the DataLoader.
`COMPUTE STRAGGLER` or `STRAGGLER`	Inspect rank skew and the called-out worst rank.
`BALANCED`	Compare against a known good run or use a heavier profiler for lower-level detail.

Low GPU utilization plus INPUT-BOUND is a strong signal to start with the input pipeline guide. Low GPU utilization by itself is not enough.

Checks that match TraceML evidence¶

If input time is high:

inspect DataLoader(num_workers=...)
inspect CPU preprocessing, decoding, tokenization, and collation
check slow storage or network filesystems
compare with a synthetic-data run

If H2D time is high:

check host-to-device transfer behavior
for CUDA training, check whether the input path uses pin_memory=True and non-blocking transfer where appropriate

If residual time is high:

inspect logging, checkpointing, validation, and framework work around the traced training step
inspect CPU or RAM pressure in the System and Process sections

If one rank is worse than the others:

use the DDP rank straggler guide
compare the worst rank with the median rank

Compare a fix¶

After changing the suspected cause, compare the before and after summaries:

traceml compare old_run/final_summary.json new_run/final_summary.json

Check whether GPU utilization, total step time, input time, residual time, or the primary diagnosis changed.

Debug low GPU utilization in PyTorch training¶

Confirm the symptom¶

Explain it with Step Time¶

Checks that match TraceML evidence¶

Compare a fix¶

Related¶