Debug low GPU utilization in PyTorch training
Use this guide when PyTorch training is slow and GPU utilization is low, moderate, uneven, or bursty.
Low GPU utilization is a symptom, not a cause. TraceML helps you decide whether the likely cause is input loading, host-to-device transfer, wait time, rank skew, memory pressure, or model-side compute behavior.
Confirm the symptom
Run your training script:
traceml run train.py
In the System section, TraceML can report GPU utilization symptoms as:
| Structured kind | Text status | Meaning |
|---|---|---|
| LOW_GPU_UTILIZATION | LOW GPU UTIL | average GPU utilization was below 30% |
| MODERATE_GPU_UTILIZATION | MODERATE GPU UTIL | average GPU utilization was from 30% through 70% |
Above 70%, TraceML does not emit a GPU-utilization issue unless another System rule fires.
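The thresholds in the table above can be read as a simple rule. The sketch below only illustrates that rule; it is not TraceML's implementation, and the function name `classify_gpu_utilization` is hypothetical.

```python
from typing import Optional


def classify_gpu_utilization(avg_util_percent: float) -> Optional[str]:
    """Map an average GPU utilization (0-100) to the structured kind in the table.

    Hypothetical helper that mirrors the documented thresholds only.
    """
    if avg_util_percent < 30:
        return "LOW_GPU_UTILIZATION"
    if avg_util_percent <= 70:
        return "MODERATE_GPU_UTILIZATION"
    return None  # above 70%: this rule reports no GPU-utilization issue
```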
Explain it with Step Time
After confirming low or moderate GPU utilization, read the Step Time diagnosis.
| Step Time diagnosis | What to do |
|---|---|
| INPUT-BOUND | Inspect the DataLoader and input path. |
| INPUT STRAGGLER | Inspect the slow input rank. |
| WAIT-HEAVY | Inspect work outside traced phases, such as logging, checkpointing, validation, CPU stalls, framework orchestration, or unobserved transfers. |
| COMPUTE-BOUND | Inspect forward, backward, and optimizer time before changing the DataLoader. |
| COMPUTE STRAGGLER or STRAGGLER | Inspect rank skew and the called-out worst rank. |
| BALANCED | Compare against a known good run or use a heavier profiler for lower-level detail. |
Low GPU utilization plus INPUT-BOUND is a strong signal to start with the DataLoader bottleneck guide. Low GPU utilization by itself is not enough to conclude that input loading is the cause.
Checks that match TraceML evidence
If input time is high:
- inspect DataLoader(num_workers=...)
- inspect CPU preprocessing, decoding, tokenization, and collation
- check slow storage or network filesystems
- compare with a synthetic-data run
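One way to run the synthetic-data comparison is to time batch fetching alone, with no model work. The helper below is a minimal sketch that uses only the standard library; the name `seconds_per_batch` and the default of 50 batches are assumptions for illustration. It accepts any iterable, so a `DataLoader` can be passed directly.

```python
import time
from itertools import islice
from typing import Iterable


def seconds_per_batch(batches: Iterable, max_batches: int = 50) -> float:
    """Average wall-clock seconds spent fetching each batch, with no model work."""
    start = time.perf_counter()
    # Consume up to max_batches items and count how many were actually produced.
    fetched = sum(1 for _ in islice(batches, max_batches))
    elapsed = time.perf_counter() - start
    if fetched == 0:
        raise ValueError("no batches were produced")
    return elapsed / fetched
```

Call it once with the real loader and once with a loader over pre-generated random tensors of the same shapes. If the real loader is much slower per batch, the cost is more likely in storage, decoding, or preprocessing than in the model. The first batch includes worker startup, so use enough batches to average that out.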
If H2D time is high:
- check host-to-device transfer behavior
- for CUDA training, check whether the input path uses pin_memory=True and non-blocking transfer where appropriate
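As a reference point for the second check, a pinned-memory loader with non-blocking copies could look like the sketch below. The toy dataset and its shapes are hypothetical stand-ins for the real input path, and the code falls back to plain CPU tensors when CUDA is not available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"

# Toy in-memory dataset standing in for the real one (hypothetical shapes).
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))

loader = DataLoader(
    dataset,
    batch_size=8,
    pin_memory=use_cuda,  # page-locked host memory only matters for CUDA copies
)

for features, labels in loader:
    # With a pinned source, non_blocking=True lets the copy overlap with other work.
    features = features.to(device, non_blocking=use_cuda)
    labels = labels.to(device, non_blocking=use_cuda)
```

Whether this helps depends on the workload, so confirm the effect on H2D time in a before and after run.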
If wait time is high:
- inspect logging, checkpointing, validation, and framework work around the traced training step
- inspect CPU or RAM pressure in the System and Process sections
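To cross-check where host time goes around the training step, you can wrap the suspected sections in a small timer. This is a minimal standard-library sketch, separate from TraceML; the `timed` helper, the `section_seconds` dictionary, and the labels are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per label (labels are chosen by the caller).
section_seconds = defaultdict(float)


@contextmanager
def timed(label: str):
    """Add the wall-clock duration of the enclosed block to section_seconds[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        section_seconds[label] += time.perf_counter() - start


# Example: wrap work that runs around the training step.
with timed("checkpoint"):
    time.sleep(0.01)  # stand-in for a hypothetical checkpoint save
with timed("logging"):
    pass  # stand-in for metric logging
```

This measures host wall-clock time only. CUDA kernels launch asynchronously, so GPU work is not attributed correctly by this timer unless the device is synchronized first.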
If one rank is worse than the others:
- use the DDP rank straggler guide
- compare the worst rank with the median rank
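The worst-versus-median comparison can be expressed as a ratio. The sketch below assumes you already have an average step time per rank. The function name and the sample numbers are hypothetical and are not taken from TraceML output.

```python
from statistics import median
from typing import Dict, Tuple


def worst_rank_vs_median(step_seconds_by_rank: Dict[int, float]) -> Tuple[int, float]:
    """Return (worst_rank, worst_time / median_time) for per-rank average step times."""
    median_time = median(step_seconds_by_rank.values())
    worst_rank = max(step_seconds_by_rank, key=step_seconds_by_rank.get)
    return worst_rank, step_seconds_by_rank[worst_rank] / median_time


# Hypothetical per-rank averages in seconds, for illustration only.
rank, ratio = worst_rank_vs_median({0: 0.25, 1: 0.25, 2: 0.5, 3: 0.25})
print(f"rank {rank} is {ratio:.2f}x the median step time")
```

A ratio close to 1 suggests the ranks are balanced. A ratio well above 1 points at the returned rank as the one to inspect.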
Compare a fix
After changing the suspected cause, compare the before and after summaries:
traceml compare old_run/final_summary.json new_run/final_summary.json
Check whether GPU utilization, total step time, input time, wait time, or the primary diagnosis changed.
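If you also want to inspect the two summaries by hand, a schema-agnostic diff of numeric fields is one option. The layout of final_summary.json is not described in this guide, so the sketch below makes no assumption about its keys. The helper names are hypothetical, and this is not a replacement for `traceml compare`.

```python
import json
from typing import Any, Dict, Tuple


def numeric_leaves(node: Any, prefix: str = "") -> Dict[str, float]:
    """Flatten nested dicts and lists into {"a.b.0": number} for every numeric leaf."""
    if isinstance(node, bool):  # bool is a subclass of int, so skip it explicitly
        return {}
    if isinstance(node, (int, float)):
        return {prefix: float(node)}
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {}
    flat: Dict[str, float] = {}
    for key, value in items:
        child = f"{prefix}.{key}" if prefix else str(key)
        flat.update(numeric_leaves(value, child))
    return flat


def changed_numbers(old: Any, new: Any) -> Dict[str, Tuple[float, float]]:
    """Return {path: (old_value, new_value)} for numeric leaves in both that differ."""
    before, after = numeric_leaves(old), numeric_leaves(new)
    return {k: (before[k], after[k]) for k in before if k in after and before[k] != after[k]}


def load(path: str) -> Any:
    """Read one JSON summary file."""
    with open(path) as handle:
        return json.load(handle)
```

For example, `changed_numbers(load("old_run/final_summary.json"), load("new_run/final_summary.json"))` lists every numeric field that moved between the two runs.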