Find PyTorch memory creep during training
Use this guide when PyTorch GPU memory rises during training, memory headroom shrinks over time, or a run eventually approaches an out-of-memory (OOM) error.
TraceML step-memory diagnostics separate memory pressure, memory imbalance, and memory growth over the observed window.
Run TraceML
Run your training script:
traceml run train.py
Then read the Step Memory section in:
logs/<run_name>/final_summary.txt
logs/<run_name>/final_summary.json
What to look for
The most relevant Step Memory diagnoses are:
| Diagnosis status | Structured kind | Meaning |
|---|---|---|
| MEMORY CREEP | CREEP_CONFIRMED | memory is rising across the window |
| MEMORY RISING | CREEP_EARLY | memory is rising from early to recent steps |
| HIGH PRESSURE | HIGH_PRESSURE | memory is near device capacity |
| IMBALANCE | IMBALANCE | peak memory differs materially across ranks |
Start with the diagnosis, then inspect the note, worst rank, and memory trend.
What to check after MEMORY CREEP
Common causes to inspect:
- graph-backed tensors retained across steps
- appending loss, logits, hidden states, or activations to a list without detaching them
- caches that grow during training
- step-local state that stays referenced after the step ends
- validation or logging code that stores tensors instead of scalar values
If a worst rank is shown, inspect that rank first.
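
The first two causes in the list above can be hard to spot in a real training loop, so here is a minimal sketch of the pattern and two common alternatives. It is an illustration under assumptions, not TraceML output or a TraceML API: the model, the list names (`retained_losses`, `logged_losses`, `logged_logits`), and the tiny CPU tensors are all made up for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 2)  # toy model, stands in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

retained_losses = []  # anti-pattern: tensors that still reference autograd state
logged_losses = []    # preferred for metrics: plain Python floats
logged_logits = []    # if tensors must be kept: detached copies in host memory

for step in range(5):
    inputs = torch.randn(4, 8)
    targets = torch.randint(0, 2, (4,))

    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # One graph-backed tensor is kept alive per step, so the list grows
    # for as long as training runs.
    retained_losses.append(loss)

    # .item() copies the value out as a float; nothing tied to the graph remains.
    logged_losses.append(loss.item())

    # .detach() drops the autograd reference and .cpu() moves storage off the device.
    logged_logits.append(logits.detach().cpu())
```

Note that the detached copies still accumulate in host memory, so for long runs it is usually safer to keep only scalars or a bounded number of recent entries.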
What to check after MEMORY RISING
MEMORY RISING is an early signal. It means memory in the observed window is
trending upward, but the evidence is weaker than for MEMORY CREEP.
Good next steps:
- let the run continue long enough to collect another window
- check whether the same trend appears again
- compare with a known stable run
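
To cross-check whether the same trend appears again, one option is to record per-step peak memory yourself and compare an early window with a recent one. The sketch below is a hypothetical helper, not TraceML's detection logic: the function name `window_growth`, the window size, and the synthetic sample values are all assumptions chosen for illustration.

```python
from statistics import mean


def window_growth(samples, window=50):
    """Relative change between the mean of the first and last `window` samples."""
    if len(samples) < 2 * window:
        raise ValueError("need at least two non-overlapping windows of samples")
    early = mean(samples[:window])
    recent = mean(samples[-window:])
    return (recent - early) / early


# Synthetic per-step peak memory in MiB (illustrative values, not measurements).
stable = [1000 + (i % 3) for i in range(200)]   # small jitter, no trend
creeping = [1000 + 2 * i for i in range(200)]   # grows by 2 MiB every step

print(f"stable run:   {window_growth(stable):+.2%}")
print(f"creeping run: {window_growth(creeping):+.2%}")
```

In a real loop, the samples could come from `torch.cuda.max_memory_allocated()` read after each step, with `torch.cuda.reset_peak_memory_stats()` called between steps so each reading covers one step only.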
Pressure and imbalance are different
HIGH PRESSURE means the run is close to device memory capacity. It does not
by itself prove memory is growing.
IMBALANCE means one rank is using materially more memory than the typical
rank. Inspect per-rank workload, input shapes, and rank-local branches.
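
As a rough way to reason about imbalance outside the summary, the sketch below compares each rank's peak memory with the median rank. This is an assumption about how "materially more than the typical rank" could be quantified, not TraceML's definition: the function `rank_skew` and the sample peaks are hypothetical.

```python
from statistics import median


def rank_skew(peak_by_rank):
    """Return (worst_rank, ratio of the worst peak to the median peak)."""
    typical = median(peak_by_rank)
    worst_rank = max(range(len(peak_by_rank)), key=peak_by_rank.__getitem__)
    return worst_rank, peak_by_rank[worst_rank] / typical


# Illustrative per-rank peak memory in GiB for a 4-rank job (not measurements).
peaks_gib = [30.1, 30.4, 29.8, 38.2]
worst, ratio = rank_skew(peaks_gib)
print(f"worst rank: {worst}, peak is {ratio:.2f}x the median rank")
```

In a distributed job, the per-rank peaks could be collected with `torch.distributed.all_gather_object` before calling a helper like this on one rank.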
System GPU memory and process GPU memory are useful context, but Step Memory is the section that diagnoses the training-step memory trend.
Compare a fix
After detaching tensors, clearing caches, changing logging, or reducing memory load, compare the before and after summaries:
traceml compare old_run/final_summary.json new_run/final_summary.json
Check whether the Step Memory diagnosis, peak memory, memory trend, and worst-rank skew changed.