TraceML Quickstart¶
Get from install to your first TraceML diagnosis in a few minutes.
TraceML runs with your existing PyTorch script and writes a structured
final_summary.json plus a human-readable final_summary.txt at the end of
the run.
1. Install¶
pip install traceml-ai
Using Hugging Face, Lightning, Ray, W&B, or MLflow? See Use With Your Stack.
2. Instrument Your Training Step¶
Add TraceML initialization once, then wrap the training step body:
import traceml_ai as traceml
traceml.init(mode="auto")
for batch in dataloader:
with traceml.trace_step(model):
optimizer.zero_grad(set_to_none=True)
outputs = model(batch["x"])
loss = criterion(outputs, batch["y"])
loss.backward()
optimizer.step()
Wrap the work from zero_grad(...) through optimizer.step().
3. Run Your Script¶
traceml run train.py
To try the same flow with a checked-in example first:
traceml run examples/quickstart.py --mode=summary
TraceML writes:
logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
Add --html-report (traceml run train.py --html-report) to also write a
shareable final_summary.html. See
Reading the output.
For DDP, FSDP, and multi-node launches, see Distributed Training.
4. Read Your Diagnosis¶
+----------------------------------------------------------------------------+
| Step Time |
| - Diagnosis: INPUT STRAGGLER |
| - Stats: total 303.7ms | input 254.5ms | compute 259.5ms | wait 40.5ms |
| - Why: r0 input was slower than median global rank (254.5/3.8ms). |
+----------------------------------------------------------------------------+
In this example, rank 0 is the slow input rank, which can hold back the aligned distributed step.
Useful Next Commands¶
Live terminal view:
traceml run train.py --mode=cli
Browser view:
pip install "traceml-ai[dashboard]"
traceml run train.py --mode=dashboard
Compare two runs:
traceml compare run_a/final_summary.json run_b/final_summary.json