Use TraceML with W&B, MLflow, or TensorBoard¶
You do not need to replace your existing tracking stack to use TraceML.
TraceML is designed to work alongside tools like:
- Weights & Biases (W&B)
- MLflow
- TensorBoard
- Lightning loggers
- other experiment or metric tracking tools
A good mental model is:
- your existing stack tracks runs, configs, metrics, and artifacts
- TraceML explains why training is slow
The short version¶
Use your existing tools for:
- experiment tracking
- loss and accuracy curves
- artifacts and checkpoints
- long-term dashboards
- team reporting
Use TraceML for:
- input bottlenecks
- compute-bound steps
- DDP stragglers
- wait-heavy behavior
- memory creep
- step-aware bottleneck diagnosis
- compact run-to-run comparison from saved TraceML summary JSON files
TraceML is the training bottleneck finder in the stack.
A common setup¶
A common setup looks like this:
- Hugging Face, Lightning, or a normal PyTorch loop for training
- W&B or MLflow for experiment tracking
- TraceML for bottleneck diagnosis during runs and end-of-run summaries
That means:
- keep your current tracking workflow
- add TraceML when you need to understand throughput problems
Why this setup makes sense¶
Tracking tools answer questions like:
- what config did this run use?
- what was the loss curve?
- which checkpoint belongs to this run?
- how did this run compare to the last one?
TraceML answers questions like:
- is the job input-bound or compute-bound?
- is one rank slower than the others?
- is the job spending time waiting?
- is memory drifting upward over time?
- which phase of the step is the real bottleneck?
These are different jobs.
Example: W&B + TraceML¶
You can keep using W&B logging in your training script and launch the run through TraceML.
Example:
import traceml_ai as traceml
import wandb
wandb.init(project="my-project")
for batch in dataloader:
with traceml.trace_step(model):
optimizer.zero_grad(set_to_none=True)
outputs = model(batch["x"])
loss = criterion(outputs, batch["y"])
loss.backward()
optimizer.step()
wandb.log({"loss": loss.item()})
Run:
traceml run train.py
This gives you:
- W&B for run tracking
- TraceML for the final bottleneck summary
Log the final TraceML summary to W&B¶
TraceML runs in summary mode by default. If you want to make that explicit, launch with:
traceml run train.py --mode=summary
Then request TraceML's compact tracker summary near the end of your script and log it into W&B:
import traceml_ai as traceml
import wandb
wandb.init(project="my-project")
# training loop ...
summary = traceml.summary(print_text=True)
if summary is not None:
wandb.log(summary)
This lets W&B stay your experiment system of record while TraceML contributes a clean bottleneck diagnosis at the end of the run.
traceml.summary() returns a flat dict of diagnosis statuses and global
average metrics. It is derived from the canonical final_summary.json artifact,
which TraceML generates once per run and reuses on later calls. Keep that JSON
as a run artifact when you want to use traceml compare.
Example: MLflow + TraceML¶
You can do the same with MLflow.
import traceml_ai as traceml
import mlflow
mlflow.start_run()
for batch in dataloader:
with traceml.trace_step(model):
optimizer.zero_grad(set_to_none=True)
outputs = model(batch["x"])
loss = criterion(outputs, batch["y"])
loss.backward()
optimizer.step()
mlflow.log_metric("loss", float(loss.item()), step=step)
Run:
traceml run train.py
This gives you:
- MLflow for experiment tracking
- TraceML for bottleneck diagnosis
Log the final TraceML summary to MLflow¶
TraceML can also return a compact tracker summary for MLflow logging:
import traceml_ai as traceml
import mlflow
mlflow.start_run()
# training loop ...
summary = traceml.summary(print_text=True)
if summary is not None:
numeric = {
k: v for k, v in summary.items()
if isinstance(v, (int, float)) and not isinstance(v, bool)
}
tags = {
k.replace("/", "."): v for k, v in summary.items()
if isinstance(v, str)
}
mlflow.log_metrics(numeric)
mlflow.set_tags(tags)
This is a good fit when you want a compact diagnosis in your run metadata.
If you also want the full TraceML artifact, call traceml.final_summary() and
attach that JSON to the run:
full = traceml.final_summary()
if full is not None:
mlflow.log_dict(full, "traceml/final_summary.json")
That attached JSON is a good input for traceml compare later.
Compare runs later¶
Once you have TraceML final summary JSON files for two runs, compare them with:
traceml compare run_a.json run_b.json
This is useful when:
- a training change may have made runs slower
- a dataloader or preprocessing change may have shifted the bottleneck
- a model or optimizer change may have moved time into a different phase
- dashboards look similar but throughput still feels worse
A practical workflow is:
- run training with TraceML
- keep the TraceML final summary JSON as a W&B or MLflow artifact
- compare two saved summaries locally with
traceml compare
See Compare Runs.
Keeping terminal output clean¶
If you use TraceML in CLI mode together with other loggers, the terminal can get noisy.
Good ways to reduce that:
- disable
tqdmprogress bars - reduce extra console logging
- use the default summary mode if you only want the final TraceML summary
- use the local UI on a single-node run if the terminal feels crowded
Launch the local UI with:
pip install "traceml-ai[dashboard]"
traceml run train.py --mode=dashboard
Dashboard mode is intended for single-node runs. For multi-node runs, keep the default summary mode and log the final summary artifact.
This is often the cleanest option when you want:
- existing loggers to keep running
- TraceML visible at the same time
- less terminal clutter
Do I need to disable W&B or MLflow?¶
No.
For the cleanest local demo experience, you may temporarily disable external reporting, but it is not required.
For example, in some Hugging Face runs you might set:
report_to="none"
disable_tqdm=True
That is just to keep the terminal cleaner during local diagnosis runs.
It is not a TraceML requirement.
Best way to adopt TraceML¶
A clean adoption path is:
- start with
traceml run train.py - use
--mode=clior--mode=dashboardwhen you want live feedback on a single-node run - log selected TraceML summary fields into W&B or MLflow if useful
- keep the TraceML final summary JSON for important runs
- compare two saved runs later with
traceml compare
This usually gives the best balance between:
- low overhead
- clear diagnosis
- compatibility with your current stack
- simple run-to-run review