Use TraceML with W&B, MLflow, or TensorBoard¶
You do not need to replace your existing tracking stack to use TraceML.
TraceML is designed to work alongside tools like:
- Weights & Biases (W&B)
- MLflow
- TensorBoard
- Lightning loggers
- other experiment or metric tracking tools
A good mental model is:
- your existing stack tracks runs, configs, metrics, and artifacts
- TraceML explains why training is slow
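A minimal sketch of that split, using a TensorBoard writer as the stand-in for your existing stack (the model, optimizer, loss, and data setup are assumed; `traceml.trace_step` also appears in the W&B and MLflow examples below):

```python
import traceml
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/demo")  # assumed log directory

# model, criterion, optimizer, and dataloader are defined as usual
for step, batch in enumerate(dataloader):
    with traceml.trace_step(model):  # TraceML explains where step time goes
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()

    writer.add_scalar("loss", loss.item(), step)  # TensorBoard tracks the metric
```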
The short version¶
Use your existing tools for:
- experiment tracking
- loss and accuracy curves
- artifacts and checkpoints
- long-term dashboards
- team reporting
Use TraceML for:
- input bottlenecks
- compute-bound steps
- DDP stragglers
- wait-heavy behavior
- memory creep
- step-aware bottleneck diagnosis
- compact run-to-run comparison from saved TraceML summary JSON files
TraceML is the training bottleneck finder in the stack.
A common setup¶
A common setup looks like this:
- Hugging Face, Lightning, or a normal PyTorch loop for training
- W&B or MLflow for experiment tracking
- TraceML for bottleneck diagnosis during runs and end-of-run summaries
That means:
- keep your current tracking workflow
- add TraceML when you need to understand throughput problems
Why this setup makes sense¶
Tracking tools answer questions like:
- what config did this run use?
- what was the loss curve?
- which checkpoint belongs to this run?
- how did this run compare to the last one?
TraceML answers questions like:
- is the job input-bound or compute-bound?
- is one rank slower than the others?
- is the job spending time waiting?
- is memory drifting upward over time?
- which phase of the step is the real bottleneck?
These are different jobs.
Example: W&B + TraceML¶
You can keep using W&B logging in your training script and launch the run through TraceML.
Example:
```python
import traceml
import wandb

wandb.init(project="my-project")

# model, criterion, optimizer, and dataloader are defined as usual
for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()

    wandb.log({"loss": loss.item()})
```
Run:
```bash
traceml run train.py
```
This gives you:
- W&B for run tracking
- TraceML for live bottleneck diagnosis
Log the final TraceML summary to W&B¶
If you want a quieter run and a structured TraceML summary you can log at the end, launch in summary mode:
```bash
traceml run train.py --mode=summary
```
Then request the finalized TraceML summary near the end of your script and log selected fields into W&B:
```python
import traceml
import wandb

wandb.init(project="my-project")

# training loop ...

summary = traceml.final_summary(print_text=True)
if summary is not None:
    wandb.summary["traceml/step_time_status"] = summary["step_time"]["diagnosis"]["status"]
    wandb.summary["traceml/step_memory_status"] = summary["step_memory"]["diagnosis"]["status"]
    wandb.summary["traceml/duration_s"] = summary.get("duration_s")
```
This lets W&B stay your experiment system of record while TraceML contributes a clean bottleneck diagnosis at the end of the run.
If you also store `final_summary.json` as a run artifact, you can reuse it later with `traceml compare`, as sketched below.
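For example, with W&B you might dump the summary to disk yourself and attach it as an artifact (a sketch, continuing from the block above; the file name and artifact name are arbitrary choices, not TraceML requirements):

```python
import json

import wandb

# `summary` is the dict returned by traceml.final_summary() above
if summary is not None:
    with open("final_summary.json", "w") as f:
        json.dump(summary, f, indent=2)  # write the full summary to disk

    artifact = wandb.Artifact("traceml-final-summary", type="traceml-summary")
    artifact.add_file("final_summary.json")  # attach the JSON file
    wandb.log_artifact(artifact)  # log it against the current run
```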
Example: MLflow + TraceML¶
You can do the same with MLflow.
```python
import traceml
import mlflow

mlflow.start_run()

# model, criterion, optimizer, and dataloader are defined as usual
for step, batch in enumerate(dataloader):
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()

    mlflow.log_metric("loss", float(loss.item()), step=step)
```
Run:
```bash
traceml run train.py
```
This gives you:
- MLflow for experiment tracking
- TraceML for bottleneck diagnosis
Log the final TraceML summary to MLflow¶
TraceML can also return a structured final summary for MLflow logging:
```python
import traceml
import mlflow

mlflow.start_run()

# training loop ...

summary = traceml.final_summary(print_text=True)
if summary is not None:
    mlflow.log_param(
        "traceml_step_time_status",
        summary["step_time"]["diagnosis"]["status"],
    )
    mlflow.log_param(
        "traceml_step_memory_status",
        summary["step_memory"]["diagnosis"]["status"],
    )
    mlflow.log_dict(summary, "traceml/final_summary.json")
```
This is a good fit when you want both a compact diagnosis in your run metadata and the full TraceML summary JSON attached to the run.
That attached JSON is also a good input for `traceml compare` later.
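To feed two of those attached JSONs to `traceml compare` later, one option is to pull them back down from the tracking server first (a sketch; the run IDs are placeholders, and the artifact path matches the `mlflow.log_dict` call above):

```python
import mlflow

# placeholders: the run IDs of the two runs you want to compare
path_a = mlflow.artifacts.download_artifacts(
    run_id="<run_a_id>", artifact_path="traceml/final_summary.json"
)
path_b = mlflow.artifacts.download_artifacts(
    run_id="<run_b_id>", artifact_path="traceml/final_summary.json"
)

# then, from a shell: traceml compare <path_a> <path_b>
print(path_a, path_b)
```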
Compare runs later¶
Once you have TraceML final summary JSON files for two runs, compare them with:
```bash
traceml compare run_a.json run_b.json
```
This is useful when:
- a training change may have made runs slower
- a dataloader or preprocessing change may have shifted the bottleneck
- a model or optimizer change may have moved time into a different phase
- dashboards look similar but throughput still feels worse
A practical workflow is:
- run training with TraceML summary mode
- keep the TraceML final summary JSON as a W&B or MLflow artifact
- compare two saved summaries locally with `traceml compare`
See Compare Runs.
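In shell form, that workflow might look like this (a sketch; the JSON file names are hypothetical, and saving the summaries as artifacts is covered in the W&B and MLflow sections above):

```bash
# run A, quiet run with a structured final summary
traceml run train.py --mode=summary
# ... save the final summary JSON as run_a.json ...

# run B, after the change you want to evaluate
traceml run train.py --mode=summary
# ... save the final summary JSON as run_b.json ...

traceml compare run_a.json run_b.json
```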
Keeping terminal output clean¶
If you use TraceML in CLI mode together with other loggers, the terminal can get noisy.
Good ways to reduce that:
- disable `tqdm` progress bars
- reduce extra console logging
- use `--mode=summary` if you only want the final TraceML summary
- use the local UI if the terminal feels crowded
Launch the local UI with:
```bash
traceml run train.py --mode=dashboard
```
This is often the cleanest option when you want:
- existing loggers to keep running
- TraceML visible at the same time
- less terminal clutter
Do I need to disable W&B or MLflow?¶
No.
For the cleanest local demo experience, you may temporarily disable external reporting, but it is not required.
For example, in some Hugging Face runs you might set:
- `report_to="none"`
- `disable_tqdm=True`
That is just to keep the terminal cleaner during local diagnosis runs.
It is not a TraceML requirement.
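In `transformers`, those two settings live on `TrainingArguments` (a minimal sketch; `output_dir` and the rest of the training setup are assumed):

```python
from transformers import TrainingArguments

# optional: quiet a local diagnosis run; not required by TraceML
args = TrainingArguments(
    output_dir="out",   # assumed output location
    report_to="none",   # skip W&B/MLflow/TensorBoard reporting for this run
    disable_tqdm=True,  # drop progress bars so the terminal stays readable
)
```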
Best way to adopt TraceML¶
A clean adoption path is:
- start with `traceml run train.py`
- use `--mode=summary` when you want a quieter run and a structured final summary
- log selected TraceML summary fields into W&B or MLflow if useful
- keep the TraceML final summary JSON for important runs
- compare two saved runs later with `traceml compare`
This usually gives the best balance between:
- low overhead
- clear diagnosis
- compatibility with your current stack
- simple run-to-run review