Compare Runs¶

Use traceml compare to compare two TraceML final summary JSON files from two different runs.

This is the cleanest way to answer questions like:

did the run get slower or faster?
did the diagnosis change?
did wait share increase?
did memory pressure or skew get worse?

traceml compare is designed for comparing finalized run summaries, not raw logs or raw SQLite databases.

What you need¶

You need two TraceML final summary JSON files.

A common way to produce them is:

traceml run train.py --mode=summary

Then call traceml.final_summary() near the end of your script.

If you are logging TraceML output into W&B or MLflow, you can also keep those summary JSON files as run artifacts and compare them later.

Basic usage¶

traceml compare run_a.json run_b.json

This compares:

A: the first file you pass
B: the second file you pass

TraceML writes:

a structured compare JSON
a compact text report

By default, outputs are written under a local compare/ directory in the current working directory.

Example:

compare/run_a_vs_run_b.json
compare/run_a_vs_run_b.txt

If the file names are generic, such as final_summary.json, TraceML falls back to parent directory names when naming the compare artifacts.

Choose an output name¶

If you want to control the output name, pass --output.

traceml compare run_a.json run_b.json --output=my_compare

This writes:

my_compare.json
my_compare.txt

You can also pass a path:

traceml compare run_a.json run_b.json --output=artifacts/baseline_vs_candidate

This writes:

artifacts/baseline_vs_candidate.json
artifacts/baseline_vs_candidate.txt

What the compare output shows¶

The compare report is designed to stay compact and useful.

It typically focuses on:

overall duration
step-time diagnosis changes
average step time changes
wait-share changes
step split shifts across dataloader, forward, backward, and optimizer
memory changes when they are meaningful
process or system changes when they add useful context

The text report includes a small legend near the top:

- A: <first run>
- B: <second run>
- Format: A -> B | delta = B - A

That means:

A -> B shows the value in the first run and then the second run
delta is computed as B - A

Recommended workflow¶

A good workflow is:

run TraceML in summary mode for each run you care about
save the TraceML final summary JSON file for each run
compare two runs with traceml compare
use the compare output to decide whether a regression looks real and where to dig next

Example:

traceml run train_a.py --mode=summary
traceml run train_b.py --mode=summary
traceml compare run_a.json run_b.json

This is often enough to tell whether the slowdown is coming from:

more compute time
more wait time
a phase split change
worse memory behavior
a diagnosis shift

What compare is best at today¶

TraceML compare is currently strongest for comparing:

step time
step memory
process-level context
selected system-level context

It is best used as a compact run-to-run diagnosis tool.

It is not meant to replace a full experiment tracking system.

Use W&B, MLflow, or TensorBoard for:

run metadata
metrics history
artifacts
dashboards
experiment management

Use TraceML compare for:

bottleneck changes
diagnosis changes
performance regressions you want to inspect quickly

Compatibility and missing fields¶

traceml compare is designed to degrade gracefully when fields are missing.

That means:

if one summary has a field and the other does not, comparison still runs
if a section is missing, TraceML skips noisy output instead of failing when possible
if newer TraceML versions add more fields later, older comparisons should still remain usable for the shared fields

This helps keep compare useful across incremental TraceML releases.

What files should you compare?¶

Compare:

TraceML final summary JSON files from completed runs

Do not compare:

raw database files
partial logs
screenshots
rendered text summaries alone

The JSON file is the stable machine-readable input for compare.

When compare is most useful¶

Use compare when:

a training change might have made runs slower
a new dataloader or preprocessing path may have changed throughput
a model or optimizer change may have shifted time into a different phase
memory behavior looks different between two runs
dashboards look similar but throughput feels worse