Hugging Face Trainer Integration
Use TraceML with Hugging Face Trainer without rewriting your training loop.
The preferred integration has two steps: call
`traceml_ai.integrations.huggingface.init()` once, then pass
`TraceMLTrainerCallback` (a standard `transformers.TrainerCallback`) to your
existing `Trainer`. The legacy `TraceMLTrainer` subclass is still supported; it
is now a thin wrapper that installs the same callback under the hood.
1. Install

```bash
pip install "traceml-ai[hf]"
```

If you are running the full examples below, install their optional dependencies:

```bash
pip install datasets torchvision
```
2. Initialize TraceML And Add TraceMLTrainerCallback

Call `init()` once before constructing the `Trainer`, then pass the callback
to `transformers.Trainer` through its `callbacks` argument:

```python
from traceml_ai.integrations import huggingface as traceml_hf
from transformers import Trainer, TrainingArguments

# Install TraceML's process-wide instrumentation before the Trainer exists.
traceml_hf.init()

training_args = TrainingArguments(
    output_dir="./output",
    report_to="none",
    disable_tqdm=True,
)

# model, train_dataset, and eval_dataset come from your existing script.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[traceml_hf.TraceMLTrainerCallback()],
)
trainer.train()
```
`traceml_hf.init()` installs TraceML's process-wide instrumentation:
DataLoader fetch timing, the H2D `Tensor.to` patch, and the
forward/backward/optimizer auto-timers. The callback is only a per-step bracket
and cannot install these on its own, so calling `init()` first is what lets
TraceML attribute DataLoader fetch time and host-to-device copies. `init()` is
idempotent, so a repeated call is harmless; calling it once at startup is enough.
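To make "process-wide and idempotent" concrete, here is a minimal sketch of that general pattern. It is not TraceML's implementation: `FakeLoader`, `install_timer`, `_PATCHED`, and `timings` are hypothetical names, and the patched target is a stand-in class rather than a real DataLoader or `Tensor.to`.

```python
import time


class FakeLoader:
    """Stand-in for a class whose method we want to time (hypothetical)."""

    def fetch(self):
        return [1, 2, 3]


timings = []      # durations in seconds, one entry per timed call
_PATCHED = False  # module-level guard that makes the install idempotent


def install_timer():
    """Wrap FakeLoader.fetch with a timer at most once per process."""
    global _PATCHED
    if _PATCHED:
        return False  # already installed; a second call changes nothing
    original = FakeLoader.fetch

    def timed_fetch(self):
        start = time.perf_counter()
        try:
            return original(self)
        finally:
            timings.append(time.perf_counter() - start)

    FakeLoader.fetch = timed_fetch  # class-level patch affects every instance
    _PATCHED = True
    return True


first = install_timer()    # installs the wrapper
second = install_timer()   # no-op thanks to the guard
batch = FakeLoader().fetch()
```

Because the patch is applied to the class, it affects every instance created afterwards, which is why a per-step callback cannot substitute for a one-time install. The guard keeps a second call from wrapping the method twice and double-counting each fetch.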
You do not need to add `traceml.trace_step(...)` manually. The callback opens
and closes `trace_step` around each optimizer step, and the auto-timers that
`init()` installed capture forward, backward, h2d, and optimizer phases inside
that bracket.
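The bracket can be pictured as a context manager that one hook enters and a later hook exits. The sketch below is an illustration under simplifying assumptions, not TraceML's code: `fake_trace_step`, `BracketCallback`, and the `events` log are hypothetical, and the hooks omit the `args, state, control` parameters that real `transformers.TrainerCallback` hooks receive.

```python
from contextlib import ExitStack, contextmanager

events = []  # ordered log of what happened


@contextmanager
def fake_trace_step(step):
    """Hypothetical stand-in for a step-level tracing context."""
    events.append(f"open {step}")
    try:
        yield
    finally:
        events.append(f"close {step}")


class BracketCallback:
    """Holds a context manager open between two separate hook calls."""

    def __init__(self):
        self._stack = None
        self._step = 0

    def on_step_begin(self):
        # ExitStack lets us enter the context here and exit it in another method.
        self._stack = ExitStack()
        self._stack.enter_context(fake_trace_step(self._step))

    def on_step_end(self):
        if self._stack is not None:
            self._stack.close()
            self._stack = None
        self._step += 1


cb = BracketCallback()
for _ in range(2):
    cb.on_step_begin()
    events.append("forward/backward/optimizer")  # work done inside the bracket
    cb.on_step_end()
```

Everything logged between an `open` and its matching `close` belongs to that step, which is the role the auto-timers play inside the real bracket.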
Legacy TraceMLTrainer

The `TraceMLTrainer(Trainer)` subclass remains supported for users who already
adopted it. It is now a thin wrapper that auto-installs
`TraceMLTrainerCallback` on construction and accepts `traceml_enabled` to turn
step-level instrumentation on or off:

```python
from traceml_ai.integrations import huggingface as traceml_hf

traceml_hf.init()

# model, training_args, and train_dataset come from your existing script.
trainer = traceml_hf.TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    traceml_enabled=True,
)
trainer.train()
```
New code should prefer the direct callback registration shown above.
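For readers curious what "thin wrapper" means structurally, the following sketch shows one way a subclass can install a callback at construction time. All classes here are hypothetical stand-ins (`BaseTrainer` for `transformers.Trainer`, `StepCallback` for `TraceMLTrainerCallback`), and the duplicate check is an assumption of this sketch, not documented TraceML behavior.

```python
class BaseTrainer:
    """Heavily simplified stand-in for transformers.Trainer (hypothetical)."""

    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])


class StepCallback:
    """Stand-in for TraceMLTrainerCallback (hypothetical)."""


class WrappedTrainer(BaseTrainer):
    """Thin wrapper: adds the callback, then defers everything to the base class."""

    def __init__(self, *args, traceml_enabled=True, callbacks=None, **kwargs):
        callbacks = list(callbacks or [])
        # Assumption of this sketch: avoid registering the callback twice.
        already = any(isinstance(cb, StepCallback) for cb in callbacks)
        if traceml_enabled and not already:
            callbacks.append(StepCallback())
        super().__init__(*args, callbacks=callbacks, **kwargs)


enabled = WrappedTrainer()
disabled = WrappedTrainer(traceml_enabled=False)
explicit = WrappedTrainer(callbacks=[StepCallback()])
```

The wrapper adds no training logic of its own, which is why registering the callback directly on a plain `Trainer` is equivalent for new code.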
3. Launch The Run

Single GPU:

```bash
traceml run fine_tune.py
```

Single-node multi-GPU DDP:

```bash
traceml run fine_tune.py --nproc-per-node=4
```

For multi-node DDP launch commands, see Distributed Training.
Recommended TrainingArguments

These settings are optional, but they make local TraceML diagnostic runs easier to read:

| Setting | Why it helps |
|---|---|
| `disable_tqdm=True` | Prevents the Hugging Face progress bar from fighting with the TraceML live CLI. |
| `report_to="none"` | Keeps tracker output out of the terminal during local diagnosis. |
| `save_strategy="no"` | Avoids checkpoint files during short diagnostic runs. |

TraceML can still run alongside W&B, MLflow, and TensorBoard. For tracker logging patterns, see W&B / MLflow.
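If several scripts share these settings, one option is to keep them in a single dictionary and override individual keys per run. This is a convenience sketch: `DIAGNOSTIC_DEFAULTS` and `diagnostic_kwargs` are hypothetical helpers, not part of TraceML or Transformers.

```python
# Keyword arguments taken from the table above.
DIAGNOSTIC_DEFAULTS = {
    "disable_tqdm": True,
    "report_to": "none",
    "save_strategy": "no",
}


def diagnostic_kwargs(**overrides):
    """Return TrainingArguments keyword arguments for a short diagnostic run."""
    merged = dict(DIAGNOSTIC_DEFAULTS)  # copy so the defaults stay untouched
    merged.update(overrides)            # explicit arguments win
    return merged


# Example: keep the quiet terminal settings but log to W&B for this run.
kwargs = diagnostic_kwargs(output_dir="./output", report_to="wandb")
```

The result can then be unpacked in a training script as `TrainingArguments(**kwargs)`.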
Limitations
The callback path is the right default for most users, but it has structural trade-offs compared with the legacy subclass that are worth knowing:
- Step granularity. One TraceML step equals one optimizer step. With `gradient_accumulation_steps=N`, forward and backward times from all `N` accumulated micro-batches fold into a single TraceML step. The pre-refactor `TraceMLTrainer.training_step` override counted each micro-batch as its own TraceML step. If you need per-micro-batch attribution under gradient accumulation, open an issue.
- Optimizer timing. Captured by TraceML's global optimizer hooks, which are installed automatically only when running under the default `traceml.init(mode="auto")` path. Under `manual` or `selective` modes, optimizer events are not emitted, but this applies consistently to every step, so dashboard step alignment still holds.
- Exception safety. If `training_step` raises, Hugging Face does not call `on_step_end`. The callback defensively closes the `trace_step` context on the next `on_step_begin`, on `on_train_begin` (so a reused callback instance whose previous run crashed mid-step does not bleed a leaked auto-timer flag into an `eval_on_start=True` evaluation), and on `on_train_end`. Even so, the step where the exception occurred may be reported with incomplete timing or memory. The legacy subclass's `with trace_step(model): super().training_step(...)` pattern ran the `finally` cleanup deterministically. If you need precise attribution on failing steps, the legacy `TraceMLTrainer` path is stricter.
- Callback registration timing. Pass `TraceMLTrainerCallback()` at `Trainer(...)` construction. Callbacks added via `trainer.add_callback(...)` after `trainer.train()` has started will not receive `on_train_begin`, which means the first step may be outside TraceML's callback-managed `trace_step` bracket.
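The step-granularity and exception-safety points can be modeled in a few lines. The following is a toy simulation under stated assumptions, not TraceML's implementation: `StepRecorder`, `on_micro_batch`, the `complete` flag, and the `run` driver are all hypothetical, and only the remaining hook names mirror the Hugging Face callback events mentioned above.

```python
class StepRecorder:
    """Toy model of a callback-managed step bracket (hypothetical)."""

    def __init__(self):
        self.steps = []    # closed steps: {"micro_batches": int, "complete": bool}
        self._open = None  # the currently open step, if any

    def _close(self, complete):
        if self._open is not None:
            self._open["complete"] = complete
            self.steps.append(self._open)
            self._open = None

    def on_train_begin(self):
        self._close(complete=False)  # defensive: bracket leaked by a crashed run

    def on_step_begin(self):
        self._close(complete=False)  # defensive: previous step never ended
        self._open = {"micro_batches": 0, "complete": False}

    def on_micro_batch(self):
        self._open["micro_batches"] += 1

    def on_step_end(self):
        self._close(complete=True)

    def on_train_end(self):
        self._close(complete=False)


def run(recorder, optimizer_steps, accumulation, fail_at=None):
    """Drive the hooks the way a training loop would; optionally fail mid-step."""
    recorder.on_train_begin()
    for step in range(optimizer_steps):
        recorder.on_step_begin()
        for _ in range(accumulation):
            recorder.on_micro_batch()
            if step == fail_at:
                raise RuntimeError("training_step failed")  # skips on_step_end
        recorder.on_step_end()
    recorder.on_train_end()


recorder = StepRecorder()
run(recorder, optimizer_steps=2, accumulation=4)  # clean run
clean_steps = list(recorder.steps)

try:
    run(recorder, optimizer_steps=2, accumulation=4, fail_at=1)  # crashes in step 1
except RuntimeError:
    pass
run(recorder, optimizer_steps=1, accumulation=4)  # reuse the same instance
```

In the clean run, eight micro-batches produce two recorded steps of four micro-batches each. After the crash, the leaked step is closed only when the reused instance sees `on_train_begin`, and it is recorded as incomplete with a single micro-batch.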
Troubleshooting
Terminal output overlaps with TraceML
Set `disable_tqdm=True` in `TrainingArguments`.
If output is still noisy, use browser dashboard mode on single-node runs:

```bash
pip install "traceml-ai[dashboard]"
traceml run fine_tune.py --mode=dashboard
```
Multi-GPU run only shows one rank
Make sure you launched through TraceML with `--nproc-per-node`, not plain
`python`:

```bash
traceml run fine_tune.py --nproc-per-node=4
```
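To confirm from inside the script how many ranks were started, one option is to read the environment variables that `torchrun` exports to each worker (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). The sketch assumes TraceML's launcher starts workers through `torchrun` so that these variables are set; `rank_info` and `looks_distributed` are hypothetical helper names.

```python
import os


def rank_info(env=None):
    """Read torchrun's rank variables, defaulting to a single process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def looks_distributed(env=None):
    """True when more than one worker process was launched."""
    return rank_info(env)["world_size"] > 1


# Explicit environments are passed here so the example does not depend on
# how this snippet itself was launched.
plain_python = rank_info({})
four_gpu_worker = rank_info({"RANK": "2", "LOCAL_RANK": "2", "WORLD_SIZE": "4"})
```

Calling `rank_info()` without arguments near the top of `fine_tune.py` and printing the result shows whether the run really has a world size of 4.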
I want a baseline without TraceML
Run the same script with TraceML disabled:

```bash
traceml run fine_tune.py --disable-traceml
```

This launches your script natively through `torchrun` without TraceML telemetry.
Full Examples
Use these examples when you want a complete runnable script. If you already have a Hugging Face training script, start with the smaller replacement pattern above.
NLP classification example

Save as `fine_tune_nlp.py`:

```python
import os

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

from traceml_ai.integrations import huggingface as traceml_hf


def main():
    # Install TraceML's process-wide instrumentation before building the Trainer.
    traceml_hf.init()

    model_name = "prajjwal1/bert-mini"
    output_dir = "./hf_nlp_output"
    os.makedirs(output_dir, exist_ok=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=4,
    ).to(device)

    # Small slice of AG News keeps the diagnostic run short.
    raw_dataset = load_dataset("ag_news", split="train[:2000]")

    def tokenize(examples):
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=64,
        )

    dataset = raw_dataset.map(tokenize, batched=True)

    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=32,
        num_train_epochs=3,
        logging_steps=10,
        save_strategy="no",
        use_cpu=(device == "cpu"),
        report_to="none",
        disable_tqdm=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        callbacks=[traceml_hf.TraceMLTrainerCallback()],
    )
    trainer.train()


if __name__ == "__main__":
    main()
```

```bash
traceml run fine_tune_nlp.py
```
Vision classification example

Save as `fine_tune_vision.py`:

```python
import os

import torch
from datasets import load_dataset
from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ToTensor
from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    DefaultDataCollator,
    Trainer,
    TrainingArguments,
)

from traceml_ai.integrations import huggingface as traceml_hf


def main():
    # Install TraceML's process-wide instrumentation before building the Trainer.
    traceml_hf.init()

    model_name = "google/vit-base-patch16-224-in21k"
    output_dir = "./hf_vision_output"
    os.makedirs(output_dir, exist_ok=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    image_processor = AutoImageProcessor.from_pretrained(model_name)
    model = AutoModelForImageClassification.from_pretrained(
        model_name,
        num_labels=10,
    ).to(device)

    # Small slice of CIFAR-10 keeps the diagnostic run short.
    dataset = load_dataset("cifar10", split="train[:2000]")

    # Crop to the model's input size and normalize with the processor's statistics.
    transform = Compose(
        [
            RandomResizedCrop(
                image_processor.size["height"],
                scale=(0.8, 1.0),
            ),
            ToTensor(),
            Normalize(
                mean=image_processor.image_mean,
                std=image_processor.image_std,
            ),
        ]
    )

    def preprocess(example):
        image = example["img"].convert("RGB")
        example["pixel_values"] = transform(image)
        example["labels"] = example["label"]
        return example

    dataset = dataset.map(preprocess)

    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=16,
        num_train_epochs=2,
        logging_steps=10,
        save_strategy="no",
        report_to="none",
        disable_tqdm=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=DefaultDataCollator(),
        callbacks=[traceml_hf.TraceMLTrainerCallback()],
    )
    trainer.train()


if __name__ == "__main__":
    main()
```

```bash
traceml run fine_tune_vision.py
```
Reference
`init()` takes no arguments. Call it once before constructing the `Trainer` to
install TraceML's process-wide patches (DataLoader fetch timing, H2D
`Tensor.to`, and the forward/backward/optimizer auto-timers). It is idempotent
and returns the effective `TraceMLInitConfig`.
`TraceMLTrainerCallback()` takes no TraceML-specific arguments and records
standard step-level timing and memory.
`TraceMLTrainer` (legacy thin wrapper) accepts:
- everything that a normal `transformers.Trainer` accepts
- `traceml_enabled=True|False`