Hugging Face Trainer Integration

Use TraceML with Hugging Face Trainer without rewriting your training loop.

The preferred integration is two steps: call traceml_ai.integrations.huggingface.init() once, then pass TraceMLTrainerCallback (a standard transformers.TrainerCallback) to your existing Trainer. The legacy TraceMLTrainer subclass is still supported. It is now a thin wrapper that installs the same callback under the hood.

1. Install

pip install "traceml-ai[hf]"

If you are running the full examples below, install their optional dependencies:

pip install datasets torchvision

2. Initialize TraceML And Add TraceMLTrainerCallback

Call init() once before constructing the Trainer, then pass the callback to transformers.Trainer:

from traceml_ai.integrations import huggingface as traceml_hf
from transformers import Trainer, TrainingArguments

traceml_hf.init()

training_args = TrainingArguments(
    output_dir="./output",
    report_to="none",
    disable_tqdm=True,
)

# model, train_dataset, and eval_dataset are defined elsewhere in your script.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[traceml_hf.TraceMLTrainerCallback()],
)

trainer.train()

traceml_hf.init() installs TraceML's process-wide instrumentation: DataLoader fetch timing, the H2D Tensor.to patch, and the forward/backward/optimizer auto-timers. The callback only brackets each step and cannot install these on its own, so calling init() first is what lets TraceML attribute DataLoader fetch time and host-to-device copies. init() is idempotent, so a repeated call is harmless; calling it once at startup is enough.

You do not need to add traceml.trace_step(...) manually. The callback opens and closes trace_step around each optimizer step, and the auto-timers init() installed capture forward, backward, h2d, and optimizer phases inside that bracket.

Legacy TraceMLTrainer

The TraceMLTrainer(Trainer) subclass remains supported for users who already adopted it. It is now a thin wrapper that auto-installs TraceMLTrainerCallback on construction and accepts traceml_enabled to turn step-level instrumentation on or off:

from traceml_ai.integrations import huggingface as traceml_hf

traceml_hf.init()

trainer = traceml_hf.TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    traceml_enabled=True,
)
trainer.train()

New code should prefer the direct callback registration shown above.

3. Launch The Run

Single GPU:

traceml run fine_tune.py

Single-node multi-GPU DDP:

traceml run fine_tune.py --nproc-per-node=4

For multi-node DDP launch commands, see Distributed Training.

These settings are optional, but they make local TraceML diagnostic runs easier to read:

| Setting | Why it helps |
| --- | --- |
| disable_tqdm=True | Prevents the Hugging Face progress bar from fighting with the TraceML live CLI. |
| report_to="none" | Keeps tracker output out of the terminal during local diagnosis. |
| save_strategy="no" | Avoids checkpoint files during short diagnostic runs. |
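
One way to keep these three settings together is a small helper that merges them into whatever keyword arguments a script already passes to TrainingArguments. The helper and constant names below are hypothetical; only the three setting names and values come from the table above.

```python
# Diagnostic-friendly settings from the table, kept in one place.
# DIAGNOSTIC_KWARGS and with_diagnostic_settings are hypothetical names.
DIAGNOSTIC_KWARGS = {
    "disable_tqdm": True,    # no progress bar competing with the live CLI
    "report_to": "none",     # no tracker output in the terminal
    "save_strategy": "no",   # no checkpoint files for short runs
}


def with_diagnostic_settings(base_kwargs):
    """Return a copy of base_kwargs with the diagnostic settings applied."""
    merged = dict(base_kwargs)
    merged.update(DIAGNOSTIC_KWARGS)
    return merged


kwargs = with_diagnostic_settings({"output_dir": "./output", "logging_steps": 10})
print(kwargs)
```

In a real script the merged dictionary would be passed on as TrainingArguments(**kwargs), which keeps the diagnostic overrides out of the regular training configuration.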

TraceML can still run alongside W&B, MLflow, and TensorBoard. For tracker logging patterns, see W&B / MLflow.

Limitations

The callback path is the right default for most users, but it has structural trade-offs compared with the legacy subclass that are worth knowing:

  • Step granularity. One TraceML step equals one optimizer step. With gradient_accumulation_steps=N, forward and backward times from all N accumulated micro-batches fold into a single TraceML step. The pre-refactor TraceMLTrainer.training_step override counted each micro-batch as its own TraceML step. If you need per-micro-batch attribution under gradient accumulation, open an issue.
  • Optimizer timing. Captured by TraceML's global optimizer hooks, which are installed automatically only when running under the default traceml.init(mode="auto") path. Under manual or selective modes optimizer events are not emitted, but this applies consistently to every step, so dashboard step alignment still holds.
  • Exception safety. If training_step raises, Hugging Face does not call on_step_end. The callback defensively closes the trace_step context at three points: on the next on_step_begin, on on_train_begin, and on on_train_end. The on_train_begin close covers a reused callback instance whose previous run crashed mid-step, so that a leaked auto-timer flag does not bleed into an eval_on_start=True evaluation. Even so, the step where the exception occurred may be reported with incomplete timing or memory. The legacy subclass's with trace_step(model): super().training_step(...) pattern ran the finally cleanup deterministically. If you need precise attribution on failing steps, the legacy TraceMLTrainer path is stricter.
  • Callback registration timing. Pass TraceMLTrainerCallback() at Trainer(...) construction. Callbacks added via trainer.add_callback(...) after trainer.train() has started will not receive on_train_begin, which means the first step may be outside TraceML's callback-managed trace_step bracket.
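
The exception-safety point can be made concrete with a stand-in. This is a minimal sketch of the defensive-close idea, not TraceML's code: DefensiveBracket and run are hypothetical, and the loop only imitates the order in which a trainer would call the hooks. A step left open by a crash is closed as incomplete the next time on_train_begin or on_step_begin fires.

```python
class DefensiveBracket:
    """Hypothetical callback-style bracket that tolerates a missing on_step_end."""

    def __init__(self):
        self.open_step = None  # index of the step currently being timed
        self.closed = []       # (step, complete) pairs in the order they closed

    def _close(self, complete):
        if self.open_step is not None:
            self.closed.append((self.open_step, complete))
            self.open_step = None

    def on_train_begin(self):
        self._close(complete=False)  # leftover from a previous crashed run

    def on_step_begin(self, step):
        self._close(complete=False)  # leftover means on_step_end never ran
        self.open_step = step

    def on_step_end(self):
        self._close(complete=True)

    def on_train_end(self):
        self._close(complete=False)


def run(bracket, num_steps, fail_at=None):
    """Call the hooks in training-loop order; a failure skips on_step_end."""
    bracket.on_train_begin()
    for step in range(num_steps):
        bracket.on_step_begin(step)
        if step == fail_at:
            raise RuntimeError("simulated training_step failure")
        bracket.on_step_end()
    bracket.on_train_end()


bracket = DefensiveBracket()
try:
    run(bracket, num_steps=3, fail_at=1)
except RuntimeError:
    pass  # step 1 is still open: on_step_end and on_train_end never ran

run(bracket, num_steps=2)  # reuse the same instance for a second run
print(bracket.closed)  # [(0, True), (1, False), (0, True), (1, True)]
```

The failed step still appears in the record, but it is flagged as incomplete, which matches the caveat above that timing or memory for that step may be partial.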

Troubleshooting

Terminal output overlaps with TraceML

Set disable_tqdm=True in TrainingArguments.

If output is still noisy, use browser dashboard mode on single-node runs:

pip install "traceml-ai[dashboard]"
traceml run fine_tune.py --mode=dashboard

Multi-GPU run only shows one rank

Make sure you launched through TraceML with --nproc-per-node, not plain python:

traceml run fine_tune.py --nproc-per-node=4
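
To confirm how many processes were actually started, each worker can report its rank from inside the script. This sketch assumes the launcher exports the torchrun-style environment variables RANK, LOCAL_RANK, and WORLD_SIZE; this page only states that the --disable-traceml path goes through torchrun, so treat that as an assumption for regular traceml run launches. The describe_rank helper is hypothetical.

```python
import os


def describe_rank(env=None):
    """Read torchrun-style rank variables, defaulting to a single process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


info = describe_rank()
print(f"rank {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```

If the assumption holds, a launch with --nproc-per-node=4 should print four distinct lines, one per rank. A single line reporting a world size of 1 suggests the script was started with plain python.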

I want a baseline without TraceML

Run the same script with TraceML disabled:

traceml run fine_tune.py --disable-traceml

This launches your script natively through torchrun without TraceML telemetry.

Full Examples

Use these examples when you want a complete runnable script. If you already have a Hugging Face training script, start with the smaller replacement pattern above.

NLP classification example

Save as `fine_tune_nlp.py`:

import os

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

from traceml_ai.integrations import huggingface as traceml_hf


def main():
    traceml_hf.init()

    model_name = "prajjwal1/bert-mini"
    output_dir = "./hf_nlp_output"
    os.makedirs(output_dir, exist_ok=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=4,
    ).to(device)

    raw_dataset = load_dataset("ag_news", split="train[:2000]")

    def tokenize(examples):
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=64,
        )

    dataset = raw_dataset.map(tokenize, batched=True)

    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=32,
        num_train_epochs=3,
        logging_steps=10,
        save_strategy="no",
        use_cpu=(device == "cpu"),
        report_to="none",
        disable_tqdm=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        callbacks=[traceml_hf.TraceMLTrainerCallback()],
    )

    trainer.train()


if __name__ == "__main__":
    main()

Run with:

traceml run fine_tune_nlp.py

Vision classification example

Save as `fine_tune_vision.py`:

import os

import torch
from datasets import load_dataset
from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ToTensor
from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    DefaultDataCollator,
    Trainer,
    TrainingArguments,
)

from traceml_ai.integrations import huggingface as traceml_hf


def main():
    traceml_hf.init()

    model_name = "google/vit-base-patch16-224-in21k"
    output_dir = "./hf_vision_output"
    os.makedirs(output_dir, exist_ok=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    image_processor = AutoImageProcessor.from_pretrained(model_name)
    model = AutoModelForImageClassification.from_pretrained(
        model_name,
        num_labels=10,
    ).to(device)

    dataset = load_dataset("cifar10", split="train[:2000]")
    transform = Compose(
        [
            RandomResizedCrop(
                image_processor.size["height"],
                scale=(0.8, 1.0),
            ),
            ToTensor(),
            Normalize(
                mean=image_processor.image_mean,
                std=image_processor.image_std,
            ),
        ]
    )

    def preprocess(example):
        image = example["img"].convert("RGB")
        example["pixel_values"] = transform(image)
        example["labels"] = example["label"]
        return example

    dataset = dataset.map(preprocess)

    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=16,
        num_train_epochs=2,
        logging_steps=10,
        save_strategy="no",
        report_to="none",
        disable_tqdm=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=DefaultDataCollator(),
        callbacks=[traceml_hf.TraceMLTrainerCallback()],
    )

    trainer.train()


if __name__ == "__main__":
    main()

Run with:

traceml run fine_tune_vision.py

Reference

init() takes no arguments. Call it once before constructing the Trainer to install TraceML's process-wide patches (DataLoader fetch timing, H2D Tensor.to, and the forward/backward/optimizer auto-timers). It is idempotent and returns the effective TraceMLInitConfig.

TraceMLTrainerCallback() takes no TraceML-specific arguments and records standard step-level timing and memory.

TraceMLTrainer (legacy thin wrapper) accepts:

  • everything that normal transformers.Trainer accepts
  • traceml_enabled=True|False

Next Steps