Find PyTorch DataLoader bottlenecks¶

Use this guide when PyTorch training feels slow, GPU utilization is low, or distributed workers spend time waiting for an input rank to catch up.

A DataLoader bottleneck means the input path is taking enough time to affect training throughput. In TraceML, start by checking whether step time is going to dataloader fetch, host-to-device transfer, compute, wait time, or rank skew.

For whole-run triage, start with Find why PyTorch training is slow. This page stays focused on the DataLoader and input-fetch path.

Symptoms¶

A PyTorch DataLoader bottleneck can look like:

low or uneven GPU utilization
long gaps between training steps
real data is slower than synthetic data
batch collation, decoding, tokenization, or preprocessing feels expensive
one DDP or FSDP rank reaches compute later than the others
changing num_workers or input preprocessing changes throughput

Low GPU utilization alone is not proof of a DataLoader bottleneck. Confirm where step time goes before changing the input pipeline.

Run TraceML¶

If your script is not instrumented yet, start with the Quickstart.

Run your training script in the default summary mode:

traceml run train.py

TraceML writes:

logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt

You can re-print the saved text summary later:

traceml view logs/<run_name>/final_summary.json

What to look for¶

Start with the Step Time section.

For DataLoader problems, the most relevant diagnoses are:

INPUT-BOUND: dataloader or input work is taking a large share of the typical step
INPUT STRAGGLER: one rank has meaningfully more dataloader burden than a typical rank

Example:

Step Time
- Diagnosis: INPUT STRAGGLER
- Stats: total 303.7ms | input 254.5ms | compute 259.5ms | wait 40.5ms
- Why: r0 input was slower than median global rank (254.5/3.8ms).

Read this as:

input time is large enough to affect training speed
rank 0 is slower in the input path than the typical rank
other ranks may wait because distributed training follows the slowest rank

If the diagnosis is INPUT-BOUND, inspect the whole input path. If the diagnosis is INPUT STRAGGLER, inspect the called-out rank first.

Check the input path¶

Change one thing at a time, then rerun TraceML.

Good first checks:

increase DataLoader(num_workers=...) gradually
reduce expensive CPU transforms, decoding, tokenization, or collation
move repeated preprocessing out of the training loop
check slow storage, network filesystems, or uneven dataset shards
compare against a synthetic-data run
in DDP or FSDP, inspect the worst input rank for host-side jitter or uneven batches

For CUDA training, also check whether your existing input path uses pin_memory=True and non-blocking host-to-device transfer where appropriate. TraceML separates dataloader fetch, H2D, compute, and wait time. Use that split to avoid treating a transfer, compute, or wait issue as a DataLoader issue.

Compare before and after¶

After changing the input path, compare the old and new final summaries:

traceml compare old_run/final_summary.json new_run/final_summary.json

Use the compare output to check whether total step time, input time, wait time, or the diagnosis changed.

When this is not the right guide¶

Use a different guide when the primary symptom is not DataLoader fetch time:

low or moderate GPU utilization without an input diagnosis: Debug Low GPU Utilization
distributed rank skew beyond the input path: Debug DDP Rank Stragglers
rising GPU memory: Find PyTorch Memory Creep
run-to-run slowdown after a change: Compare Runs

Custom loaders and Ray Data¶

TraceML automatically instruments torch.utils.data.DataLoader when initialized with traceml.init(mode="auto").

If your input iterator is not a PyTorch DataLoader, wrap the fetch path:

train_loader = traceml.wrap_dataloader_fetch(train_loader)

This is the pattern used for Ray Data iterators, because Ray iter_torch_batches(...) is not a PyTorch DataLoader.

When to use a heavier profiler¶

Use TraceML first to decide whether the input path is likely the problem.

Use torch.profiler when you need operator-level or timeline detail for a specific window. Use Nsight Systems when you need lower-level CUDA or system timeline detail.

TraceML does not replace those tools. It tells you whether the next profiler run should focus on DataLoader fetches, H2D copies, a specific rank, or a specific slow window.