Running on Slurm¶

This guide shows the recommended way to run TraceML on a Slurm-managed GPU cluster, and how to derive TraceML's launch flags from the Slurm environment.

If you are new to multi-node TraceML, read Distributed Training first. This page focuses on the Slurm-specific glue.

A ready-to-use template lives in examples/slurm/: traceml_ddp.sbatch (the job) and launch.sh (the per-node wrapper).

Mental model¶

TraceML's launcher wraps torchrun (torch.distributed.run) and adds a telemetry aggregator:

Run one traceml run per node. TraceML spawns the per-GPU workers itself (one per GPU), so you do not start one task per GPU under Slurm.
Node 0 owns the aggregator. It is the first host in the allocation and also serves as the torchrun rendezvous master. Every other node streams its telemetry to node 0.
Use --mode=summary for multi-node runs. The live cli and dashboard modes are intended for single-node use.

Map Slurm variables to TraceML flags¶

TraceML flag	Slurm source	Notes
`--nnodes`	`$SLURM_NNODES`	Number of nodes in the allocation.
`--node-rank`	`$SLURM_NODEID`	Per-node rank `0..N-1`. Valid when you run one task per node.
`--nproc-per-node`	`$SLURM_GPUS_ON_NODE`	GPUs allocated on this node → one worker per GPU.
`--master-addr`	first host of `$SLURM_JOB_NODELIST`	`scontrol show hostnames "$SLURM_JOB_NODELIST" \\| head -n 1`.
`--run-name`	e.g. `ddp-$SLURM_JOB_ID`	Required for multi-node; must match on every node.

Why SLURM_GPUS_ON_NODE?

Prefer SLURM_GPUS_ON_NODE over SLURM_GPUS_PER_NODE. The latter is only set when you request GPUs with --gpus* flags and can carry a type prefix (for example a100:4), which is not a plain integer. SLURM_GPUS_ON_NODE is always an integer count of the GPUs on the node.

For CPU-only or task-based allocations, use $SLURM_NTASKS_PER_NODE instead as the worker count per node.

Deriving the master address¶

--master-addr (and the aggregator that workers connect to) must be node 0. Resolve it from the allocation:

export MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"

If the hostname is not routable between nodes on your cluster, resolve it to an IP address instead:

export MASTER_ADDR="$(srun --nodes=1 --ntasks=1 -w "$MASTER_ADDR" hostname --ip-address)"

The per-node expansion gotcha¶

TraceML uses static rendezvous: each node is told its rank with --node-rank. There is no automatic rank election, so every node must compute its own rank.

$SLURM_NODEID is set per node, but it is only expanded correctly if expansion happens on each node. This does not work:

# WRONG: the batch shell on node 0 expands $SLURM_NODEID once (to 0) for every
# task, so all nodes claim rank 0 and the run never rendezvous.
srun traceml run train.py --node-rank=$SLURM_NODEID ...

Instead, put the command in a small wrapper script that srun runs on each node, so the variable is expanded per node:

# traceml_ddp.sbatch
srun examples/slurm/launch.sh

# launch.sh (runs once per node)
exec traceml run examples/distributed/ddp_minimal.py \
  --mode=summary \
  --run-name="${RUN_NAME}" \
  --nnodes="${SLURM_NNODES}" \
  --node-rank="${SLURM_NODEID}" \
  --nproc-per-node="${SLURM_GPUS_ON_NODE}" \
  --master-addr="${MASTER_ADDR}" \
  --master-port=29500

MASTER_ADDR and RUN_NAME are identical on every node, so the sbatch file exports them once and Slurm propagates them to the wrapper.

The network and aggregator model¶

A multi-node TraceML run uses two TCP endpoints, both on node 0:

Purpose	Default port	Who connects
`torchrun` rendezvous (`--master-port`)	`29500`	All ranks reach node 0 to set up the process group.
TraceML aggregator (`--aggregator-port`)	`29765`	Worker nodes stream telemetry to node 0.

Node 0 starts the aggregator and, for multi-node runs, binds it to 0.0.0.0 by default so other nodes can connect.
Worker nodes connect to the aggregator at --master-addr:29765 by default (the aggregator's connect host defaults to --master-addr).

Firewall

Both ports must be reachable on node 0 from the other nodes: the torchrun master port (29500) and the TraceML aggregator port (29765). On clusters with host firewalls, allow inbound traffic to node 0 on both.

When to use `--aggregator-host`¶

Set --aggregator-host when workers should send telemetry to a different reachable address than --master-addr — for example when node 0 has a separate management/data interface or hostname that the other nodes should use for telemetry. Pass the same value on every node:

traceml run train.py ... --aggregator-host=node0-data.cluster.local

When to use `--aggregator-bind-host`¶

Multi-node runs already bind the aggregator to 0.0.0.0, so you usually do not need this. Set --aggregator-bind-host only to pin the aggregator to a specific interface on node 0 instead of all interfaces:

traceml run train.py ... --aggregator-bind-host=10.0.0.5

Minimal PyTorch DDP command¶

A single node's launch (what the wrapper runs) looks like this:

traceml run examples/distributed/ddp_minimal.py \
  --mode=summary \
  --run-name=ddp-demo \
  --nnodes=2 \
  --node-rank=0 \
  --nproc-per-node=4 \
  --master-addr=node0 \
  --master-port=29500

examples/distributed/ddp_minimal.py is a runnable DDP script that already calls traceml.init(...) and traceml.trace_step(...).

Full template¶

examples/slurm/traceml_ddp.sbatch

#!/bin/bash
#SBATCH --job-name=traceml-ddp
#SBATCH --nodes=2                 # number of nodes -> --nnodes
#SBATCH --ntasks-per-node=1       # ONE TraceML launcher per node (it spawns the GPU workers)
#SBATCH --gres=gpu:4              # all GPUs on each node; some clusters use --gpus-per-node=4 instead
#SBATCH --cpus-per-task=16        # CPU cores for dataloading; adjust to your node
#SBATCH --time=00:30:00
#SBATCH --output=traceml-%j.out

set -euo pipefail

# --- Cluster-specific setup (run BEFORE srun) ---
# module load cuda/12.x
# source ~/miniconda3/bin/activate myenv
# export NCCL_SOCKET_IFNAME=eth0
# export NCCL_DEBUG=INFO

export MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
export RUN_NAME="ddp-${SLURM_JOB_ID}"

cd "$SLURM_SUBMIT_DIR"
srun examples/slurm/launch.sh

Submit it from the repository root:

sbatch examples/slurm/traceml_ddp.sbatch

Where to find results¶

Node 0's aggregator writes the run report to:

<logs-dir>/<run-name>/final_summary.json
<logs-dir>/<run-name>/final_summary.txt

--html-report adds final_summary.html in the same directory. As with the JSON/TXT artifacts, it is written only on the aggregator-owner node (node 0); passing the flag on every node in a symmetric launch command is harmless.

--logs-dir defaults to ./logs relative to the submit directory. Put it on a shared filesystem (for example your scratch space) so the summary is reachable after the job ends. You can compare two runs later with traceml compare.

Large jobs can spend time flushing late telemetry and checkpointing SQLite at shutdown. If your cluster has slow scratch storage or delayed cross-node TCP delivery, pass --finalize-timeout-sec <seconds> in the shared launch command or set TRACEML_FINALIZE_TIMEOUT_SEC.

Cluster-specific notes¶

Adapt these for your cluster

Launcher. This template uses srun to start one task per node. Some sites wrap jobs differently (mpirun, site launcher scripts); the requirement is just that one traceml run starts per node and that $SLURM_NODEID is expanded per node.
GPU request. --gres=gpu:N and --gpus-per-node=N are both common; use whichever your cluster accepts. Keep --ntasks-per-node=1 so the single task sees all the node's GPUs.
Environment. Run module load / conda activate in the sbatch file before srun. If traceml is not found on the worker nodes, also activate the environment inside launch.sh.
Shared filesystem. launch.sh, your training script, and --logs-dir should live on storage visible to every node.
Firewall. Allow inbound traffic to node 0 on the torchrun master port (29500) and the aggregator port (29765).
NCCL. If multi-node NCCL hangs or fails to connect, pin the network interface with export NCCL_SOCKET_IFNAME=<iface> and debug with export NCCL_DEBUG=INFO.
Aggregator startup window. Node 0 must start its aggregator promptly after srun launches the tasks. Keep slow setup (module load, conda activate) before the srun line so the workers do not give up waiting for the aggregator.

No extra runtime dependency is required beyond Slurm and your existing TraceML install.