TraceML architecture overview

TraceML runs as three cooperating processes during a training job. The CLI spawns an aggregator server and one or more training ranks via torchrun. Training ranks run user code in-process with TraceML hooks attached; telemetry is shipped over TCP to the aggregator, which renders the unified view.
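The spawn topology can be sketched roughly as follows. This is a minimal illustration, not the real CLI: the module path `traceml.aggregator`, the `--port` flag, and the `build_commands` helper are assumptions; the actual logic lives in `src/traceml/cli.py`.

```python
import sys

# Hypothetical sketch of the spawn sequence. Module names and flags
# are illustrative, not TraceML's real CLI surface.
def build_commands(script: str, nproc: int = 2, port: int = 9999):
    # The aggregator server runs as its own process...
    aggregator_cmd = [sys.executable, "-m", "traceml.aggregator",
                      "--port", str(port)]
    # ...while torchrun spawns the training ranks, which run user code
    # in-process with TraceML hooks and ship telemetry to the aggregator.
    train_cmd = ["torchrun", f"--nproc-per-node={nproc}", script]
    return aggregator_cmd, train_cmd
```

In a real launcher the aggregator command would be started first (e.g. via `subprocess.Popen`) and terminated once the torchrun job exits.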

Telemetry data flow

```mermaid
flowchart LR
    subgraph "Per-rank training process"
        S[Sampler] -->|append| DB[In-memory Database]
        DB -->|new rows| Sender[DBIncrementalSender]
    end
    Sender -->|length-prefixed msgpack| TCP([TCP])
    TCP --> RS[RemoteDBStore]
    subgraph "Aggregator process"
        RS --> R[Renderer]
        R --> UI[CLI / NiceGUI driver]
    end
```
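The wire format in the middle of the diagram is length-prefixed framing. A minimal sketch, with plain bytes standing in for the msgpack payload so the example has no third-party dependency (the `frame`/`unframe` names are illustrative, not TraceML's transport API):

```python
import struct

# Each message is a 4-byte big-endian length header followed by the
# payload, so the receiver knows exactly how many bytes to read from
# the TCP stream before decoding.
def frame(payload: bytes) -> bytes:
    return struct.pack(">I", len(payload)) + payload

def unframe(buf: bytes) -> tuple[bytes, bytes]:
    # Returns (payload, remaining bytes) once a full frame is buffered.
    (length,) = struct.unpack_from(">I", buf)
    end = 4 + length
    return buf[4:end], buf[end:]
```

Framing like this lets the aggregator pull complete messages out of a stream socket even when several arrive back-to-back in one `recv`.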

Each sampler maintains a per-rank, per-table append counter, so the sender ships only rows added since its last transmission. The aggregator's RemoteDBStore keeps each rank's data separate, and renderers pull read-only views from it.
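The incremental-cursor idea can be sketched like this, assuming deque-backed bounded tables; the class names here are illustrative stand-ins, not TraceML's actual `Database` or `DBIncrementalSender` classes:

```python
from collections import deque

class Table:
    """Bounded in-memory table: oldest rows are evicted at maxlen."""
    def __init__(self, maxlen: int = 1000):
        self.rows = deque(maxlen=maxlen)
        self.appended = 0          # monotonically increasing append counter

    def append(self, row) -> None:
        self.rows.append(row)
        self.appended += 1

class IncrementalSender:
    """Remembers a per-table high-water mark and ships only new rows."""
    def __init__(self):
        self.sent: dict[str, int] = {}

    def new_rows(self, name: str, table: Table) -> list:
        cursor = self.sent.get(name, 0)
        fresh = table.appended - cursor
        self.sent[name] = table.appended
        # Only the last `fresh` rows were appended since the previous send.
        return list(table.rows)[-fresh:] if fresh else []
```

Because the cursor tracks the total append count rather than a deque index, eviction of old rows does not disturb what has or has not been sent.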

Layers

| Layer | Directory | Responsibility |
| --- | --- | --- |
| CLI | `src/traceml/cli.py` | Argument parsing, process spawning, signal handling |
| Runtime | `src/traceml/runtime/` | In-process agent per rank; user-script executor |
| Aggregator | `src/traceml/aggregator/` | TCP server, unified store, display orchestration |
| Samplers | `src/traceml/samplers/` | Periodic telemetry collection (timing, memory, system) |
| Database | `src/traceml/database/` | Bounded in-memory tables; rank-aware remote store |
| Transport | `src/traceml/transport/` | TCP bidirectional + DDP rank detection |
| Renderers | `src/traceml/renderers/` | Transform stored data into Rich/Plotly output |
| Display drivers | `src/traceml/aggregator/display_drivers/` | CLI vs NiceGUI output medium |
| Decorators | `src/traceml/decorators.py` | User-facing instrumentation entry points |
| Integrations | `src/traceml/integrations/` | Hugging Face + Lightning adapters |
| Utils | `src/traceml/utils/` | Hooks, patches, memory/timing helpers |

For the user-facing API surface (trace_step, TraceMLTrainer, TraceMLCallback, CLI usage), see the Public API. The source tree above is the canonical reference for internals: start from the entry points and follow the imports.

Design principles

  • Fail-open — training must never crash because telemetry broke. Sampler and transport errors are logged and execution continues.
  • Bounded overhead — every new sampler must justify its cost. Deque-based bounded tables evict the oldest records at a fixed maxlen.
  • Process isolation — no shared memory. TCP + env vars only.
  • Out-of-process UI — aggregator crashes don't crash training.
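The fail-open rule can be sketched as a wrapper around each sampler call. The `fail_open` decorator below is an illustrative pattern, not TraceML's actual error-handling code:

```python
import functools
import logging

logger = logging.getLogger("traceml")

# Telemetry failures are logged and swallowed so training never crashes.
def fail_open(default=None):
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                logger.exception("telemetry error in %s; continuing",
                                 fn.__name__)
                return default
        return wrapper
    return decorate

@fail_open(default={})
def sample_gpu_memory():
    raise RuntimeError("driver hiccup")  # simulated sampler failure
```

Calling `sample_gpu_memory()` logs the traceback and returns the default instead of propagating the exception into the training loop.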