Runtime GPU Telemetry · Closed-Loop Power Management

Don't wake up an engineer.
Let the cluster heal itself.

HBMGuard is a bare-metal C++ agent that detects GPU thermal throttling in real time, cuts power draw by hundreds of watts, and automatically drains and restores Slurm nodes, all without touching your application code.


hbmguard_agent · node-a100-01 · live

[2026-06-18 08:41:03] HBMGuard v0.9 — DCGM Profiling API connected

[08:41:05] SM_CLOCK: 1410 MHz DRAM_ACTIVE: 94% PWR: 398W ⚠ Memory Wall detected

[08:41:05] ECO_CTRL: cutting power limit → 150W

[08:41:08] SM_CLOCK degraded (sample 1/3): 812 MHz

[08:41:11] SM_CLOCK degraded (sample 2/3): 798 MHz

[08:41:14] SM_CLOCK degraded (sample 3/3): 805 MHz → threshold crossed

[08:41:14] SLURM: scontrol update node=a100-01 state=drain reason="HBMGuard:thermal_wall"

[08:42:31] SM_CLOCK recovered (sample 3/3): 1395 MHz → healthy

[08:42:31] SLURM: scontrol update node=a100-01 state=resume

[08:42:31] ✓ Node back in pool. Elapsed: 88s. Engineers paged: 0. █

250W

Typical power cut
during Memory Wall

<15ms

Control loop latency
C++ bare-metal

3-sample

False-positive shield
before draining node

0

Application code
changes required

Thermal throttling is a silent tax
on your GPU compute

VISIBILITY

You're paying for full utilization

HBM thermal throttling degrades compute silently. Your dashboards show "GPU busy" while SM Clock quietly drops 30–50%. The bill doesn't change. The throughput does.

TOOLING

NVML can't see what DCGM can

Standard monitoring tools (node_exporter, nvidia-smi) expose power and temperature. They miss the SM occupancy vs. DRAM_ACTIVE dissonance that defines the Memory Wall — the actual throttle trigger.

OPERATIONS

Slurm doesn't know the node is struggling

Even with perfect local telemetry, a throttled node keeps receiving new jobs from the scheduler. The thermal event compounds. Eventually a human gets paged — at 2am — to run scontrol drain. HBMGuard closes this loop automatically, before the page goes out.

The control loop, explained

DETECT

Memory Wall detection via DCGM Profiling API

The agent samples SM occupancy, DRAM_ACTIVE bandwidth, and power draw every ~2 seconds using DCGM's Profiling API — bypassing NVML's limitations. A Memory Wall condition is flagged when DRAM_ACTIVE is saturated while SM efficiency collapses, creating the characteristic power-vs-work dissonance.
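The detection rule described above can be sketched as a small predicate. This is a minimal illustration, not HBMGuard's actual code: the struct names, field names, and all three threshold values are assumptions chosen only to make the example concrete, and real thresholds would have to be tuned per GPU model and workload.

```cpp
#include <cassert>

// One telemetry sample. Field names are illustrative assumptions.
struct GpuSample {
    double sm_efficiency;  // fraction of useful SM activity, 0.0 to 1.0
    double dram_active;    // fraction of cycles the memory interface is busy, 0.0 to 1.0
    double power_watts;    // board power draw
};

// Hypothetical thresholds, not values taken from HBMGuard.
struct MemoryWallThresholds {
    double dram_saturated = 0.90;   // DRAM_ACTIVE at or above this counts as saturated
    double sm_collapsed = 0.40;     // SM efficiency at or below this counts as collapsed
    double min_power_watts = 300.0; // only flag when the GPU is still drawing high power
};

// True when memory is saturated, SM efficiency has collapsed, and power draw
// is still high: the "power without work" pattern.
inline bool is_memory_wall(const GpuSample& s,
                           const MemoryWallThresholds& t = MemoryWallThresholds{}) {
    assert(t.dram_saturated > 0.0 && t.dram_saturated <= 1.0);
    return s.dram_active >= t.dram_saturated &&
           s.sm_efficiency <= t.sm_collapsed &&
           s.power_watts >= t.min_power_watts;
}
```

With these assumed thresholds, a sample like the one in the log above (DRAM_ACTIVE at 94%, 398W) is flagged only if SM efficiency is also low; a busy GPU doing useful compute is not.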

ACT

Closed-loop power cut: 400W → 150W in milliseconds

On detection, the C++ ECO Controller issues an NVML power cap command. The power cut forces the GPU to operate within its thermal envelope without interrupting the running workload. This behavior has been confirmed on a live GCP A100 Spot instance, with Grafana validation.
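A power cap request of this kind might be structured as follows. This is a sketch under stated assumptions: `PowerLimits`, `PowerActuator`, `clamp_power_watts`, and `apply_power_cap` are hypothetical names, and the limits used in the usage note are made up. The actuator is injected as a callback so the example runs without a GPU; in a real agent it could wrap NVML's `nvmlDeviceSetPowerManagementLimit`, which takes its limit in milliwatts.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

// Allowed power limit range in watts. On real hardware this range would be
// queried from the driver; here it is supplied by the caller.
struct PowerLimits {
    unsigned min_watts;
    unsigned max_watts;
};

// Callback that applies a limit given in milliwatts and reports success.
// Hypothetical seam: a real agent could forward to
// nvmlDeviceSetPowerManagementLimit here.
using PowerActuator = std::function<bool(unsigned milliwatts)>;

// Keep the requested cap inside the range the device accepts.
inline unsigned clamp_power_watts(unsigned requested, const PowerLimits& limits) {
    assert(limits.min_watts <= limits.max_watts);
    return std::max(limits.min_watts, std::min(requested, limits.max_watts));
}

// Clamp, convert watts to milliwatts, and hand the value to the actuator.
inline bool apply_power_cap(unsigned requested_watts,
                            const PowerLimits& limits,
                            const PowerActuator& set_limit_mw) {
    const unsigned watts = clamp_power_watts(requested_watts, limits);
    return set_limit_mw(watts * 1000u);
}
```

Calling `apply_power_cap(150, PowerLimits{100, 400}, actuator)` passes 150000 to the actuator. Clamping first means a misconfigured target is corrected locally instead of being rejected by the driver in the middle of a thermal event.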

CORDON

3-sample shield → Slurm DRAIN, zero false positives

If SM Clock degradation persists across 3 consecutive samples, the agent issues a Slurm drain with a machine-readable reason string. New jobs are blocked. The 3-sample buffer filters out transient spikes before any drain is issued, a deliberate tradeoff tuned from live workload data.
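The 3-sample shield amounts to a consecutive-sample counter that resets on any healthy reading. The sketch below is one way to write it, assuming hypothetical names (`ConsecutiveFilter`, `drain_command`); it only builds the `scontrol` command string and leaves execution to the caller.

```cpp
#include <cassert>
#include <string>

// Fires only after `required` matching samples arrive in a row.
class ConsecutiveFilter {
public:
    explicit ConsecutiveFilter(unsigned required) : required_(required) {
        assert(required > 0);
    }

    // Feed one sample; returns true once the streak reaches the requirement.
    bool update(bool condition) {
        if (!condition) {
            count_ = 0;  // a single healthy sample breaks the streak
        } else if (count_ < required_) {
            ++count_;    // saturate at `required_` to avoid overflow
        }
        return count_ >= required_;
    }

private:
    unsigned required_;
    unsigned count_ = 0;
};

// Build the drain command with a machine-readable reason string.
inline std::string drain_command(const std::string& node, const std::string& reason) {
    return "scontrol update NodeName=" + node + " State=DRAIN Reason=\"" + reason + "\"";
}
```

A sequence of degraded, healthy, degraded, degraded does not fire, because the healthy sample resets the streak; only three degraded readings in a row do.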

RECOVER

Auto-RESUME the moment hardware is healthy

Recovery applies the same 3-sample logic in reverse. Three consecutive healthy SM Clock readings trigger an automatic state=resume. The node rejoins the cluster pool without human intervention. The full IDLE → DRAIN → IDLE state machine is validated and running.
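The IDLE → DRAIN → IDLE cycle with symmetric 3-sample logic can be sketched as a two-state machine. The type names (`NodeState`, `Action`, `NodeHealthFsm`) are assumptions for illustration, and the sketch returns the action to take instead of invoking Slurm itself.

```cpp
#include <cassert>

enum class NodeState { Idle, Drain };
enum class Action { None, Drain, Resume };

// Two-state machine: leaves the current state only after `required_samples`
// consecutive samples argue for the other state.
class NodeHealthFsm {
public:
    explicit NodeHealthFsm(unsigned required_samples) : required_(required_samples) {
        assert(required_samples > 0);
    }

    Action on_sample(bool clock_degraded) {
        // In Idle, degraded samples vote for a change; in Drain, healthy ones do.
        const bool votes_for_change =
            (state_ == NodeState::Idle) ? clock_degraded : !clock_degraded;
        if (!votes_for_change) {
            streak_ = 0;
            return Action::None;
        }
        if (++streak_ < required_) {
            return Action::None;
        }
        streak_ = 0;
        if (state_ == NodeState::Idle) {
            state_ = NodeState::Drain;
            return Action::Drain;
        }
        state_ = NodeState::Idle;
        return Action::Resume;
    }

    NodeState state() const { return state_; }

private:
    unsigned required_;
    unsigned streak_ = 0;
    NodeState state_ = NodeState::Idle;
};
```

Using the same streak requirement in both directions gives the loop hysteresis: a node that has just been drained cannot bounce back into the pool on a single good reading.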

Built for the cluster,
not the dashboard

What's running today

C++ ECO/Power Controller

DCGM Profiling API · NVML actuator · <15ms loop

Slurm Integration

IDLE → DRAIN → IDLE state machine · drain reason strings

Prometheus Exporter

4 real metrics · flush_thread · zero hot-path overhead

SQLite Telemetry Store

WAL mode · async writer · real DCGM data confirmed
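One way to keep an exporter off the hot path, as the Prometheus Exporter item above describes, is to have the control loop write plain atomics and let a separate flush thread format them. The sketch below assumes that design; the struct, the function names, and the three metric names are hypothetical, and the thread itself is omitted so that only the two sides of the handoff are shown.

```cpp
#include <atomic>
#include <cassert>
#include <sstream>
#include <string>

// Shared between the control loop (writer) and the flush thread (reader).
struct MetricsSnapshot {
    std::atomic<double> sm_clock_mhz{0.0};
    std::atomic<double> dram_active{0.0};
    std::atomic<double> power_watts{0.0};
};

// Hot path: three relaxed stores, no allocation, no I/O.
// std::atomic<double> is lock-free on mainstream 64-bit platforms,
// though the standard does not guarantee it.
inline void publish(MetricsSnapshot& m, double sm_clock_mhz,
                    double dram_active, double power_watts) {
    assert(dram_active >= 0.0 && dram_active <= 1.0);
    m.sm_clock_mhz.store(sm_clock_mhz, std::memory_order_relaxed);
    m.dram_active.store(dram_active, std::memory_order_relaxed);
    m.power_watts.store(power_watts, std::memory_order_relaxed);
}

// Flush side: render the current values in Prometheus text exposition format.
inline std::string render(const MetricsSnapshot& m) {
    std::ostringstream out;
    auto gauge = [&out](const char* name, double value) {
        out << "# TYPE " << name << " gauge\n" << name << ' ' << value << '\n';
    };
    gauge("hbmguard_sm_clock_mhz", m.sm_clock_mhz.load(std::memory_order_relaxed));
    gauge("hbmguard_dram_active_ratio", m.dram_active.load(std::memory_order_relaxed));
    gauge("hbmguard_power_watts", m.power_watts.load(std::memory_order_relaxed));
    return out.str();
}
```

Because the reader only loads values and never writes back, this layout also matches the read-only principle: nothing on the observability side can influence a control decision.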

Architectural principles

C++ controls the hot path

No network dependency for emergency power cuts

Prometheus is read-only

Observability layer never triggers control decisions

Zero application changes

Deploy as a sidecar daemon; workloads are unaware

SM Clock, not temperature

Temperature is a lagging indicator; clock degradation is the signal

Early Access · Pilot Program

Run HBMGuard on
your cluster

We're working with a small number of AI inference providers and HPC operators to validate HBMGuard on their GPU fleets. If you're running A100 / H100 / H200 nodes on Slurm and you're tired of thermal surprises, let's talk.
