NVIDIA Fleet Intelligence: GPU monitoring with attestation

NVIDIA Fleet Intelligence is a managed service that monitors large fleets of NVIDIA data center GPUs in real time (power, temperature, performance, and ECC errors) and cryptographically verifies GPU authenticity through the NVIDIA Remote Attestation Service. The service is free for owners of Vera Rubin, Blackwell, and Hopper GPUs.

NVIDIA has announced Fleet Intelligence, a managed service for real-time monitoring of large GPU fleets. The service covers five key monitoring areas: power consumption and throttling, temperature and thermal issues, performance and bottlenecks, hardware health (ECC errors, retired pages, NVLink anomalies), and configuration consistency.
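
To make the monitoring areas concrete, here is a minimal sketch of how per-GPU samples could be triaged for thermal, ECC, and configuration issues. It is not the Fleet Intelligence agent or its API: the `GpuSample` fields, the `find_issues` helper, the 85 °C threshold, and the driver version strings are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One hypothetical telemetry sample for a single GPU node."""
    node: str
    power_watts: float
    temperature_c: float
    utilization_pct: float
    ecc_uncorrectable: int
    driver_version: str


def find_issues(samples, max_temp_c=85.0):
    """Return {node: [issue, ...]} for samples that look unhealthy."""
    if not samples:
        return {}
    # Treat the most common driver version as the fleet baseline.
    baseline, _ = Counter(s.driver_version for s in samples).most_common(1)[0]
    issues = {}
    for s in samples:
        found = []
        if s.temperature_c > max_temp_c:
            found.append("thermal")
        if s.ecc_uncorrectable > 0:
            found.append("ecc")
        if s.driver_version != baseline:
            found.append("config-drift")
        if found:
            issues[s.node] = found
    return issues


# Illustrative values only; none of these numbers come from the announcement.
fleet = [
    GpuSample("node-a", 650.0, 71.0, 98.0, 0, "570.10"),
    GpuSample("node-b", 640.0, 92.0, 97.0, 0, "570.10"),
    GpuSample("node-c", 655.0, 70.0, 99.0, 2, "565.57"),
]
print(find_issues(fleet))
```

Using the majority driver version as the baseline is one simple way to express "configuration consistency"; a real fleet would more likely compare against a pinned, approved version.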

Cryptographic integrity attestation

What most sets the service apart from conventional GPU monitoring stacks is cryptographic verification of GPU authenticity, built on NVIDIA Confidential Computing technology. A local agent collects runtime measurements (firmware digests, configurations, and states), which the GPU digitally signs with its hardware key. The signature is then verified through the NVIDIA Remote Attestation Service (NRAS), proving that the GPU is authentic NVIDIA hardware in a known, unmodified state.

For organizations running multi-tenant inference or confidential ML training, this verification closes off a class of attacks that rely on substituted or modified hardware.
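
The sign-then-verify flow described in this section can be sketched in a few lines. This is a simplified illustration, not the NRAS protocol or the Attestation SDK: the hardware key is replaced by an HMAC secret so the example runs with the Python standard library alone (a real device would sign with an asymmetric key), the reference digests are made up, and the per-request nonce is an assumption borrowed from common attestation designs rather than something the announcement states.

```python
import hashlib
import hmac
import secrets

# Stand-in for the GPU's hardware key. HMAC is a substitute used only to
# keep this sketch stdlib-only; it is not how the real device signs.
DEVICE_KEY = secrets.token_bytes(32)

# Hypothetical known-good firmware digests that the verifier trusts.
REFERENCE_DIGESTS = {hashlib.sha256(b"firmware-v1").hexdigest()}


def gpu_sign(measurement: str, nonce: bytes, key: bytes) -> bytes:
    """Device side: sign the measurement together with the verifier's nonce."""
    return hmac.new(key, nonce + measurement.encode(), hashlib.sha256).digest()


def verify(measurement: str, nonce: bytes, signature: bytes, key: bytes) -> bool:
    """Verifier side: check the signature first, then the measurement."""
    expected = hmac.new(key, nonce + measurement.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return False  # evidence was not produced with the expected key
    return measurement in REFERENCE_DIGESTS  # device is in a known state


nonce = secrets.token_bytes(16)  # fresh per request, so old evidence cannot be replayed
good = hashlib.sha256(b"firmware-v1").hexdigest()
tampered = hashlib.sha256(b"firmware-modified").hexdigest()
other_key = secrets.token_bytes(32)

print(verify(good, nonce, gpu_sign(good, nonce, DEVICE_KEY), DEVICE_KEY))          # True
print(verify(tampered, nonce, gpu_sign(tampered, nonce, DEVICE_KEY), DEVICE_KEY))  # False
print(verify(good, nonce, gpu_sign(good, nonce, other_key), DEVICE_KEY))           # False
```

The two failing cases correspond to the two threats named above: a correctly signed but unknown firmware digest models modified hardware, and a signature made with a different key models substituted hardware.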

Technology and deployment

The system uses a lightweight host-based agent that streams GPU telemetry to NVIDIA’s cloud service. The agent is open source and, as the announcement states, “draws on technology and IP from across the NVIDIA portfolio,” including GPUd, DCGM, and the Attestation SDK. Because the code is open, security teams can audit exactly what the agent collects and sends, which is often a precondition for approving deployment.

The agent is installed with standard Linux package managers or, in Kubernetes clusters, with a Helm chart on GPU worker nodes.

Who can use it and what does it cost?

The service is now generally available and free for owners of NVIDIA data center GPUs. Three architectures are supported: Vera Rubin, Blackwell, and Hopper — with full attestation limited to Vera Rubin and Blackwell (Hopper lacks the required firmware path). The consumer RTX line is not included.

In practice, this means hyperscalers and enterprise customers with thousands of GPUs gain single-pane-of-glass monitoring and hardware-signed integrity verification — at no additional license cost beyond the GPUs already purchased.
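
For capacity planning, the support matrix above can be encoded as a small lookup. The sketch below is illustrative only: the `SUPPORT` table restates what this article says, while the `attestable_nodes` helper and the inventory format are hypothetical.

```python
# Support matrix as stated in this article; the dictionary layout is illustrative.
SUPPORT = {
    "Vera Rubin": {"monitoring": True, "attestation": True},
    "Blackwell": {"monitoring": True, "attestation": True},
    "Hopper": {"monitoring": True, "attestation": False},
}
# Anything not listed (for example, consumer RTX cards) is treated as unsupported.
UNSUPPORTED = {"monitoring": False, "attestation": False}


def attestable_nodes(inventory):
    """Return the sorted node names whose architecture supports attestation."""
    return sorted(
        node
        for node, arch in inventory.items()
        if SUPPORT.get(arch, UNSUPPORTED)["attestation"]
    )


# Hypothetical inventory mapping node name to GPU architecture.
inventory = {
    "node-a": "Blackwell",
    "node-b": "Hopper",
    "node-c": "Vera Rubin",
    "node-d": "RTX",
}
print(attestable_nodes(inventory))  # ['node-a', 'node-c']
```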

Frequently Asked Questions

What is cryptographic GPU integrity verification?

The Fleet Intelligence agent collects runtime measurements — firmware digests, configurations, and states — that the GPU digitally signs with its hardware key. The signature is verified through the NVIDIA Remote Attestation Service (NRAS), proving that the GPU is authentic NVIDIA hardware in a known state — important for confidential computing scenarios.

Which GPU architectures are supported?

The service supports Vera Rubin, Blackwell, and Hopper data center GPUs. The attestation feature is limited to Vera Rubin and Blackwell (Hopper lacks the required firmware path). The consumer RTX GPU line is not supported.

How is the agent installed?

The agent is installed with standard Linux package managers or, for Kubernetes, with a Helm chart on GPU worker nodes. It is open source, draws on technology from GPUd, DCGM, and the Attestation SDK, and streams telemetry to NVIDIA's cloud service.
