vLLM on NVIDIA DGX Spark: guide to local inference

The vLLM team published a practical guide for running vLLM on the NVIDIA DGX Spark system based on the GB10 chip. The guide covers unified memory behavior, serving the NVFP4 Nemotron-3-Super model, Docker deployment, Prometheus metrics and local evaluation results on the new edge hardware.

On 1 June 2026, the vLLM team published a technical guide titled “vLLM on the DGX Spark: Architecture, Configuration, and Local Evaluation”. The guide describes how to run vLLM on the GB10-based NVIDIA DGX Spark and focuses on local inference with a unified CPU-GPU memory architecture. vLLM is an open-source engine for production serving of large language models.

How does unified memory work?

The DGX Spark uses a single 128 GB memory pool that is shared by the CPU and the GPU, and the model weights are held in that same pool. According to the guide, this unified memory model makes it possible to serve larger NVFP4 models locally, up to roughly 200 billion parameters depending on the architecture and configuration. NVFP4 is a 4-bit floating-point weight format that reduces a model’s memory footprint, so larger models fit into the available memory.
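The link between 4-bit weights and the roughly 200-billion-parameter ceiling can be checked with simple arithmetic. The sketch below is a back-of-the-envelope estimate, not a calculation from the guide: the function name is hypothetical, and it assumes exactly 4 bits per weight, ignoring quantization scale factors, any layers kept at higher precision, the KV cache and activations.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at `bits_per_weight` bits.
    Real checkpoints carry extra metadata, so treat this as a lower bound.
    """
    return num_params * bits_per_weight / 8 / 1e9


POOL_GB = 128  # size of the unified memory pool quoted in the guide

for billions in (120, 200):
    gb = weight_footprint_gb(billions * 1e9)
    share = gb / POOL_GB
    print(f"{billions}B parameters at 4 bits: ~{gb:.0f} GB ({share:.0%} of the pool)")
```

Under these assumptions a 120B model needs about 60 GB for its weights and a 200B model about 100 GB, which is why the upper end of the range leaves little headroom in a 128 GB pool.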

Which model serves as the example?

The guide notes that “100-130B MoE NVFP4 models with roughly 10-15B active parameters are a good choice” for this system. MoE (mixture-of-experts) means that only a subset of the parameters is active for each token, which reduces the compute needed per generated token even though all weights must still fit in memory. The concrete example is Nemotron-3-Super-120B-A12B-NVFP4.
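To make the ratio of total to active parameters concrete, the sketch below reads both sizes from the model name. It assumes the common naming convention in which "120B" is the total parameter count and "A12B" the active count; the guide does not spell this convention out, and the helper name is hypothetical.

```python
import re


def parse_moe_sizes(model_name: str) -> tuple[float, float]:
    """Return (total, active) parameter counts in billions.

    Assumption: the name contains a '<total>B-A<active>B' segment.
    """
    match = re.search(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", model_name)
    if match is None:
        raise ValueError(f"no '<total>B-A<active>B' segment in {model_name!r}")
    return float(match.group(1)), float(match.group(2))


total_b, active_b = parse_moe_sizes("Nemotron-3-Super-120B-A12B-NVFP4")
print(f"total: {total_b:.0f}B, active: {active_b:.0f}B, "
      f"active fraction: {active_b / total_b:.0%}")
```

With 12B of 120B parameters active, about a tenth of the weights take part in computing each token, which fits the 10-15B active range the guide recommends.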

Configuration and Docker deployment

To run vllm serve, the guide lists key flags: --gpu-memory-utilization 0.85 (the share of unified memory vLLM is allowed to occupy), --max-model-len 131072, --max-num-seqs 4 (the limit on concurrent requests) and --reasoning-parser nemotron_v3. The official Docker image vllm/vllm-openai:cu130-nightly exposes OpenAI-compatible endpoints at http://localhost:8000/v1, with Prometheus metrics at /metrics.
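The flags listed above can be assembled into a single command line. The sketch below only builds and prints the command; it does not start a server. The model identifier is a placeholder taken from the model name in this article (the full repository ID or local path is an assumption you would need to fill in), and the helper names are hypothetical.

```python
import shlex

# Placeholder: replace with the full repository ID or a local checkpoint path.
MODEL = "Nemotron-3-Super-120B-A12B-NVFP4"

UNIFIED_MEMORY_GB = 128        # pool size quoted in the guide
GPU_MEMORY_UTILIZATION = 0.85  # share of the pool vLLM may occupy


def build_serve_command(model: str = MODEL) -> list[str]:
    """Return the argv list for `vllm serve` with the flags from the guide."""
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(GPU_MEMORY_UTILIZATION),
        "--max-model-len", "131072",
        "--max-num-seqs", "4",
        "--reasoning-parser", "nemotron_v3",
    ]


def memory_budget_gb(total_gb: float = UNIFIED_MEMORY_GB,
                     utilization: float = GPU_MEMORY_UTILIZATION) -> float:
    """Memory vLLM is allowed to use, given the utilization share."""
    return total_gb * utilization


command = build_serve_command()
print(shlex.join(command))
print(f"vLLM memory budget: ~{memory_budget_gb():.1f} GB of {UNIFIED_MEMORY_GB} GB")
```

On a machine with vLLM installed, the returned list could be passed to `subprocess.run(command, check=True)`. Because the pool is shared with the CPU, the 0.85 share leaves roughly 19 GB for the operating system and other processes.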

What are the local evaluation results?

Evaluation on a single Spark showed a decode throughput of 22.7-23.7 tok/s across different scenarios, with TTFT (time to first token) ranging from 0.42 seconds for a short prompt to 3.85 seconds for a long prompt. The guide notes that warming up the JIT compiler resolves the initial cold-start latency (about 25 seconds), while KV-cache utilization with a single user typically stays below 5%.
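For readers reproducing such numbers, the two headline metrics can be derived from per-token arrival times. The sketch below uses one common definition (TTFT as the delay until the first token, decode throughput as tokens per second after the first token); the guide does not state exactly how its figures were measured, the function names are hypothetical, and the timestamps are synthetic values chosen for illustration, not measurements.

```python
def ttft_seconds(request_start: float, token_times: list[float]) -> float:
    """Time to first token: delay between sending the request and token 1."""
    if not token_times:
        raise ValueError("no tokens received")
    return token_times[0] - request_start


def decode_tokens_per_second(token_times: list[float]) -> float:
    """Decode throughput, excluding the wait for the first token."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    elapsed = token_times[-1] - token_times[0]
    return (len(token_times) - 1) / elapsed


# Synthetic example: first token after 0.5 s, then one token every 0.05 s.
start = 0.0
arrivals = [0.5 + 0.05 * i for i in range(11)]

print(f"TTFT: {ttft_seconds(start, arrivals):.2f} s")
print(f"decode: {decode_tokens_per_second(arrivals):.1f} tok/s")
```

In practice the arrival times would come from timestamps taken with `time.perf_counter()` as each chunk of a streamed response arrives. A warm-up request should be sent first so that the cold-start latency mentioned above does not distort the TTFT.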

Why the guide is useful

The DGX Spark belongs to a new wave of NVIDIA edge hardware, and a practical manual like this shows that serious NVFP4 models can be served locally, without large data center infrastructure. For development teams, that means a cheaper and more private path to production inference on their own device.

Frequently Asked Questions

What is vLLM?

vLLM is an open-source engine for production serving of large language models. It optimizes throughput and memory management and offers an OpenAI-compatible API for inference.

What is unified memory on the DGX Spark?

The DGX Spark has a single 128 GB memory pool that is shared by the CPU and the GPU, and the model weights are held in that same pool. This makes it possible to serve larger NVFP4 models locally without a separate pool of dedicated GPU memory.
