NVIDIA Nemotron 3 Ultra — 550B MoE open model

NVIDIA released Nemotron 3 Ultra, an open-weight Mixture-of-Experts model with 550 billion total parameters and 55B active per token. The model targets long agentic workflows with up to 30% lower cost compared to other leading open models. It is available in Ollama, and vLLM provided Day-0 serving support.

NVIDIA released on June 4, 2026, Nemotron 3 Ultra, an open-weight model of the Mixture-of-Experts (MoE) architecture with 550 billion total parameters and 55 billion active per token, optimized for NVFP4 (a 4-bit floating-point format). The model is intended for long agentic workflows, and its availability is confirmed by two sources: the Ollama library and vLLM with Day-0 serving support.

What is Nemotron 3 Ultra and how is it built?

Nemotron 3 Ultra is an MoE model, meaning only a subset of the network is activated per token — here 55 billion of the total 550 billion parameters. This architecture enables the capacity of a very large model at significantly lower inference cost, because the entire network is not active at every step. NVIDIA optimized the model for NVFP4, a 4-bit floating-point format that further reduces memory and compute requirements during serving.

What kinds of tasks is it intended for?

Nemotron 3 Ultra is explicitly built for long agentic workflows. This includes agent orchestration (coordinating multiple agents), coding agents, and deep-research tasks that span hundreds of tool calls — individual calls to external tools within a single task. For such scenarios, a large context window is key, which for Nemotron 3 Ultra is 256K tokens, with an announced extension to 1 million tokens.

What performance does NVIDIA report?

According to the documentation, Nemotron 3 Ultra leads in accuracy on agent productivity, instruction following, and long-context tasks. The key advantage NVIDIA emphasizes is economy: the model delivers up to 30% lower cost compared to other leading open models. The combination of the MoE architecture, the NVFP4 format, and selective parameter activation makes this saving possible without loss of capacity.

How do you run and serve the model?

For end users, the model is available in Ollama via the simple command ollama run nemotron-3-ultra:cloud. For production serving, vLLM provided Day-0 support — that is, support available the same day as the model itself. vLLM supports both BF16 and NVFP4 checkpoints, provides guidance for GPU configuration, and offers OpenAI-compatible APIs. In addition, integration with NeMo RL for fine-tuning is planned, allowing the model to be further adapted to specific agentic domains.

Why is the release important?

The release of Nemotron 3 Ultra is significant because NVIDIA combines a very large MoE model with an open-weight approach and simultaneous support from two leading ecosystems for local running (Ollama) and production serving (vLLM). The focus on agentic workflows, long context, and lower cost positions the model for organizations building complex, multi-step agentic systems without dependence on closed APIs.

Frequently Asked Questions

How many parameters does NVIDIA Nemotron 3 Ultra have?

Nemotron 3 Ultra is a Mixture-of-Experts (MoE) model with 550 billion total parameters, of which 55 billion are active per token. This MoE architecture activates only part of the network per token, which reduces inference cost while retaining the capacity of a large model.

What is Nemotron 3 Ultra optimized for?

The model is built for long agentic workflows — agent orchestration, coding agents, and deep-research tasks that span hundreds of tool calls. It is optimized for NVFP4, a 4-bit floating-point format, and has a 256K-token context window with an announced extension to 1M.

How do you run Nemotron 3 Ultra?

The model is available in Ollama via the command `ollama run nemotron-3-ultra:cloud`. For serving, vLLM provided Day-0 support with BF16 and NVFP4 checkpoints, OpenAI-compatible APIs, and integration with NeMo RL for fine-tuning.

NVIDIA: Nemotron 3 Ultra — a 550B open-weight MoE for long agentic workflows