NVIDIA Nemotron 3 Nano Omni: 30B-A3B MoE Multimodal Model with 9× Higher Throughput for AI Agents
NVIDIA introduced Nemotron 3 Nano Omni on April 28, 2026 — an open-source 30B-A3B hybrid mixture-of-experts model that unifies vision, audio, language, video, documents, and GUI screenshots in a single architecture with a 256K context window. Throughput is 9× higher than other open omni models at the same interactivity level, and the model leads six leaderboards for document, video, and audio understanding. It is available on Hugging Face, OpenRouter, build.nvidia.com, and 25+ partner platforms, with early adopters including Palantir, Foxconn, and Eka Care.
The model is an open-source 30B-A3B hybrid mixture-of-experts (MoE) design that unifies vision, audio, and language in a single architecture for AI agents, pairing a 256K context window with throughput 9× higher than other open omni models at the same level of interactivity. Its positioning is a direct challenge to recent multimodal releases from Mistral, Meta, and Alibaba.
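The "30B-A3B" label follows the usual MoE naming convention: roughly 30B total parameters, of which about 3B are active per token. A rough sketch of what that implies for per-token compute (the exact expert routing is not specified in the announcement):

```python
# Illustrative arithmetic for the "30B-A3B" naming convention:
# ~30B total parameters, ~3B active per token (the "A3B" part).
TOTAL_PARAMS_B = 30   # total parameters, billions (from the model name)
ACTIVE_PARAMS_B = 3   # active parameters per token, billions

# Fraction of the network exercised on each forward pass
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Per-token compute scales roughly with active parameters, so a dense
# 30B model would need about this many times more FLOPs per token:
dense_compute_multiple = TOTAL_PARAMS_B / ACTIVE_PARAMS_B

print(f"Active fraction per token: {active_fraction:.0%}")
print(f"Dense-equivalent compute multiple: {dense_compute_multiple:.0f}x")
```

This sparsity is one plausible ingredient behind the throughput figures NVIDIA cites, though the announcement does not break the speedup down by cause.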
What Modalities Does Nemotron Nano Omni Process?
The model accepts six input types in a single architecture: text, images, audio, video, documents with charts, and GUI screenshots. Video is handled by a Conv3D front end and an EVS (Efficient Video Sampling) module, and GUI navigation has been tested on 1920×1080 displays. The model is designed primarily for AI agents that combine UI observation, document reading, and user conversation in the same workflow.
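Since the model is listed on OpenRouter, a mixed-modality request would plausibly use the OpenAI-compatible chat schema with typed content parts. A minimal sketch, assuming such an endpoint; the model identifier and URLs are hypothetical:

```python
# Hypothetical request body for an OpenAI-compatible chat endpoint
# (e.g. via OpenRouter). Model ID and attachment values are illustrative.
request = {
    "model": "nvidia/nemotron-3-nano-omni",  # hypothetical identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarize this screenshot and the attached audio."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/gui-1920x1080.png"}},
                {"type": "input_audio",
                 "input_audio": {"data": "<base64-encoded-wav>", "format": "wav"}},
            ],
        }
    ],
}

# One message mixes text, a GUI screenshot, and audio, mirroring the
# agent workflows described above.
part_types = {part["type"] for part in request["messages"][0]["content"]}
print(part_types)
```

Video and document inputs would ride the same content-part mechanism; the exact part types a given provider accepts depend on its API surface.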
What Does 9× Higher Throughput Mean for Inference?
NVIDIA claims that Nano Omni achieves 9× more tokens per second than other open omni models at the same level of interactivity. In practice, this means agentic workflows previously bottlenecked by the latency of multimodal processing (for example, reading hundreds of document pages while clicking through a GUI) can now run in real time. The model currently leads in six categories on public leaderboards for document, video, and audio understanding, although NVIDIA does not cite specific benchmark numbers in the announcement.
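"At the same level of interactivity" means each user still sees the same per-stream token rate, so the 9× aggregate throughput translates into roughly 9× the concurrent users per deployment. A back-of-the-envelope sketch with assumed baseline numbers (none of these figures come from the announcement):

```python
# Illustrative sketch of "9x throughput at the same interactivity".
# All baseline numbers below are assumptions, not NVIDIA figures.
PER_USER_TOK_S = 20          # interactivity: tokens/sec each user sees (assumed)
BASELINE_AGG_TOK_S = 1_000   # aggregate tokens/sec of a baseline open omni model (assumed)
SPEEDUP = 9                  # NVIDIA's stated throughput multiple

# Holding per-user speed fixed, throughput gains become concurrency gains.
baseline_users = BASELINE_AGG_TOK_S // PER_USER_TOK_S
nano_omni_users = (BASELINE_AGG_TOK_S * SPEEDUP) // PER_USER_TOK_S

print(f"Concurrent users: {baseline_users} -> {nano_omni_users}")
```

The same framing works in reverse: at a fixed user count, the headroom can instead be spent on longer multimodal contexts or lower cost per request.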
Where Is It Available and Who Is Already Using It?
The model is available via Hugging Face, OpenRouter, NVIDIA’s build.nvidia.com portal, and more than 25 partner platforms. Active early users include Aible, Applied Scientific Intelligence, Eka Care, Foxconn, H Company, Palantir, and Pyler. Dell Technologies, Docusign, Infosys, Oracle, and Zefr are in the evaluation phase. The broad distribution and list of corporate users suggest that NVIDIA prepared the model for immediate enterprise deployment, not just research testing.
Frequently Asked Questions
- What is Nemotron 3 Nano Omni?
- An open-source 30B-A3B hybrid mixture-of-experts model that processes vision, audio, language, video, documents, charts, and GUI screenshots in a single architecture. It has a 256K-token context window and uses Conv3D and EVS (Efficient Video Sampling) for efficient video processing.
- How much faster is it than the competition?
- 9× higher throughput than other open omni models at the same level of interactivity. The model leads in six categories on leaderboards for document, video, and audio understanding.
- Who is already using it?
- Aible, Applied Scientific Intelligence, Eka Care, Foxconn, H Company, Palantir, and Pyler are actively using the model. Dell Technologies, Docusign, Infosys, Oracle, and Zefr are currently evaluating it for their own deployments.
This article was generated using artificial intelligence from primary sources.