arXiv:2606.02800: NVIDIA Cosmos 3 — omnimodal world model for physical AI
Cosmos 3 is NVIDIA's new omnimodal world model, described in an arXiv preprint. Within a single mixture-of-transformers architecture, it processes and generates language, images, video, audio and action sequences. The model targets embodied and physical AI and ships with open-source code, checkpoints, datasets and benchmarks.
This article was generated using artificial intelligence from primary sources.
On 1 June 2026, NVIDIA released an arXiv preprint titled “Cosmos 3: Omnimodal World Models for Physical AI”. The paper introduces Cosmos 3, a model that, within a single unified architecture, simultaneously processes and generates language, image, video, audio and action sequences. The goal is to build a foundational world model for physical AI — robots and embodied agents that operate in the real world. The author list names 294 contributors.
What does Cosmos 3 actually do?
Cosmos 3 brings into one framework what was previously separate: vision-language models, video generators, world simulators and action models. Instead of distinct systems for understanding and generation, a single model takes in and produces multiple modalities at once. It thereby covers both perception (understanding a scene) and prediction (how a scene will evolve after an action), which is essential for controlling a robot.
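The prediction side of a world model can be illustrated with a toy example. The sketch below is not Cosmos 3 code: the `State` class, the hand-written `predict` dynamics and the `best_action` planner are all hypothetical stand-ins, assuming a one-dimensional robot whose action is an acceleration. It only shows the general pattern of predicting how a state evolves after an action and using those predictions to choose what to do.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class State:
    position: float
    velocity: float


def predict(state: State, action: float, dt: float = 0.1) -> State:
    """Hypothetical dynamics standing in for a learned world model.

    The action is treated as an acceleration applied for dt seconds.
    """
    velocity = state.velocity + action * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def rollout(state: State, actions: Sequence[float]) -> List[State]:
    """Predict the sequence of states that follows a sequence of actions."""
    trajectory = [state]
    for action in actions:
        state = predict(state, action)
        trajectory.append(state)
    return trajectory


def best_action(state: State, goal: float,
                candidates: Sequence[float] = (-1.0, 0.0, 1.0),
                horizon: int = 5) -> float:
    """Pick the constant action whose predicted rollout ends closest to goal."""
    def cost(action: float) -> float:
        final = rollout(state, [action] * horizon)[-1]
        return abs(final.position - goal)
    return min(candidates, key=cost)
```

Calling `best_action(State(0.0, 0.0), goal=1.0)` simulates each candidate action over five steps and returns the one that moves the toy robot closest to the goal. A real world model would replace `predict` with a learned network operating on images, video or other sensor data.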
Mixture-of-transformers architecture
The system is built on a mixture-of-transformers design — an architecture in which multiple transformer components share a common framework and process different kinds of data, instead of a single monolithic model. According to the paper, this approach enables “highly flexible input-output configurations”: the model can take in text and an image and return video or an action sequence, depending on the task. The term omnimodal means that all five modalities — language, image, video, audio and actions — live inside the same model.
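The idea of several components sharing one framework while handling different kinds of data can be sketched in a few lines. The example below is an illustration under stated assumptions, not the architecture from the paper: it assumes tokens are tagged with a modality, that one mixing step is shared across the whole sequence, and that each modality then passes through its own parameters. The names `EXPERTS`, `shared_mix` and `layer` are hypothetical, and a scalar multiply stands in for a full transformer block.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Token = Tuple[str, Vector]  # (modality tag, feature vector)

MODALITIES = ("language", "image", "video", "audio", "action")


def make_scaler(scale: float) -> Callable[[Vector], Vector]:
    """Build a toy per-modality component that scales every feature."""
    return lambda vec: [scale * v for v in vec]


# Hypothetical modality-specific components: language scales by 1,
# image by 2, and so on up to action by 5.
EXPERTS: Dict[str, Callable[[Vector], Vector]] = {
    name: make_scaler(float(i + 1)) for i, name in enumerate(MODALITIES)
}


def shared_mix(tokens: List[Token]) -> List[Token]:
    """Stand-in for a shared step: add the sequence-wide mean to each token."""
    dim = len(tokens[0][1])
    mean = [sum(vec[d] for _, vec in tokens) / len(tokens) for d in range(dim)]
    return [(m, [v + mu for v, mu in zip(vec, mean)]) for m, vec in tokens]


def layer(tokens: List[Token]) -> List[Token]:
    """Mix all tokens together, then route each one to its modality's component."""
    mixed = shared_mix(tokens)
    return [(m, EXPERTS[m](vec)) for m, vec in mixed]
```

Passing a mixed sequence such as `[("language", [1.0, 0.0]), ("action", [0.0, 1.0])]` through `layer` lets every token see information from the whole sequence, while language and action tokens are transformed by different parameters. The same sequence format can carry any combination of the five modalities, which is one way to picture the flexible input-output configurations the paper describes.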
How good is it compared to the competition?
According to the preprint, Cosmos 3 achieves the best results (state-of-the-art) on a diverse set of understanding and generation tasks. The authors state that, at the time of writing, Artificial Analysis ranked Cosmos 3 as the best open-source Text-to-Image and Image-to-Video model, while RoboArena rated it the best policy model — that is, the model that decides which actions a robot takes. These claims are from the paper itself and refer to the leaderboards cited at the time of publication.
Open-source package
Alongside the paper, NVIDIA is releasing the entire package: code and model checkpoints, curated synthetic datasets and an evaluation benchmark. The materials are published under the Linux Foundation’s OpenMDW-1.1 license. The repository is on GitHub (github.com/nvidia/cosmos), and the model hub is on Hugging Face. The first version of the preprint was posted on 1 June 2026, and a revised version followed on 5 June 2026.
Why it matters
By releasing the complete package (code, checkpoints, data and benchmarks), NVIDIA lowers the barrier to physical-AI research beyond large labs. World models that understand and generate multiple modalities equally well are considered a key ingredient for scalable robotics and embodied agents. The true performance of Cosmos 3 will become clear once the community starts testing it on its own hardware and tasks.
Frequently Asked Questions
- What is a world model?
- A world model is an AI system that learns an internal representation of how the world behaves, so it can predict and simulate the consequences of actions. It is used for robotics and embodied agents that operate in a physical environment.
- Is Cosmos 3 available as open source?
- Yes. NVIDIA released the code and checkpoints under the Linux Foundation's OpenMDW-1.1 license, together with synthetic datasets and an evaluation benchmark, on GitHub and Hugging Face.