arXiv:2605.30280: Qwen-VLA unifies vision, language and action for diverse robots
Qwen-VLA is a unified embodied foundation model from the Qwen team that integrates vision, language and action for robotic tasks such as manipulation and navigation across different robot platforms. The paper, which lists 40 authors including Junyang Lin and Jingren Zhou, reports 97.9% on the LIBERO benchmark and strong generalization to new environments and embodiments.
This article was generated using artificial intelligence from primary sources.
The Qwen team has published the paper "Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments", which presents a unified embodied foundation model for robotics. The paper lists 40 authors, including first author Qiuyue Wang and prominent Qwen team members Junyang Lin, Jingren Zhou and Shuai Bai.
What is Qwen-VLA and how is it built?
Qwen-VLA is an embodied foundation model, that is, a model for AI that acts through a physical robot. It integrates vision, language understanding and action generation (Vision-Language-Action, VLA). The model extends the existing Qwen vision-language stack and addresses fragmentation in robotics by unifying manipulation and navigation capabilities in a single system.
The architecture pairs perception and reasoning with a DiT-based action decoder (DiT stands for Diffusion Transformer) that generates continuous actions and trajectories. The model was trained on diverse sources: robotic manipulation data, human demonstrations, simulation data and navigation datasets.
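To make the idea of a diffusion-style action decoder more concrete, the following is a minimal sketch of iterative denoising over a short chunk of continuous actions. It is not Qwen-VLA's decoder: `toy_denoiser`, `sample_action_chunk`, the noise schedule `sigmas` and the straight-line target trajectory are all hypothetical stand-ins, and a real DiT would be a trained transformer conditioned on vision-language features. The update is a deterministic DDIM-style step, assuming noisy actions are modeled as a clean chunk plus `sigma` times Gaussian noise.

```python
import random


def toy_denoiser(noisy_actions, goal, sigma):
    """Hypothetical stand-in for a learned network that predicts the clean chunk.

    This toy version only uses the chunk length and the goal; a real decoder
    would also use the noisy values, the noise level and visual context.
    """
    horizon = len(noisy_actions)
    # Straight-line trajectory that ends exactly at `goal`.
    return [[g * (step + 1) / horizon for g in goal] for step in range(horizon)]


def sample_action_chunk(goal, horizon=4, sigmas=(1.0, 0.5, 0.25, 0.0), seed=0):
    """Denoise a random chunk of shape (horizon, len(goal)) step by step."""
    rng = random.Random(seed)
    # Start from pure noise at the largest noise level.
    actions = [[rng.gauss(0.0, sigmas[0]) for _ in goal] for _ in range(horizon)]
    for sigma, next_sigma in zip(sigmas, sigmas[1:]):
        clean = toy_denoiser(actions, goal, sigma)
        # Estimate the noise direction, then re-add it at the lower noise level.
        actions = [
            [c + next_sigma * (a - c) / sigma for a, c in zip(a_row, c_row)]
            for a_row, c_row in zip(actions, clean)
        ]
    return actions


if __name__ == "__main__":
    for row in sample_action_chunk([0.4, -0.2]):
        print(row)
```

Because the last noise level in this sketch is zero, the final chunk equals the denoiser's prediction; with a learned denoiser the intermediate steps would progressively refine that prediction instead of leaving it fixed.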
How does Qwen-VLA work across different robots?
The key mechanism is embodiment-aware prompt conditioning: text descriptions specific to an individual robot define the current embodiment (the robot’s physical body). This lets the same model control different robot platforms without separate training for each.
Embodiment in robotics denotes a concrete physical configuration (the number of joints, the type of gripper, the dimensions) that differs from robot to robot. Generalization to new embodiments is one of the hardest problems in the field.
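As a rough illustration of embodiment-aware prompt conditioning, the sketch below turns a robot description into text that is prepended to the task instruction. The `Embodiment` fields, the `build_prompt` helper, the robot names and the prompt template are all assumptions made for this example; the article does not give the prompt format Qwen-VLA actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of one robot body."""
    name: str
    num_joints: int
    gripper: str


def build_prompt(embodiment, instruction):
    """Prefix the instruction with a text description of the robot (assumed template)."""
    description = (
        f"Robot: {embodiment.name}. "
        f"Joints: {embodiment.num_joints}. "
        f"Gripper: {embodiment.gripper}."
    )
    return f"{description}\nTask: {instruction}"


if __name__ == "__main__":
    # Two made-up robots receive the same instruction with different conditioning text.
    arm_a = Embodiment(name="arm_a", num_joints=6, gripper="parallel-jaw")
    arm_b = Embodiment(name="arm_b", num_joints=7, gripper="suction")
    for robot in (arm_a, arm_b):
        print(build_prompt(robot, "pick up the red block"))
```

The point of the sketch is that only the text prefix changes between robots, while the model and the instruction stay the same.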
What results does Qwen-VLA achieve?
The model shows strong results on several benchmarks:
- 97.9% on the LIBERO manipulation benchmark
- 73.7% on Simpler-WidowX
- 86.1% / 87.2% on RoboTwin tasks
- 76.9% average success in real ALOHA experiments
- 26.6% zero-shot success on DOMINO dynamic manipulation
The paper highlights “consistent multi-task performance and out-of-distribution generalization” across variations of scenes and robot morphologies. The zero-shot result on the DOMINO benchmark (success without prior training on the specific task) suggests that the model can transfer some of what it learned to entirely new situations.
Why is Qwen-VLA important for robotics?
By unifying vision, language and action across tasks, environments and robot bodies, Qwen-VLA moves toward a general robotic model that does not have to be retrained for each platform. Strong generalization to new environments and embodiments could reduce the cost of deploying robots in the real world, which makes the model a notable step in the development of embodied AI systems.
Frequently Asked Questions
- What is Qwen-VLA?
- Qwen-VLA is a unified embodied foundation model that extends the Qwen vision-language stack by integrating vision, language understanding and action generation. It covers manipulation and navigation across different robot platforms, using a DiT-based action decoder for continuous actions and trajectories.
- What results does Qwen-VLA achieve?
- It achieves 97.9% on the LIBERO manipulation benchmark, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin tasks, 76.9% average success in real ALOHA experiments and 26.6% zero-shot success on DOMINO dynamic manipulation.
- How does Qwen-VLA support different robots?
- It uses embodiment-aware prompt conditioning, where text descriptions specific to an individual robot define the current embodiment. This lets the model operate across multiple robot platforms and generalize to new morphologies.