Allen Institute: MolmoAct 2 is the first open-source robotics foundation model to outperform GPT-5 and Gemini 2.5 Pro
MolmoAct 2 is an open-source robotics foundation model released on May 5 by the Allen Institute for AI. The model scores 63.8/100 on embodied-reasoning benchmarks, outperforming GPT-5 and Gemini 2.5 Pro, runs inference 37× faster than its predecessor, and is the first base model with built-in bimanual capabilities.
This article was generated using artificial intelligence from primary sources.
The Allen Institute for AI (AI2) released MolmoAct 2 on May 5, 2026 — the first open-source robotics foundation model to outperform closed systems such as those from Physical Intelligence, as well as the frontier models GPT-5 and Gemini 2.5 Pro, on embodied-reasoning benchmarks.
A robotics foundation model is a large base model trained on a combination of visual and action data, enabling a robot to execute diverse physical tasks from natural language without task-specific training.
What are the three key changes in MolmoAct 2?
The first change is raw performance: the model achieves 63.8/100 on embodied-reasoning benchmarks, placing it ahead of GPT-5 and Gemini 2.5 Pro. The second is a dramatic speedup: optimizing the KV-cache bridge between the vision model and the action expert accelerates inference 37×, from 6.7 seconds to 180 milliseconds per action. The third is built-in bimanuality — coordinated two-arm manipulation without per-task fine-tuning — making MolmoAct 2 the first base model of its kind.
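The reported speedup figures are internally consistent; a quick back-of-the-envelope check (the numbers come from the article, the variable names are illustrative):

```python
# Verify the claimed 37x speedup from the reported per-action latencies.
before_s = 6.7    # seconds per action before the KV-cache optimization
after_s = 0.180   # seconds per action after (180 ms)

speedup = before_s / after_s
print(f"{speedup:.1f}x")  # -> 37.2x, matching the rounded 37x figure
```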
The model is built on the Molmo 2-ER base trained on approximately 3 million additional embodied-reasoning examples.
What do the benchmark results look like in practice?
On the LIBERO benchmark, a standard academic test for robot learning, MolmoAct 2 achieves a 97.2% success rate. On real-world tasks with a Franka robot arm, the success rate is 87.1%, while on the new MolmoBot benchmark (a set of household tasks) it achieves 20.6% — twice the score of the second-place model.
The gap between LIBERO and MolmoBot shows how challenging real messy household conditions remain: even a model solving 97% of academic tasks succeeds in only one-fifth of real household scenarios.
What does AI2 release alongside the model?
In addition to model weights, AI2 releases the YAM Dataset with over 720 hours of bimanual demonstrations — 30 times more than the original MolmoAct dataset — as well as complete training code and a reference hardware setup that other labs can replicate.
All artifacts — weights, dataset, code, and hardware specifications — are publicly available. This makes MolmoAct 2 the first serious open answer to closed robotics foundation models, giving researchers, universities, and smaller companies a foundation to build their own applications without licensing restrictions.
Frequently Asked Questions
- What is a robotics foundation model?
- A robotics foundation model is a large base model trained on visual and action data that enables robots to perform tasks from natural language instructions without fine-tuning for each new task.
- What are bimanual capabilities in robotics?
- Bimanual capabilities mean the robot coordinates two arms on a single task — for example, holding a container with one arm while pouring with the other. MolmoAct 2 is the first base model to do this without per-task training.
- What is the YAM Dataset?
- The YAM Dataset is a new public collection of over 720 hours of bimanual robot demonstrations released by AI2 alongside the model — 30 times more demonstrations than the original MolmoAct dataset.