🔴 📦 Open Source · Wednesday, May 6, 2026 · 2 min read

Allen Institute: MolmoAct 2 is the first open-source robotics foundation model to outperform GPT-5 and Gemini 2.5 Pro

Editorial illustration: dual-arm Franka robot with an open box in a laboratory, symbolizing the open-source MolmoAct 2 foundation model

MolmoAct 2 is an open-source robotics foundation model released on May 5 by the Allen Institute for AI. The model scores 63.8/100 on embodied-reasoning benchmarks, outperforms GPT-5 and Gemini 2.5 Pro, accelerates inference 37× (from 6.7 seconds to 180 milliseconds per action), and is the first base model with built-in bimanual capabilities.

🤖

This article was generated using artificial intelligence from primary sources.

Allen Institute for AI (AI2) released MolmoAct 2 on May 5, 2026 — the first open-source robotics foundation model to outperform closed systems from labs such as Physical Intelligence, as well as the frontier models GPT-5 and Gemini 2.5 Pro, on embodied-reasoning benchmarks.

A robotics foundation model is a large base model trained on a combination of visual and action data, enabling a robot to execute diverse physical tasks from natural language without task-specific training.

What are the three key changes in MolmoAct 2?

The first change is raw performance: the model achieves 63.8/100 on embodied-reasoning benchmarks, placing it ahead of GPT-5 and Gemini 2.5 Pro. The second is a dramatic speedup — by optimizing the KV-cache bridge between the vision model and action expert, inference is accelerated 37×, from 6.7 seconds to 180 milliseconds per action. The third is built-in bimanuality — coordinated two-arm manipulation without per-task fine-tuning, making MolmoAct 2 the first base model of its kind.
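The reported latency numbers are easy to sanity-check. A quick back-of-the-envelope in Python, using only the figures quoted above, converts them into control rates and recovers the 37× speedup:

```python
# Back-of-the-envelope: what the reported latencies mean for control rate.
old_latency_s = 6.7    # seconds per action before the KV-cache optimization
new_latency_s = 0.180  # seconds per action after it

old_rate_hz = 1 / old_latency_s  # ≈ 0.15 actions per second
new_rate_hz = 1 / new_latency_s  # ≈ 5.6 actions per second
speedup = old_latency_s / new_latency_s

print(f"{old_rate_hz:.2f} Hz -> {new_rate_hz:.2f} Hz (speedup ≈ {speedup:.0f}×)")
```

At 180 ms per action the model can issue roughly five and a half actions per second, fast enough for closed-loop manipulation, whereas 6.7 s per action is far below any practical control rate.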

The model is built on the Molmo 2-ER base, trained on approximately 3 million additional embodied-reasoning examples.
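The release does not detail the KV-cache bridge itself, but the general technique can be sketched in a few lines: encode the visual observation once, cache the resulting key/value features, and reuse them across action-decoding steps instead of re-running the vision model each time. The following is a toy illustration of that idea, not MolmoAct 2's actual code; every name in it is made up.

```python
class ToyActionDecoder:
    """Toy illustration (not MolmoAct 2's code) of a KV-cache bridge:
    the vision encoder runs once per observation, and its key/value
    features are cached and reused by every action-decoding step."""

    def __init__(self):
        self.encoder_calls = 0
        self._kv_cache = None

    def _encode_observation(self, obs):
        # Stand-in for the expensive vision-model forward pass.
        self.encoder_calls += 1
        return [hash((obs, i)) % 1000 for i in range(4)]  # fake K/V features

    def decode_action(self, obs, step, use_cache=True):
        if use_cache:
            if self._kv_cache is None:
                self._kv_cache = self._encode_observation(obs)
            kv = self._kv_cache
        else:
            kv = self._encode_observation(obs)  # re-encodes on every step
        # Stand-in for the lightweight action-expert step.
        return sum(kv) + step

# Without the cache: one encoder pass per action step.
naive = ToyActionDecoder()
for step in range(8):
    naive.decode_action("frame_0", step, use_cache=False)

# With the cache: a single encoder pass shared by all eight steps.
cached = ToyActionDecoder()
for step in range(8):
    cached.decode_action("frame_0", step, use_cache=True)

print(naive.encoder_calls, cached.encoder_calls)  # 8 vs 1
```

In the toy version the saving is simply "one encoder pass instead of eight"; in a real vision-language-action model, where the vision backbone dominates the per-step cost, eliminating the repeated passes is what makes order-of-magnitude latency drops plausible.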

What do the benchmark results look like in practice?

On the LIBERO benchmark, a standard academic test for robot learning, MolmoAct 2 achieves a 97.2% success rate. On real-world tasks with a Franka robot arm, the success rate is 87.1%, while on the new MolmoBot household benchmark (a suite of household tasks) it reaches 20.6% — twice the score of the second-place model.

The gap between LIBERO and MolmoBot shows how challenging messy real-world household conditions remain: even a model that solves 97% of academic tasks succeeds in only one fifth of real household scenarios.

What does AI2 release alongside the model?

In addition to model weights, AI2 releases the YAM Dataset with over 720 hours of bimanual demonstrations — 30 times more than the original MolmoAct dataset — as well as complete training code and a reference hardware setup that other labs can replicate.

All artifacts — weights, dataset, code, and hardware specifications — are publicly available. This makes MolmoAct 2 the first serious open answer to closed robotics foundation models, giving researchers, universities, and smaller companies a foundation to build their own applications without licensing restrictions.

Frequently Asked Questions

What is a robotics foundation model?
A robotics foundation model is a large base model trained on visual and action data that enables robots to perform tasks from natural language instructions without fine-tuning for each new task.
What are bimanual capabilities in robotics?
Bimanual capabilities mean the robot coordinates two arms on a single task — for example, holding a container with one arm while pouring with the other. MolmoAct 2 is the first base model to do this without per-task training.
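As a concrete, and entirely hypothetical, illustration of the hold-and-pour example: at its simplest, two-arm coordination means emitting paired commands on every control tick, one for each arm, so neither arm ever acts out of sync. This sketch is not MolmoAct 2's planner; all names and poses are invented.

```python
def plan_bimanual(hold_pose, pour_waypoints):
    """Toy sketch (not MolmoAct 2's planner): pair a fixed 'hold' pose
    for one arm with each waypoint of the other arm's trajectory, so
    both arms receive a command on every control tick."""
    return [(hold_pose, wp) for wp in pour_waypoints]

# Left arm holds the container steady; right arm tilts through a pour arc.
schedule = plan_bimanual(
    hold_pose=(0.40, 0.00, 0.30),
    pour_waypoints=[(0.40, 0.20, 0.50), (0.40, 0.20, 0.45), (0.40, 0.20, 0.40)],
)
for left_cmd, right_cmd in schedule:
    pass  # in a real system: send (left_cmd, right_cmd) to both arms this tick

print(len(schedule))  # 3 ticks, one paired command per tick
```

What makes built-in bimanuality hard in practice is that the two trajectories are coupled, so spilling, collisions, and force exchange all depend on both arms at once, which is why a base model handling it without per-task training is notable.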
What is the YAM Dataset?
The YAM Dataset is a new public collection of over 720 hours of bimanual robot demonstrations released by AI2 alongside the model — 30 times more demonstrations than the original MolmoAct dataset.