arXiv: vla.cpp runs a VLA model in 1.3 GiB of memory

A new paper presents vla.cpp, a C++ inference engine for running Vision-Language-Action policies on resource-constrained robotic hardware. The engine reaches state-of-the-art (SOTA) performance on the LIBERO-Object benchmark and runs BitVLA in just 1.3 GiB of memory.

A paper posted to arXiv on 6 June 2026 (identifier arXiv:2606.08094, version v1) presents vla.cpp, a C++ inference engine for running Vision-Language-Action models on modest robotic hardware. The open-source project aims to remove the dependence on powerful graphics cards.

What is vla.cpp and what is it for?

vla.cpp is a C++ inference engine for running Vision-Language-Action (VLA) policies. VLA models link visual input, language instructions and actions, so they let a robot carry out a task based on what it sees and what it is told to do.

The key intention is to run these policies on resource-constrained robotic hardware instead of on workstation GPUs. This brings VLA models closer to real robots, which as a rule do not have a powerful graphics card on board.

How does vla.cpp perform on the benchmark?

According to the paper, the engine reaches state-of-the-art (SOTA) performance on the LIBERO-Object benchmark, and does so within a single episode. In other words, fitting onto weaker hardware does not come at the cost of task-execution quality.

Particularly notable is that the BitVLA model runs at full success rate in just 1.3 GiB of memory. Such a small memory footprint makes the engine usable on devices that would otherwise be too small for modern VLA models.

How many architectures does vla.cpp support?

The engine supports 7 architectures across 5 backbone model families (foundational networks) and 4 action-head types (modules that turn a representation into an action). It does all this through a unified protocol, which makes switching from one model to another easy without major changes.

This generality matters for researchers and engineers who want to try different VLA models on the same device. Instead of separate implementations for each architecture, vla.cpp offers a single shared execution layer.
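To make the idea of one shared execution layer concrete, here is a minimal C++ sketch of how any backbone could be paired with any action head behind a single call. Every name in it (Observation, Backbone, ActionHead, Policy, MeanBackbone, IdentityHead) is hypothetical and chosen for illustration; the article does not describe vla.cpp's actual interface, so this should not be read as its API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// NOTE: all types below are hypothetical illustrations, not vla.cpp's API.
struct Observation {
    std::vector<float> image;   // flattened camera pixels
    std::string instruction;    // natural-language task description
};

using Action = std::vector<float>;  // e.g. joint or end-effector deltas

// Backbone: fuses image and instruction into a feature vector.
struct Backbone {
    virtual ~Backbone() = default;
    virtual std::vector<float> encode(const Observation& obs) = 0;
};

// Action head: turns that feature vector into a robot action.
struct ActionHead {
    virtual ~ActionHead() = default;
    virtual Action decode(const std::vector<float>& features) = 0;
};

// A policy pairs one backbone with one action head; callers only see act().
class Policy {
public:
    Policy(std::unique_ptr<Backbone> backbone, std::unique_ptr<ActionHead> head)
        : backbone_(std::move(backbone)), head_(std::move(head)) {
        assert(backbone_ && head_);
    }

    Action act(const Observation& obs) {
        return head_->decode(backbone_->encode(obs));
    }

private:
    std::unique_ptr<Backbone> backbone_;
    std::unique_ptr<ActionHead> head_;
};

// Toy stand-ins so the sketch compiles and runs without model weights.
struct MeanBackbone : Backbone {
    std::vector<float> encode(const Observation& obs) override {
        float sum = 0.0f;
        for (float p : obs.image) sum += p;
        const float mean =
            obs.image.empty() ? 0.0f : sum / static_cast<float>(obs.image.size());
        return {mean, static_cast<float>(obs.instruction.size())};
    }
};

struct IdentityHead : ActionHead {
    Action decode(const std::vector<float>& features) override {
        return features;
    }
};
```

With this shape, swapping models means constructing a Policy from a different backbone or head, for example `Policy p(std::make_unique<MeanBackbone>(), std::make_unique<IdentityHead>());`, while the calling code keeps using `p.act(obs)` unchanged.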

How was latency reduced?

To speed up operation, the authors introduced a custom GEMM optimization. GEMM (General Matrix Multiply) is the matrix multiplication operation that forms the core of neural networks, so its optimization directly affects speed.

That tailored optimization cuts the latency of the BitVLA model by a factor of 4.5. Lower latency means faster robot reactions, which is crucial for tasks where timely action matters.
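For readers unfamiliar with the operation, GEMM computes C = A · B for an M×K matrix A and a K×N matrix B. The sketch below is a plain reference version in C++, with a function name (gemm_naive) and a row-major layout that are assumptions made for illustration. It is not the custom kernel from the paper, whose details the article does not give; it only shows the loop nest that such optimizations target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM: C = A * B, all matrices stored row-major.
// A is M x K, B is K x N, the returned C is M x N.
// Illustrative baseline only, not the paper's optimized kernel.
std::vector<float> gemm_naive(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K);
    assert(B.size() == K * N);

    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        // The i-k-j loop order reads B and writes C along contiguous rows,
        // which is typically friendlier to the cache than i-j-k.
        for (std::size_t k = 0; k < K; ++k) {
            const float a = A[i * K + k];
            for (std::size_t j = 0; j < N; ++j) {
                C[i * N + j] += a * B[k * N + j];
            }
        }
    }
    return C;
}
```

The three nested loops perform M·K·N multiply-adds, which is why this one routine tends to dominate inference time and why a faster kernel translates directly into lower latency.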

On what hardware was it tested?

The engine was tested across three hardware tiers, from a consumer GPU to an embedded module with 8 GB of memory. That range covers both development and embedded environments such as those found in real robots.

The paper thereby shows that VLA models can be run not only in the laboratory, but also on embedded equipment with limited resources. This is an important step toward robots that reason locally, without relying on external, powerful servers.
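The article quotes the BitVLA footprint in GiB (binary, 2^30 bytes) and the embedded module's memory in GB. The small C++ sketch below puts the two figures on the same scale. It assumes the 8 GB figure is decimal (10^9 bytes) and ignores whatever memory the operating system and other processes use; the function names are illustrative, and the article does not say on which tier the 1.3 GiB measurement was taken.

```cpp
#include <cassert>

constexpr double kBytesPerGiB = 1024.0 * 1024.0 * 1024.0;  // 2^30, binary unit
constexpr double kBytesPerGB = 1e9;                        // 10^9, decimal unit (assumed)

// Convert a size in GiB to bytes.
inline double gib_to_bytes(double gib) { return gib * kBytesPerGiB; }

// Convert a size in bytes to decimal GB.
inline double bytes_to_gb(double bytes) { return bytes / kBytesPerGB; }

// Fraction of a device's memory (given in decimal GB) that a footprint
// (given in GiB) would occupy.
inline double fraction_used(double footprint_gib, double device_gb) {
    assert(device_gb > 0.0);
    return gib_to_bytes(footprint_gib) / (device_gb * kBytesPerGB);
}
```

Under those assumptions, `bytes_to_gb(gib_to_bytes(1.3))` comes to roughly 1.4 GB, and `fraction_used(1.3, 8.0)` to a little over 17% of the module's memory.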

Frequently Asked Questions

What is vla.cpp?

vla.cpp is a C++ inference engine for running Vision-Language-Action (VLA) policies on resource-constrained robotic hardware instead of on powerful workstation GPUs. Its goal is to bring VLA models to low-memory devices.

How much memory is required?

The engine runs the BitVLA model at full success rate with just 1.3 GiB of memory. It was tested across three hardware tiers, from a consumer GPU to an embedded module with 8 GB of memory, which makes it applicable on very modest equipment.

How many architectures does it support?

vla.cpp supports 7 architectures across 5 backbone model families and 4 action-head types, all through a unified protocol. A custom GEMM optimization further cuts the latency of the BitVLA model by a factor of 4.5.
