Ollama 0.30: llama.cpp, GGUF, and Up to 20% Faster Inference

Ollama 0.30 brings integration with llama.cpp for better performance and GGUF model compatibility, with up to 20% faster throughput on NVIDIA GPUs. It expands hardware support with Vulkan on AMD and Intel devices and adds support for tool-calling. It complements the existing MLX engine for Apple silicon.

Ollama published the Ollama 0.30 release on its blog on June 5, 2026, which puts in the foreground integration with llama.cpp, support for the GGUF format, and significantly better performance. This is a notable step for the popular tool for running language models locally, which with this release expands both speed and the range of supported hardware.

What does integration with llama.cpp and GGUF bring?

The central innovation of the release is integration with llama.cpp, a widely used open-source project for running (inference of) language models. This integration brings better performance and, equally important, GGUF compatibility for models.

GGUF is a file format for storing quantized (compressed) models, very widespread in the community. With its support, Ollama 0.30 enables users to easily run a large number of models that already exist in that format, significantly expanding the catalog of available models.

How much faster is Ollama 0.30?

The performance gains are concrete. Thanks to the new integration, Ollama 0.30 achieves up to 20% faster throughput on NVIDIA GPUs. For users who run models locally on such hardware, this means noticeably faster responses and better utilization of graphics cards.

Speed is not the only improvement on the hardware side. The release expands hardware support by enabling Vulkan — a graphics and compute API — on AMD and Intel devices. This extends accelerated execution beyond the NVIDIA ecosystem to a wider range of computers.

Which new model families are supported?

Ollama 0.30 adds compatibility with several new model families. Among them are LFM, Prism, and Unsloth fine-tuned models available from Hugging Face, the largest platform for sharing models.

This expansion directly follows on from GGUF support: since a large part of the community of fine-tuned models is published precisely in that format, users get a simple path to a diverse selection of customized models without additional conversions.

What about Apple silicon and tool-calling?

For users of Apple hardware, the release complements the existing MLX engine for Apple silicon. With this, Ollama does not replace but rather expands availability across diverse hardware — from Apple chips through NVIDIA GPUs to AMD and Intel devices.

The final highlighted innovation is support for tool-calling, which allows the model to call external functions during operation. This opens integration with coding agents and assistants directly from the command line, so locally run models can perform more complex tool-assisted tasks — for example, retrieving data, running scripts, or working with local tools without sending queries to the cloud.

All of the above makes Ollama 0.30 a well-rounded release: integration with llama.cpp and GGUF support broaden the catalog of models, up to 20% faster throughput and Vulkan speed up execution across more types of hardware, and tool-calling opens up more capable, agent-assisted scenarios. By combining faster execution, broader hardware support, and tool-calling, Ollama 0.30 makes local AI both faster and more capable, while retaining the privacy advantage that comes from running models on your own computer.

Frequently Asked Questions

What is GGUF and why does its support matter?

GGUF is a file format for storing quantized language models, widely used in the open-source community. Support for GGUF in Ollama 0.30 means users can more easily run a large number of models available in that format, including many fine-tuned models from Hugging Face.

How much faster is Ollama 0.30?

Thanks to integration with llama.cpp, Ollama 0.30 achieves up to 20 percent faster throughput on NVIDIA GPUs. In addition, by enabling Vulkan it expands hardware support to AMD and Intel devices, speeding up work across a wider range of computers.

What does tool-calling support bring?

Tool-calling allows the model to call external functions and tools while generating a response. In Ollama 0.30 this opens direct integration with coding agents and command-line assistants, so locally run models can perform more complex, tool-assisted tasks.

Ollama 0.30: llama.cpp Integration, GGUF Support, and Up to 20% Faster Inference

What does integration with llama.cpp and GGUF bring?

How much faster is Ollama 0.30?

Which new model families are supported?

What about Apple silicon and tool-calling?

Frequently Asked Questions

Sources

Related news