PyTorch ExecuTorch MLX: 3–6× faster on Apple Silicon

The PyTorch team has released the experimental ExecuTorch MLX Delegate, a backend that uses the Apple MLX framework and Metal GPU kernels to deliver 3 to 6 times higher throughput on Apple Silicon chips. It supports Llama 3.2, Qwen 3, Phi-4 mini, Whisper, and real-time streaming transcription with Voxtral.

The new backend accelerates PyTorch models on macOS using the Apple MLX framework and optimized Metal GPU kernels. The result is 3 to 6 times higher throughput on generative AI workloads compared with existing ExecuTorch delegates on macOS.

How does the ExecuTorch MLX Delegate work?

ExecuTorch is PyTorch's runtime for on-device inference. A model is first exported with torch.export and then lowered to a .pte file that the runtime can execute. The MLX Delegate adds a step to the lowering stage: MLXPartitioner analyzes the exported graph and delegates compatible subgraphs to Apple MLX, which runs them on the Apple Silicon GPU.

The workflow has three steps:

Model export with torch.export
Lowering with to_edge_transform_and_lower using MLXPartitioner
Running the .pte file through the ExecuTorch runtime
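
The first two steps can be sketched in Python as follows. This is a minimal sketch, not the official recipe: it assumes torch and executorch are installed, and it takes the partitioner as an argument because the article does not state the import path or constructor arguments of MLXPartitioner. The helper name lower_to_pte and the default output path are hypothetical.

```python
def lower_to_pte(model, example_inputs, partitioner, out_path="model.pte"):
    """Export `model`, lower it with `partitioner`, and write a .pte file.

    `example_inputs` is a tuple of example tensors, as torch.export expects.
    `partitioner` is a partitioner instance supplied by the caller
    (for this article, an MLXPartitioner; its import path is not shown here).
    """
    # Imported lazily so the helper can be defined without the heavy
    # dependencies being installed.
    import torch
    from executorch.exir import to_edge_transform_and_lower

    # Step 1: capture the model as an ExportedProgram.
    exported = torch.export.export(model.eval(), example_inputs)

    # Step 2: lower the program, handing compatible subgraphs to the delegate.
    edge = to_edge_transform_and_lower(exported, partitioner=[partitioner])

    # Serialize the result; step 3 loads this file in the ExecuTorch runtime.
    program = edge.to_executorch()
    with open(out_path, "wb") as f:
        f.write(program.buffer)
    return out_path
```

A call would look like lower_to_pte(my_model, (example_tensor,), MLXPartitioner()), with MLXPartitioner imported from wherever the ExecuTorch MLX backend exposes it.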

The delegate supports approximately 90 ATen operations, including quantized matmul, multi-head attention, rotary position embeddings and Mixture-of-Experts routing.
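
To illustrate what partitioning against a supported-operator list means, here is a toy sketch in plain Python. It is not the MLXPartitioner implementation: a real exported graph is a DAG rather than a flat sequence, and the operator names and the SUPPORTED_OPS set below are made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical, tiny stand-in for the delegate's supported-operator list.
SUPPORTED_OPS = {"aten.mm", "aten.add", "aten.softmax", "aten.mul"}


@dataclass
class Segment:
    delegated: bool  # True if this run of ops would go to the delegate
    ops: list


def partition(ops, supported=SUPPORTED_OPS):
    """Group a linear op sequence into maximal runs of supported or unsupported ops."""
    segments = []
    for op in ops:
        delegated = op in supported
        if segments and segments[-1].delegated == delegated:
            # Same kind as the previous op: extend the current run.
            segments[-1].ops.append(op)
        else:
            # Kind changed: start a new run.
            segments.append(Segment(delegated, [op]))
    return segments


# "aten.sort" is treated as unsupported purely for the sake of the example.
graph = ["aten.mm", "aten.add", "aten.sort", "aten.softmax", "aten.mul"]
segments = partition(graph)
for seg in segments:
    target = "delegate" if seg.delegated else "fallback"
    print(f"{target}: {seg.ops}")
```

The point of the sketch is that one unsupported operator splits the graph into separate delegated subgraphs, which is why the breadth of operator coverage matters for performance.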

Which models are supported?

Is Voxtral truly ready for live transcription?

Yes. The MLX Delegate supports Mistral Voxtral Realtime (4B) with live microphone input, so streaming transcription runs in real time directly on a Mac without an internet connection.

Full list of supported models:

LLMs: Llama 3.2 (1B), Qwen 3 (0.6B, 1.7B, 4B), Phi-4 mini (3.8B), Gemma 3 (1B, 4B)
MoE models: Qwen 3.5 35B-A3B with 256 experts and top-8 routing
Speech-to-text: OpenAI Whisper (tiny to large-v3-turbo), NVIDIA Parakeet TDT (0.6B), Mistral Voxtral (3B)
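
For readers unfamiliar with the "256 experts and top-8 routing" in the MoE entry, the following toy sketch shows the general top-k routing idea in plain Python. It is not code from ExecuTorch or Qwen. The function name is hypothetical, the router logits are random, and renormalizing the selected weights is an assumption of this sketch.

```python
import math
import random


def route_top_k(router_logits, k=8):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all experts, shifted by the max for numerical stability.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected weights sum to 1 (an assumption of this sketch).
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]  # one token, 256 experts
selected = route_top_k(logits, k=8)
active_fraction = len(selected) / len(logits)  # share of experts run for this token
```

With 8 of 256 experts selected, only about 3% of the experts run for each token, which is how a 35B-parameter model can have roughly 3B active parameters per token.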

Models can run in BF16, FP16 or FP32 precision. Quantized execution is available with 2/4/8-bit affine quantization via TorchAO, as well as NVFP4.
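
As a rough illustration of what affine quantization does, the sketch below maps a handful of floats to 4-bit integer codes with one scale and one offset, then maps them back. It is a toy in plain Python and does not use the TorchAO API. The function names and sample weights are hypothetical, and real schemes typically quantize weights in groups with their own storage layout.

```python
def quantize_affine(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using one scale and offset."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a constant input, where the range would be zero.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_affine(codes, scale, offset):
    """Recover approximate floats from integer codes."""
    return [c * scale + offset for c in codes]


weights = [-0.82, -0.31, 0.0, 0.12, 0.47, 0.9]
codes, scale, offset = quantize_affine(weights, bits=4)
restored = dequantize_affine(codes, scale, offset)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Fewer bits mean a larger step between codes and therefore a larger worst-case error, which is the accuracy cost traded for the smaller memory footprint.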

Limitations and status

The delegate is marked as experimental, so APIs and supported features may change. Acceleration is available only on Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU support; Intel Macs are not supported. All other platforms (Android, Linux, Windows) continue to use the existing ExecuTorch delegates.

Source code is available in the PyTorch ExecuTorch repository on GitHub.

Frequently Asked Questions

What is ExecuTorch?

ExecuTorch is PyTorch's runtime for on-device inference, meaning AI models run directly on the device without the cloud. Models are exported via torch.export and executed on various hardware backends through a unified API.

What is Apple MLX?

Apple MLX is an open-source machine learning framework optimized for Apple Silicon (M1/M2/M3/M4). It uses a unified memory model and Apple's Metal GPU kernels for maximum performance on Mac computers.

What does 'delegate' mean in the ExecuTorch context?

A delegate is a backend module that ExecuTorch uses to hand off (delegate) part of the computation to specific hardware or a specific framework, in this case Apple MLX. MLXPartitioner automatically identifies subgraphs that can be accelerated via MLX and delegates them to the Apple Silicon GPU.
