PyTorch: ExecuTorch MLX Delegate delivers 3–6× faster model execution on Apple Silicon GPUs
The PyTorch team released the experimental ExecuTorch MLX Delegate, a backend that uses the Apple MLX framework and Metal GPU kernels to deliver 3 to 6 times higher throughput on Apple Silicon chips. It supports Llama 3.2, Qwen 3, Phi-4 mini, Whisper, and Voxtral real-time streaming transcription.
This article was generated using artificial intelligence from primary sources.
The PyTorch team released the experimental ExecuTorch MLX Delegate, a new backend that accelerates PyTorch models on macOS using the Apple MLX framework and optimized Metal GPU kernels. As a result, generative AI workloads achieve 3 to 6 times higher throughput than with existing ExecuTorch delegates on macOS.
How does the ExecuTorch MLX Delegate work?
ExecuTorch is PyTorch’s stack for on-device inference: a model is exported via torch.export and then lowered into the .pte format, which the ExecuTorch runtime executes. The MLX Delegate adds a new step: MLXPartitioner analyzes the exported graph and delegates compatible subgraphs directly to Apple MLX, which executes them on the Apple Silicon GPU.
The workflow has three steps:
- Model export with torch.export
- Lowering with to_edge_transform_and_lower using MLXPartitioner
- Running the .pte file through the ExecuTorch runtime
The delegate supports approximately 90 ATen operations, including quantized matmul, multi-head attention, rotary position embeddings and Mixture-of-Experts routing.
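To make the partitioning idea concrete, here is a toy sketch in plain Python. It is not the MLXPartitioner implementation: the `SUPPORTED_OPS` allow-list, the `Segment` type, and the `partition` function are hypothetical names invented for illustration, and the graph is simplified to a linear sequence of operator names, whereas a real exported graph is a DAG.

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical allow-list for illustration only; the real delegate is
# reported to cover roughly 90 ATen operations.
SUPPORTED_OPS: Set[str] = {"aten.mm", "aten.add", "aten.mul", "aten.softmax"}


@dataclass
class Segment:
    delegated: bool  # True: handed to the backend; False: stays on the default path
    ops: List[str]


def partition(ops: List[str], supported: Set[str] = SUPPORTED_OPS) -> List[Segment]:
    """Group a linear op sequence into maximal runs of supported or unsupported ops."""
    segments: List[Segment] = []
    for op in ops:
        is_supported = op in supported
        if segments and segments[-1].delegated == is_supported:
            # Same kind as the previous op: extend the current run.
            segments[-1].ops.append(op)
        else:
            # Kind changed (or first op): start a new run.
            segments.append(Segment(delegated=is_supported, ops=[op]))
    return segments


if __name__ == "__main__":
    for seg in partition(["aten.mm", "aten.add", "aten.topk", "aten.softmax"]):
        target = "backend" if seg.delegated else "fallback"
        print(target, seg.ops)
```

Grouping adjacent supported operators into one run matters because each switch between the delegate and the default execution path adds overhead, so fewer, larger delegated subgraphs are generally preferable.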
Which models are supported?
Is Voxtral truly ready for live transcription?
Yes — the MLX Delegate supports Mistral Voxtral Realtime (4B) with live microphone input for real-time streaming transcription directly on a Mac, without an internet connection.
Full list of supported models:
- LLMs: Llama 3.2 (1B), Qwen 3 (0.6B, 1.7B, 4B), Phi-4 mini (3.8B), Gemma 3 (1B, 4B)
- MoE models: Qwen 3.5 35B-A3B with 256 experts and top-8 routing
- Speech-to-text: OpenAI Whisper (tiny to large-v3-turbo), NVIDIA Parakeet TDT (0.6B), Mistral Voxtral (3B)
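The "256 experts and top-8 routing" figure for the MoE entry above can be illustrated with a small sketch. It assumes the common scheme in which a router produces one score per expert, the k highest-scoring experts are selected, and their scores are softmax-normalized into mixing weights. The exact normalization differs between models, and `route_top_k` is a hypothetical helper, not part of ExecuTorch or MLX.

```python
import math
from typing import List, Tuple


def route_top_k(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router scores.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    # 256 made-up router scores for a single token; only 8 experts are activated.
    scores = [float(i % 17) - 0.01 * i for i in range(256)]
    for expert, weight in route_top_k(scores, 8):
        print(expert, round(weight, 4))
```

Only the selected experts run for a given token, which is why a model with 35B total parameters can have about 3B active parameters per token.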
Models can run in BF16, FP16 or FP32 precision, or with 2/4/8-bit affine quantization via TorchAO, as well as NVFP4.
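As a rough illustration of what affine quantization means, the sketch below maps a list of floats onto unsigned integer codes using a scale and an offset, then reconstructs approximate values. It is a simplified assumption-laden example: it treats the whole list as one group, whereas practical schemes typically quantize weights in small groups, and `quantize_affine` and `dequantize_affine` are hypothetical helpers, not TorchAO functions.

```python
from typing import List, Tuple


def quantize_affine(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Map floats to integer codes in [0, 2**bits - 1]; return codes, scale, offset."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a zero range (all values equal).
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp to the valid code range to absorb floating-point rounding.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_affine(codes: List[int], scale: float, offset: float) -> List[float]:
    """Reconstruct approximate floats: value is roughly code * scale + offset."""
    return [c * scale + offset for c in codes]


if __name__ == "__main__":
    weights = [-0.8, -0.1, 0.0, 0.35, 1.2]
    codes, scale, offset = quantize_affine(weights, bits=4)
    print(codes)
    print(dequantize_affine(codes, scale, offset))
```

With fewer bits the scale grows, so each code covers a wider range of values and the reconstruction error increases. That is the trade-off between 2-, 4- and 8-bit settings.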
Limitations and status
The delegate is marked as experimental — APIs and supported features may change. Acceleration is available exclusively on Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU support; Intel Mac computers are not supported. All other platforms (Android, Linux, Windows) continue to use existing ExecuTorch delegates.
Source code is available in the PyTorch ExecuTorch repository on GitHub.
Frequently Asked Questions
- What is ExecuTorch?
- ExecuTorch is PyTorch's runtime for on-device inference — running AI models directly on the device, without the cloud. It enables model export via torch.export and execution on various hardware backends through a unified API.
- What is Apple MLX?
- Apple MLX is an open-source machine learning framework optimized for Apple Silicon (M1/M2/M3/M4). It uses a unified memory model and Apple's Metal GPU kernels for maximum performance on Mac computers.
- What does 'delegate' mean in the ExecuTorch context?
- A delegate is a backend module that ExecuTorch uses to hand off (delegate) part of the computation to specific hardware or a specific framework, in this case Apple MLX. MLXPartitioner automatically identifies subgraphs that can be accelerated via MLX and delegates them to the Apple Silicon GPU.