Miles: A PyTorch-Native Open-Source Framework for RL Post-Training of Frontier-Scale LLMs

RadixArk releases Miles, an open-source reinforcement learning framework that integrates SGLang, Megatron-LM, Ray, and PyTorch into a single production-tested stack for post-training large language models on Hopper and Blackwell GPUs.

This article was generated using artificial intelligence from primary sources.

RadixArk has released Miles, an open-source reinforcement learning (RL) framework for post-training frontier-scale large language models, as a contribution to the PyTorch ecosystem. Miles addresses one of the most demanding engineering problems in modern LLM development: coordinating rollout generation, distributed training, and weight synchronization across specialized hardware at cluster scale.

Why Is RL Post-Training So Demanding?

Training LLMs with methods such as RLHF or rule-based reinforcement learning is not just a matter of algorithms; it is a distributed systems problem. The rollout phase generates samples through model inference, the training phase updates weights, and the two must be coordinated continuously with minimal idle time. At frontier scale, with hundreds of GPUs, network bandwidth, fault tolerance, and numerical consistency all become hard constraints.

Miles addresses that coordination complexity with a single integrated stack combining four established components: SGLang for high-throughput rollout generation, Megatron-LM as a scalable backend for distributed training, Ray for cluster orchestration and actor lifecycle management, and PyTorch for models, autograd, and distribution primitives.
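The division of labor among these components can be pictured as a simple rollout, train, sync loop. The sketch below is a toy in plain Python, not Miles's API: `RolloutEngine`, `Trainer`, and `run_loop` are hypothetical names standing in for the roles the article assigns to SGLang (generation), Megatron-LM (weight updates), and Ray (orchestration).

```python
from dataclasses import dataclass, field


@dataclass
class RolloutEngine:
    """Hypothetical stand-in for the inference side (SGLang's role)."""
    weights_version: int = 0

    def generate(self, prompts):
        # Tag every sample with the weights version that produced it.
        return [{"prompt": p, "response": p.upper(), "version": self.weights_version}
                for p in prompts]

    def load_weights(self, version):
        self.weights_version = version


@dataclass
class Trainer:
    """Hypothetical stand-in for the training side (Megatron-LM's role)."""
    version: int = 0
    seen: list = field(default_factory=list)

    def step(self, samples):
        self.seen.extend(samples)
        self.version += 1  # one update produces a new weights version
        return self.version


def run_loop(prompts, num_iterations):
    """Orchestration role (Ray's job in the real stack): rollout, train, sync."""
    engine, trainer = RolloutEngine(), Trainer()
    for _ in range(num_iterations):
        samples = engine.generate(prompts)    # 1. rollout
        new_version = trainer.step(samples)   # 2. training update
        engine.load_weights(new_version)      # 3. weight synchronization
    return engine, trainer


engine, trainer = run_loop(["a", "b"], num_iterations=3)
```

In this synchronous form the rollout engine always samples from the latest weights, at the cost of each side idling while the other works.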

“Small Core, Many Extensions” Design

Miles’s core philosophy is a compact training loop with explicit extension points. Rather than forking the framework for each new experiment, users plug in their own rollout functions, task-specific reward functions, RL loss functions, sample filters, and training hooks for metrics and diagnostics. The design is meant to keep large experiments reproducible without accumulating ad hoc infrastructure.
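To make the idea of extension points concrete, here is a minimal sketch of how a reward function and a sample filter could be passed into a fixed scoring step as plain callables. The function names and the sample layout (`response`, `answer`, `reward` keys) are assumptions made for illustration and are not taken from the Miles codebase.

```python
from typing import Callable, Dict, Iterable, List

Sample = Dict[str, object]
RewardFn = Callable[[Sample], float]
FilterFn = Callable[[Sample], bool]


def exact_match_reward(sample: Sample) -> float:
    """Rule-based reward: 1.0 if the response equals the reference answer."""
    return 1.0 if sample["response"] == sample["answer"] else 0.0


def drop_empty(sample: Sample) -> bool:
    """Sample filter: keep only samples with a non-empty response."""
    return bool(sample["response"])


def score_batch(samples: List[Sample], reward_fn: RewardFn,
                filters: Iterable[FilterFn] = ()) -> List[Sample]:
    """Fixed core step: apply every filter, then attach a reward to survivors."""
    filters = list(filters)
    kept = [s for s in samples if all(f(s) for f in filters)]
    return [{**s, "reward": reward_fn(s)} for s in kept]


batch = [
    {"response": "4", "answer": "4"},
    {"response": "", "answer": "7"},
    {"response": "5", "answer": "6"},
]
scored = score_batch(batch, exact_match_reward, filters=[drop_empty])
```

Swapping in a different reward or filter changes the experiment without touching `score_batch`, which is the property the "small core" approach is after.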

A key feature is MoE-aware Routing Replay, a mechanism that preserves the routing decisions of Mixture-of-Experts (MoE) models across the boundary between the rollout and training phases. Without this consistency, the assignment of inputs to experts can differ between sample generation and gradient updates, undermining convergence.
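The problem routing replay targets can be shown with a toy top-k router. In the sketch below, a tiny numerical difference between the rollout and training computations flips a near-tie, so recomputing the routing picks a different expert than the one used during generation; replaying the recorded choice avoids that. The function names and logit values are illustrative assumptions, and the article does not describe how Miles implements the mechanism internally.

```python
def top_k_experts(logits, k):
    """Indices of the k largest router logits, ties broken by lower index."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return sorted(order[:k])


def route(token_logits, k, replay=None):
    """Choose experts per token; if `replay` is given, reuse recorded choices."""
    if replay is not None:
        return replay
    return [top_k_experts(logits, k) for logits in token_logits]


# Router logits for two tokens as computed during rollout.
rollout_logits = [[0.50, 0.51, 0.10], [0.90, 0.20, 0.30]]
# The same tokens in the trainer: small numerical drift flips the near-tie.
train_logits = [[0.51, 0.50, 0.10], [0.90, 0.20, 0.30]]

recorded = route(rollout_logits, k=1)                  # choices made at rollout
recomputed = route(train_logits, k=1)                  # trainer routing on its own
replayed = route(train_logits, k=1, replay=recorded)   # trainer reusing rollout routing

mismatches = sum(a != b for a, b in zip(recorded, recomputed))
```

Here `mismatches` counts the tokens whose expert would change without replay; with replay the trainer computes gradients through the same experts that produced the sample.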

Asynchronous RL and Weight Synchronization

Miles supports fully asynchronous RL with continuous sample streaming: rollout and training phases can be fully separated or colocated, depending on available hardware and experiment requirements. Weight synchronization between phases is handled through NCCL/RDMA channels, minimizing parameter transfer latency. GPU-aware Ray placement groups ensure that actors are placed optimally with respect to network topology and rack layout.
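A minimal way to picture asynchronous RL with sample streaming is a producer and a consumer joined by a bounded queue. The sketch below uses only the Python standard library and hypothetical names (`PolicyVersion`, `rollout_worker`, `train_loop`); in the real stack the queue would correspond to samples flowing between Ray actors and the version bump to a weight transfer over NCCL/RDMA. It tracks how many versions old each sample is when it is consumed, which is the staleness that asynchronous setups have to tolerate.

```python
import queue
import threading

SENTINEL = None  # marks the end of the sample stream


class PolicyVersion:
    """Shared counter standing in for the latest synchronized weights."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def bump(self):
        with self._lock:
            self._value += 1


def rollout_worker(out_q, policy, num_samples):
    for i in range(num_samples):
        # put() blocks when the queue is full, so rollout cannot run far ahead.
        out_q.put({"id": i, "version": policy.get()})
    out_q.put(SENTINEL)


def train_loop(in_q, policy, batch_size):
    staleness, batch = [], []
    while True:
        sample = in_q.get()
        if sample is SENTINEL:
            break  # any incomplete final batch is dropped in this toy
        batch.append(sample)
        if len(batch) == batch_size:
            current = policy.get()
            staleness.extend(current - s["version"] for s in batch)
            policy.bump()  # one training step, then publish new weights
            batch = []
    return staleness


def run(num_samples=12, batch_size=4, queue_size=4):
    q, policy = queue.Queue(maxsize=queue_size), PolicyVersion()
    producer = threading.Thread(target=rollout_worker, args=(q, policy, num_samples))
    producer.start()
    staleness = train_loop(q, policy, batch_size)
    producer.join()
    return staleness, policy.get()


staleness, final_version = run()
```

The exact staleness values depend on thread scheduling, but they are never negative, and the queue size bounds how far generation can drift from the current policy.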

For long-running workloads — experiments can last a week or more — Miles uses Ray’s supervision model for fault tolerance: the failure of a single worker does not cause the entire experiment to crash.
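The supervision idea can be sketched without Ray: a supervisor catches a worker failure, replaces only that worker, and retries the step while keeping the results gathered so far. Everything below (`Worker`, `supervise`, the simulated fault set) is hypothetical and written for illustration. In Ray itself, actor restarts are configured through options such as `max_restarts`; how Miles sets them is not stated in the article.

```python
class WorkerCrashed(Exception):
    """Raised when a simulated worker dies."""


class Worker:
    def __init__(self, pending_failures):
        # Shared set of steps that should crash once (simulated faults).
        self.pending_failures = pending_failures

    def run_step(self, step):
        if step in self.pending_failures:
            self.pending_failures.discard(step)
            raise WorkerCrashed(f"worker died at step {step}")
        return step * step  # placeholder for real work


def supervise(make_worker, steps, max_restarts):
    worker, restarts, results = make_worker(), 0, []
    for step in steps:
        while True:
            try:
                results.append(worker.run_step(step))
                break
            except WorkerCrashed:
                if restarts == max_restarts:
                    raise  # give up only once the restart budget is spent
                restarts += 1
                worker = make_worker()  # replace the failed worker, keep results
    return results, restarts


faults = {2, 5}
results, restarts = supervise(lambda: Worker(faults), range(6), max_restarts=3)
```

Two simulated crashes cost two restarts, and the run still completes every step.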

Precisions and LoRA Support

Miles provides a unified pipeline covering both phases with support for low-precision formats: BF16, FP8, MXFP8, and INT4-QAT (quantization-aware training). Each precision is available across rollout and training without manual conversion management. LoRA (Low-Rank Adaptation) is also supported on both paths, which is practical for parameter-efficient post-training on models that do not fit in memory at full precision.
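As background on the INT4-QAT entry, quantization-aware training is commonly built on "fake quantization": weights are rounded to the low-precision grid and immediately converted back, so the forward pass sees the rounding error the deployed model will have. The sketch below shows symmetric per-tensor INT4 fake quantization in plain Python. It is a generic illustration; the scaling scheme and function name are assumptions, not a description of Miles's kernels.

```python
def fake_quant_int4(weights):
    """Quantize to signed 4-bit integers in [-8, 7], then dequantize."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # map the largest weight to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    dequantized = [c * scale for c in codes]
    return dequantized, codes, scale


weights = [1.4, -0.6, 0.25, 0.0]
dequantized, codes, scale = fake_quant_int4(weights)
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
```

Each weight lands on one of 16 levels, and the rounding error per weight is at most half the scale. During training, the rounding step is typically treated as the identity in the backward pass (the straight-through estimator) so that gradients still reach the underlying weights.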

Production Validation on Frontier Models

Miles is not merely a research prototype. The framework has been production-tested on some of the most prominent open-source models released in the first half of 2026: DeepSeek-V4, Kimi K2.5 and K2.6, GLM-5 and GLM-5.1, and Qwen3.5 and Qwen3.6. All of these models come with ready-to-run recipes included in the repository, significantly reducing the time a new user needs to get their own experiment running.

Hardware support covers NVIDIA Hopper and Blackwell GPU architectures, with GPU-aware scheduling that exploits the characteristics of both hardware generations.

Practical Significance for the Community

Miles’s contribution to the PyTorch ecosystem matters for several reasons. First, it consolidates four tools that are commonly used together but without a standardized interface. Second, it provides a reference implementation for asynchronous RL that is reproducible and operational in production. Third, the pluggable architecture means researchers can experiment with new algorithms without needing to understand all the distributed details of the stack.

The project is available on GitHub under the PyTorch organization and already includes documentation, recipes for the listed models, and guides for adapting individual components.

Frequently Asked Questions

What is Miles and who developed it?
Miles is an open-source reinforcement learning framework developed by RadixArk. It is designed for RL post-training of frontier-scale LLMs and is built natively on PyTorch, combining SGLang, Megatron-LM, and Ray.

Which models have been production-tested with Miles?
Miles has been production-tested on DeepSeek-V4, Kimi K2.5 and K2.6, GLM-5 and GLM-5.1, and Qwen3.5 and Qwen3.6. All of these models have ready-to-run recipes included in the repository.

Which precisions and GPU architectures does Miles support?
Miles supports BF16, FP8, MXFP8, and INT4-QAT precisions through a unified pipeline covering both rollout and training. Hardware support covers NVIDIA Hopper and Blackwell GPUs.