AMix-2: Proteins as a Modality in LLMs

AMix-2 is a protein-text foundation model that unifies protein understanding and sequence design in a shared token space. It uses a block-wise diffusion language backbone, introduces the ProteinArena benchmark, and outperforms frontier LLMs while competing with specialized protein models.

A new paper on arXiv presents AMix-2, a foundation model that treats proteins as a native modality within a large language model. Instead of relying on separate, task-specific models, AMix-2 places natural language and protein sequences in a shared token space. This unifies protein understanding and conditional sequence design in a single system capable of biological reasoning.
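As a rough illustration of what a shared token space can look like, the sketch below maps text tokens and amino-acid residues into one id space. The token names (`<protein>`, `<aa:M>`), the whitespace word splitting, and both helper functions are assumptions made for this example; the article does not describe AMix-2's actual tokenizer.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_shared_vocab(text_tokens):
    """Map text tokens, delimiters, and residue tokens to a single id space.

    Hypothetical layout: the delimiter and residue token names are invented
    for illustration and are not taken from the paper.
    """
    tokens = ["<unk>", "<protein>", "</protein>"]
    tokens += list(text_tokens)
    tokens += [f"<aa:{aa}>" for aa in AMINO_ACIDS]
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))  # first occurrence keeps its id
    return vocab


def encode(text, protein, vocab):
    """Encode a text prompt followed by a delimited protein sequence."""
    unk = vocab["<unk>"]
    ids = [vocab.get(word, unk) for word in text.lower().split()]
    ids.append(vocab["<protein>"])
    ids += [vocab.get(f"<aa:{aa}>", unk) for aa in protein]
    ids.append(vocab["</protein>"])
    return ids


vocab = build_shared_vocab(["describe", "this", "protein"])
ids = encode("Describe this protein", "MKT", vocab)
print(ids)
```

Because both modalities end up in one id sequence, the same model can read a text prompt and emit residues, or read residues and emit text, without a separate encoder per task.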

How does the block-wise diffusion backbone work?

The model’s foundation is a block-wise diffusion language model. Blocks are generated causally, each conditioned on the blocks before it, while the tokens inside a block see bidirectional context and are refined iteratively. The authors state that such a structure better reflects the nature of proteins than strictly left-to-right generation. In the paper's controlled experiments, the diffusion approach generally outperformed its autoregressive counterpart.
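The attention pattern this implies can be sketched as a block-causal mask: a position may attend to everything in its own block, including later positions, and to all earlier blocks, but not to later blocks. This is a minimal sketch of that general pattern, not the paper's implementation; the function name and the fixed `block_size` are assumptions.

```python
def block_attention_mask(seq_len, block_size):
    """Return mask[i][j] == True if position i may attend to position j.

    Causal between blocks, bidirectional within a block.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = []
    for i in range(seq_len):
        block_i = i // block_size
        # Allow every position in the same block or in an earlier block.
        mask.append([j // block_size <= block_i for j in range(seq_len)])
    return mask


mask = block_attention_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` this reduces to an ordinary causal mask, and with `block_size=seq_len` it becomes fully bidirectional, so the block size sets where a model sits between autoregressive and whole-sequence denoising.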

What is ProteinArena?

The team also introduces ProteinArena, a comprehensive evaluation framework. It includes time-aware and homology-aware protocols across a range of understanding and design tasks, with comparisons against classic bioinformatics tools, specialized protein models, and language models. The goal is a fairer and more realistic measurement of true generalization.
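To make the idea of time-aware and homology-aware evaluation concrete, here is a minimal sketch of such a split. Everything in it is an assumption for illustration: the record fields, the cutoff date, the 0.3 threshold, and especially the k-mer Jaccard score, which is only a crude stand-in for the alignment-based sequence identity a real protocol would compute. It does not reproduce ProteinArena's actual procedure.

```python
from datetime import date


def kmer_set(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (toy proxy for sequence identity)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def split_time_and_homology(records, cutoff, max_similarity=0.3):
    """Train on records released before `cutoff`; test only on later
    records that are not too similar to any training sequence."""
    train = [r for r in records if r["released"] < cutoff]
    later = [r for r in records if r["released"] >= cutoff]
    test = [
        r for r in later
        if all(kmer_similarity(r["seq"], t["seq"]) <= max_similarity
               for t in train)
    ]
    return train, test


# Hypothetical records: B is a near-duplicate of A, C is unrelated.
records = [
    {"id": "A", "seq": "MKTAYIAKQRQISFVKSHFSRQ", "released": date(2020, 1, 1)},
    {"id": "B", "seq": "MKTAYIAKQRQISFVKSHFSRA", "released": date(2024, 6, 1)},
    {"id": "C", "seq": "GSSHHHHHHSSGLVPRGSHMAS", "released": date(2024, 7, 1)},
]
train, test = split_time_and_homology(records, cutoff=date(2023, 1, 1))
print([r["id"] for r in train], [r["id"] for r in test])
```

The time cutoff guards against test proteins having been seen during training, and the similarity filter guards against close relatives of training proteins inflating the score.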

How good is it?

According to the reported results, AMix-2 outperforms frontier LLMs and is competitive with task-specific protein models. The paper runs to 30 pages, with 4 figures and 12 tables, and was submitted on May 29, 2026. It comes from a large team of researchers led by Keyue Qiu.

Frequently Asked Questions

What is AMix-2?

AMix-2 is a foundation model that treats proteins as a native modality within a large language model, unifying protein understanding and protein sequence design in a single model.

What is ProteinArena?

ProteinArena is a new benchmark introduced in the paper with time-aware and homology-aware protocols for fairly measuring protein understanding and design tasks.

arXiv:2605.30963: AMix-2 Introduces Proteins as a Native Modality in LLMs
