arXiv:2605.30963: AMix-2 Introduces Proteins as a Native Modality in LLMs
AMix-2 is a protein-text foundation model that unifies protein understanding and sequence design in a shared token space. It uses a block-wise diffusion language backbone, introduces the ProteinArena benchmark, and outperforms frontier LLMs while competing with specialized protein models.
This article was generated using artificial intelligence from primary sources.
A new paper on arXiv presents AMix-2, a foundation model that treats proteins as a native modality within a large language model. Instead of relying on separate, task-specific models, AMix-2 places natural language and protein sequences in a shared token space. This unifies protein understanding and conditional sequence design in a single system capable of biological reasoning.
How does the block-wise diffusion backbone work?
The model’s foundation is a block-wise diffusion language model. This approach combines causal generation between blocks with bidirectional context and iterative refinement within each block. The authors state that such a structure better reflects the nature of proteins than strictly left-to-right generation. Controlled experiments showed that the diffusion approach generally outperforms its autoregressive counterpart.
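The visibility pattern described above (causal between blocks, bidirectional within a block) can be sketched as an attention mask. This is only an illustration and is not taken from the paper: the function name `block_causal_mask`, the fixed `block_size`, and the plain list-of-lists representation are all assumptions, and the iterative refinement (denoising) steps inside each block are not modeled here.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return a mask where mask[i][j] is True if position i may attend to position j.

    Hypothetical helper for illustration: positions are grouped into
    consecutive blocks of `block_size` tokens. A position sees every
    position in its own block (bidirectional context) and every position
    in earlier blocks (causal ordering between blocks), but nothing in
    later blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Render the mask for 6 tokens in blocks of 3: 'x' = visible, '.' = hidden.
    for row in block_causal_mask(seq_len=6, block_size=3):
        print("".join("x" if visible else "." for visible in row))
```

With six tokens and a block size of three, position 0 can attend to position 2 (same block) but not to position 3 (a later block), while every position in the second block sees the whole first block. A strictly left-to-right model would instead hide position 2 from position 0.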
What is ProteinArena?
The team introduced ProteinArena, a comprehensive evaluation framework. It contains time-aware and homology-aware protocols across various understanding and design tasks, with comparisons against classic bioinformatics tools, specialized protein models and language models. The goal is a fairer and more realistic measurement of true generalization.
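To make the idea of time-aware and homology-aware evaluation concrete, here is a minimal sketch of how such a split could be built. It is not ProteinArena's actual protocol, which the article does not detail. The `ProteinRecord` type, the `split_records` function, the cutoff date, and the 0.3 threshold are hypothetical, and `difflib.SequenceMatcher` is a crude stand-in for the alignment-based sequence identity that real homology filtering would use.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass(frozen=True)
class ProteinRecord:
    """Hypothetical record: a named sequence with its release date."""
    name: str
    sequence: str
    released: date


def crude_similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]; an assumption standing in for
    alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def split_records(
    records: list[ProteinRecord],
    cutoff: date,
    max_similarity: float = 0.3,
) -> tuple[list[ProteinRecord], list[ProteinRecord]]:
    """Split records into (train, test).

    Time-aware: only records released on or after `cutoff` may enter the test set.
    Homology-aware: a test candidate is dropped if it is too similar to any
    training sequence, so the test set probes generalization, not recall.
    """
    train = [r for r in records if r.released < cutoff]
    candidates = [r for r in records if r.released >= cutoff]
    test = [
        r
        for r in candidates
        if all(
            crude_similarity(r.sequence, t.sequence) <= max_similarity
            for t in train
        )
    ]
    return train, test
```

A recent protein that is a near-copy of a training sequence would be excluded from the test set under this sketch, while a recent protein with no close training relative would be kept.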
How does AMix-2 perform?
According to the results, AMix-2 outperforms frontier LLMs and shows competitive performance against task-specific protein models. The paper spans 30 pages, with 4 figures and 12 tables, and was submitted on May 29, 2026. Behind it stands a large team of researchers led by Keyue Qiu.
Frequently Asked Questions
- What is AMix-2?
- AMix-2 is a foundation model that treats proteins as a native modality within a large language model, unifying protein understanding and protein sequence design in a single model.
- What is ProteinArena?
- ProteinArena is a new benchmark introduced in the paper with time-aware and homology-aware protocols for fairly measuring protein understanding and design tasks.