arXiv:2604.21508 BioMiner: multimodal AI extracts protein-ligand bioactivity from literature, 5.59× faster than manual work
Why it matters
On April 23, 2026, Jiaxian Yan and colleagues published BioMiner, a multimodal AI system for automated extraction of protein-ligand bioactivity data from scientific literature. The system processes text, tables and molecular structures, achieves an F1 score of 0.32 on the new BioVista benchmark (16,457 entries from 500 publications) and, in a demonstration application, extracts 82,262 data points from 11,683 papers.
A large team of authors led by Jiaxian Yan (including Jintao Zhu, Yuhang Yang, Qi Liu, Kai Zhang, Zaixi Zhang, Xukai Liu, Boyan Zhang, Kaiyuan Gao, Jinchuan Xiao and Enhong Chen) released the paper “BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature” (arXiv:2604.21508) on April 23, 2026. The work targets one of the hardest bottlenecks in modern drug discovery.
Why is manual data extraction a bottleneck?
The development of a new drug relies on protein-ligand bioactivity data: measurements of how strongly a given molecule binds a target protein. These data are scattered across tens of thousands of scientific publications, typically as a combination of text (protocol descriptions), tables (numerical IC50/Ki values) and images (molecular structures, often drawn in so-called Markush notation, which represents a whole class of structurally related compounds with one generic scaffold). Manual curation of a single paper can take hours, a pace that cannot keep up with the rate at which new literature is published.
How does BioMiner work?
The system explicitly separates semantic interpretation from structure construction. For bioactivity semantics, BioMiner uses direct LLM reasoning. For chemical structures, the authors introduce a chemical-structure-grounded visual semantic reasoning paradigm: a multimodal LLM operates on visual representations grounded in chemical rules and infers how the depicted structures relate to one another, while exact molecular construction is delegated to specialized chemical tools (software of the RDKit type). This matters because LLMs on their own often hallucinate structurally impossible molecules.
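The division of labor described above can be illustrated with a toy sketch. It is not the authors' implementation: the scaffold string, the `[R1]`/`[R2]` placeholder convention, the compound labels and the `assemble` helper are all hypothetical, and plain string substitution stands in for the cheminformatics toolkit a real pipeline would use. The point is only that the "reasoning" step outputs a structured assignment, and a deterministic step builds the molecule from it.

```python
import re

# Matches named attachment points such as [R1] or [R2] in a scaffold string.
PLACEHOLDER = re.compile(r"\[(R\d+)\]")


def assemble(scaffold: str, r_groups: dict) -> str:
    """Deterministically fill each [Rn] placeholder with its substituent.

    Toy stand-in for a chemistry toolkit: it only handles simple acyclic
    substituents written starting from the attaching atom.
    """
    def fill(match):
        name = match.group(1)
        if name not in r_groups:
            raise KeyError(f"no substituent given for {name}")
        fragment = r_groups[name]
        # Ring-closure digits could collide with the scaffold's own ring
        # numbering, so this toy version refuses them outright.
        if any(ch.isdigit() for ch in fragment):
            raise ValueError(f"substituent for {name} contains ring digits")
        return fragment

    return PLACEHOLDER.sub(fill, scaffold)


# Hypothetical output of the semantic step: one generic (Markush-like)
# scaffold plus a table saying which substituent each compound carries.
scaffold = "c1cc([R1])ccc1[R2]"
assignments = {
    "1a": {"R1": "C", "R2": "O"},
    "1b": {"R1": "OC", "R2": "Cl"},
}

structures = {cid: assemble(scaffold, groups) for cid, groups in assignments.items()}
for cid, smiles in structures.items():
    print(cid, smiles)
```

Because the construction step is rule-based, a missing or malformed assignment raises an error instead of silently producing an invented structure, which is the failure mode the separation is meant to avoid.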
What are the concrete results?
The authors introduce a new benchmark, BioVista, with 16,457 bioactivity entries from 500 publications, a significant contribution to the community. BioMiner achieves an F1 score of 0.32 for bioactivity triplets on this benchmark, which the authors present as the first quantitative baseline for the task.
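For readers unfamiliar with the metric, the following minimal sketch shows how an F1 score over extracted triplets can be computed. It assumes a triplet is a (protein, ligand, activity value) tuple and that a prediction counts only on exact match; both are assumptions made for illustration, not the benchmark's documented matching rules, and the example records are invented.

```python
def triplet_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of triplets.

    A predicted triplet is a true positive only if it matches a gold
    triplet exactly (an assumption of this sketch).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: four curated triplets, three extracted, two correct.
gold = [
    ("NLRP3", "cmpd-1", "IC50=12 nM"),
    ("NLRP3", "cmpd-2", "IC50=40 nM"),
    ("NLRP3", "cmpd-3", "IC50=7 nM"),
    ("NLRP3", "cmpd-4", "IC50=150 nM"),
]
predicted = [
    ("NLRP3", "cmpd-1", "IC50=12 nM"),
    ("NLRP3", "cmpd-2", "IC50=40 nM"),
    ("NLRP3", "cmpd-3", "IC50=70 nM"),  # wrong value, so no match
]

precision, recall, f1 = triplet_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under exact matching, a single wrong element (protein, structure or value) invalidates the whole triplet, which helps explain why scores on end-to-end extraction tasks of this kind can look low in absolute terms.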
Practical value is demonstrated through three applications:
- Large-scale mining: 82,262 data points extracted from 11,683 papers, a pre-training base that improves downstream models by 3.9%
- Human-in-the-loop NLRP3 workflow: doubled the number of high-quality bioactivity records, delivered a 38.6% improvement across 28 QSAR models and identified 16 hit candidates with novel scaffolds
- PoseBusters annotation: 5.59× faster than manual work, with 5.75% better accuracy
Commercial pharmaceutical value
For pharmaceutical companies this is not merely an academic paper: it bears directly on the preclinical workflow. Less time spent on data curation means more time for actual medicinal chemistry work, and larger training data sets mean more accurate QSAR models and better lead compound selection. The identification of novel scaffolds for NLRP3 (a target linked to inflammatory diseases) is a concrete example of how the tool can contribute directly to the drug candidate pipeline.
This article was generated using artificial intelligence from primary sources.