arXiv:2604.21508 BioMiner: multimodal AI extracts protein-ligand bioactivity from literature, 5.59× faster than manual work
On April 23, 2026, Jiaxian Yan and colleagues published BioMiner, a multimodal AI system for automated extraction of protein-ligand bioactivity data from scientific literature. The system processes text, tables and molecular structures, achieves an F1 score of 0.32 on the new BioVista benchmark (16,457 entries from 500 publications) and, in a demonstration application, extracts 82,262 data points from 11,683 papers.
This article was generated using artificial intelligence from primary sources.
A large team of authors led by Jiaxian Yan (including Jintao Zhu, Yuhang Yang, Qi Liu, Kai Zhang, Zaixi Zhang, Xukai Liu, Boyan Zhang, Kaiyuan Gao, Jinchuan Xiao and Enhong Chen) published the paper “BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature” (arXiv:2604.21508) on April 23, 2026. The work targets one of the hardest bottlenecks in modern drug discovery.
Why is manual data extraction a bottleneck?
The development of a new drug relies on protein-ligand bioactivity data: measurements of how strongly a given molecule binds a target protein. These data are scattered across tens of thousands of scientific publications, typically as a combination of text (protocol descriptions), tables (numerical IC50/Ki values) and images (molecular structures, often in so-called Markush notation, which represents a class of structurally related compounds). Manual curation of a single paper can take hours, a pace that cannot keep up with the rate at which new literature is published.
How does BioMiner work?
The system explicitly separates semantic interpretation from structure construction. For bioactivity semantics, BioMiner uses direct LLM reasoning. For chemical structures, the authors introduce a chemical-structure-grounded visual semantic reasoning paradigm: a multimodal LLM operates on visual representations grounded in chemical rules and infers how the depicted structures relate to one another, while exact molecular construction is delegated to specialized chemical tools (RDKit-type software). This matters because LLMs on their own often hallucinate structurally impossible molecules.
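The division of labor described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the multimodal model returns a scaffold string with `[R1]`, `[R2]` placeholders plus a table of substituents, and that a deterministic routine then builds every concrete molecule. The function `enumerate_markush`, the placeholder syntax and the example data are hypothetical and are not taken from BioMiner; plain string substitution stands in for a real cheminformatics toolkit, which would also validate each resulting structure.

```python
from itertools import product


def enumerate_markush(scaffold, r_groups):
    """Expand a scaffold containing [R1], [R2], ... placeholders.

    Assumption: each substituent is a SMILES fragment whose first atom
    is the attachment point. This toy routine does NOT check chemical
    validity; a real pipeline would hand that to a chemistry toolkit.
    """
    labels = sorted(r_groups)
    missing = [lab for lab in labels if f"[{lab}]" not in scaffold]
    if missing:
        raise ValueError(f"scaffold has no placeholder for: {missing}")

    molecules = []
    # One concrete molecule per combination of substituents.
    for combo in product(*(r_groups[lab] for lab in labels)):
        smiles = scaffold
        for lab, fragment in zip(labels, combo):
            smiles = smiles.replace(f"[{lab}]", fragment)
        molecules.append(smiles)
    return molecules


# Hypothetical output of the "semantic" step: what varies, and where.
scaffold = "c1cc([R1])ccc1[R2]"
r_groups = {"R1": ["F", "Cl"], "R2": ["C", "OC"]}

# The "construction" step is deterministic code, not model output.
for smiles in enumerate_markush(scaffold, r_groups):
    print(smiles)
```

The point of the design is that the model only has to get the relationships right (which substituent goes at which position); the molecule strings themselves are produced by code that cannot invent atoms or bonds.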
What are the concrete results?
The authors establish a new benchmark, BioVista, with 16,457 bioactivity entries from 500 publications. BioMiner achieves an F1 score of 0.32 for bioactivity triplets on this benchmark, which the authors present as the first quantitative baseline for the task.
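To make the metric concrete, the sketch below computes precision, recall and F1 over sets of bioactivity triplets. It assumes a triplet is a (protein, ligand, activity) tuple and that a prediction counts only on exact match; the benchmark's actual matching rules may differ. The names `Triplet` and `triplet_f1` and the toy data are illustrative and do not come from the paper.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    protein: str
    ligand: str    # e.g. a SMILES string
    activity: str  # e.g. "IC50=12 nM"


def triplet_f1(predicted, gold):
    """Return (precision, recall, F1) under exact-match scoring (an assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: three curated triplets, two extracted, one of them correct.
gold = [
    Triplet("ProteinA", "CCO", "IC50=12 nM"),
    Triplet("ProteinA", "CCN", "IC50=40 nM"),
    Triplet("ProteinA", "CCC", "Ki=5 nM"),
]
predicted = [
    Triplet("ProteinA", "CCO", "IC50=12 nM"),  # exact match
    Triplet("ProteinA", "CCN", "IC50=4 nM"),   # wrong value, so no match
]

p, r, f1 = triplet_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=0.50 recall=0.33 F1=0.40
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, and under exact-match scoring a single wrong element (protein, structure or value) costs the whole triplet.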
Practical value is demonstrated through three applications:
- 82,262 data points extracted from 11,683 papers: a pre-training data set that improves downstream models by 3.9%
- Human-in-the-loop NLRP3 workflow: doubled the number of high-quality bioactivity records, delivered a 38.6% improvement across 28 QSAR models and identified 16 hit candidates with novel scaffolds
- PoseBusters annotation: 5.59× faster than manual work, with 5.75% better accuracy
Commercial pharmaceutical value
For pharmaceutical companies this is more than an academic paper: it bears directly on the preclinical workflow. Less time spent on data curation means more time for actual medicinal chemistry, and larger training data sets mean more accurate QSAR models and better lead compound selection. The identification of novel scaffolds for NLRP3 (a target linked to inflammatory diseases) is a concrete example of how the tool can contribute directly to the drug candidate pipeline.
Frequently Asked Questions
- What is the problem of manual data mining in drug discovery?
- Pharmaceutical companies and academic researchers must manually read thousands of papers to extract bioactivity data: IC50, Ki and Kd values together with ligand structures. Curating a single publication can take hours, and the literature grows faster than curators can read it. BioMiner automates this extraction.
- What does multimodality mean in BioMiner?
- The system simultaneously interprets text (experimental descriptions), tables (numerical bioactivity values) and images (molecular structures, including Markush structures). All three modalities are necessary because bioactivity data are distributed across different forms of representation in scientific publications.
- What is the pharmaceutical value?
- In a human-in-the-loop pilot project, BioMiner doubled the number of high-quality NLRP3 data points, delivered a 38.6% improvement across 28 QSAR models and identified 16 hit candidates with novel scaffolds, a direct input to the drug discovery pipeline.