arXiv:2605.22763: AI agent with Lean verification solves 9 open Erdős problems and 44 OEIS conjectures
A team of 20 researchers from DeepMind and MIT CSAIL published the first large-scale evaluation of LLMs for autonomous generation of formal proofs in the Lean theorem prover. The agent combines LLM generation with Lean symbolic verification and autonomously solves 9 of 353 open Erdős problems and proves 44 of 492 OEIS conjectures.
This article was generated using artificial intelligence from primary sources.
An arXiv preprint published May 21, 2026, presents the first large-scale evaluation of LLMs for autonomously generating formal mathematical proofs in the Lean theorem prover, applied to open Erdős problems and conjectures from the On-Line Encyclopedia of Integer Sequences (OEIS). A team of 20 researchers, including members from DeepMind and MIT CSAIL, showed that an advanced agent autonomously solves 9 of 353 open Erdős problems and proves 44 of 492 OEIS conjectures.
What is Lean and why is it central to this approach?
Lean is a proof assistant: a programming language in which mathematical statements are expressed as types and proofs as terms of those types, which Lean checks automatically. Unlike informal mathematical text, which can contain subtle errors that slip through peer review, a Lean proof either compiles (and then establishes the statement as formalized) or does not (and is rejected). This leaves no room for human error in checking the proof itself, although someone still has to confirm that the formal statement matches the intended problem.
This property is crucial for an AI system. LLMs can generate mathematical text that looks convincing but contains errors; without automatic verification, human mathematicians must manually check every proof, which is a bottleneck. With Lean, the system generates candidate proofs and Lean verifies them automatically: if a candidate does not compile, the system iterates; if it compiles, the proof is correct for the statement as formalized.
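To make the compile-or-reject behavior concrete, here is a minimal Lean 4 sketch. The theorem names are illustrative and do not come from the preprint; the statements use only Lean's core library.

```lean
-- A true statement with a valid proof: Lean accepts it.
-- `rfl` works because both sides evaluate to the same number.
theorem two_add_two : 2 + 2 = 4 := rfl

-- Commutativity of addition on natural numbers,
-- proved by citing the core lemma `Nat.add_comm`.
theorem add_comm_nat (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A false statement has no valid proof. Uncommenting the next line
-- makes the file fail to compile, so the "proof" is rejected.
-- theorem two_add_two_bad : 2 + 2 = 5 := rfl
```

The point of the sketch is that acceptance is decided by the checker alone, which is what lets an agent use Lean as an automatic judge of its own output.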
What is the Erdős problem set and why is it significant?
Erdős problems are a set of open mathematical questions formulated by Paul Erdős (1913–1996) during his career. They cover discrete mathematics, number theory, combinatorics, graph theory, and extremal combinatorics. Many carry cash prizes that Erdős promised for solutions (from $25 to $10,000). The Erdős Problems project maintains a list of approximately 800 such problems, of which the authors selected 353 that could be formulated in Lean.
Of these 353 problems, the autonomous agent solves 9 (2.5 percent), all of which the authors classify as “lower-tier” Erdős problems that yield to structural arguments or to exhaustive search of a sufficiently small space. “Lower-tier” does not mean trivial: the problems had been open for decades, but they did not demand the kind of combinatorial insight that the agent lacks. The results were coordinated with the Erdős Problems administrators, who confirmed them independently.
How does the agent combine LLM generation and Lean verification?
The agent has a cyclic architecture with four steps.
- Step 1: The LLM (the authors specify an internal DeepMind frontier model variant with formal-math fine-tuning) reads the Lean problem formulation and generates a candidate proof, starting from a hypothesis about the proof structure.
- Step 2: The agent compiles that candidate with Lean. If it compiles, the agent returns success; if not, Lean returns a specific error (e.g., “unknown identifier,” “type mismatch,” “tactic failed”).
- Step 3: The agent feeds that error back to the LLM with an instruction to iterate.
- Step 4: If 5 iterations fail, the agent decomposes the problem into smaller lemmas and attempts to solve them separately.
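The four steps can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `prove`, `CheckResult`, the `max_depth` limit, and the toy `generate`/`check`/`decompose` stand-ins are all hypothetical names introduced here, and a real system would call an LLM and the Lean compiler where the stubs are. Only the limit of 5 attempts comes from the article.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_ATTEMPTS = 5  # the article: decompose after 5 failed iterations


@dataclass
class CheckResult:
    ok: bool
    error: Optional[str] = None  # e.g. "type mismatch", "tactic failed"


def prove(
    statement: str,
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str, str], CheckResult],
    decompose: Callable[[str], List[str]],
    depth: int = 0,
    max_depth: int = 2,  # assumption: the article gives no recursion limit
) -> Optional[str]:
    """Return accepted proof text for `statement`, or None if the search fails."""
    error: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        candidate = generate(statement, error)   # Step 1 (and Step 3 on retries)
        result = check(statement, candidate)     # Step 2: verifier accepts or rejects
        if result.ok:
            return candidate
        error = result.error                     # Step 3: feed the error back

    # Step 4: split into lemmas and try each one separately.
    if depth >= max_depth:
        return None
    lemmas = decompose(statement)
    if not lemmas:
        return None
    proofs = []
    for lemma in lemmas:
        proof = prove(lemma, generate, check, decompose, depth + 1, max_depth)
        if proof is None:
            return None
        proofs.append(proof)
    # Assumption: how lemma proofs are recombined is not described in the
    # article, so this sketch simply joins them.
    return "\n".join(proofs)


# Toy stand-ins so the sketch runs without an LLM or a Lean installation.
def toy_generate(statement: str, last_error: Optional[str]) -> str:
    # The first attempt is deliberately wrong; after an error it is "repaired".
    return "by sorry" if last_error is None else "by rfl"


def toy_check(statement: str, candidate: str) -> CheckResult:
    if candidate == "by rfl":
        return CheckResult(ok=True)
    return CheckResult(ok=False, error="tactic failed")


def toy_decompose(statement: str) -> List[str]:
    return []  # the toy problem has no useful lemmas


proof = prove("2 + 2 = 4", toy_generate, toy_check, toy_decompose)
print(proof)  # prints "by rfl" after one failed attempt and one repair
```

Passing the generator, verifier, and decomposer in as functions keeps the loop itself independent of any particular model or toolchain, which is why the toy stand-ins are enough to exercise it.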
The authors emphasize that the agent was not performing unconstrained search: the Lean compiler’s feedback structured the search, so that work that would take a human months was completed by the agent in hours. A typical solved Erdős problem required 200–500 LLM calls and 3–12 hours of wall-clock time on an 8×H100 setup.
What is OEIS and what are the results there?
OEIS (On-Line Encyclopedia of Integer Sequences) is a database of more than 380,000 integer sequences with descriptions, formulas, and conjectures. Many conjectures in OEIS are formulated as “this sequence is probably generated by formula F, but that has not been proven.” The authors selected 492 such conjectures and let the agent attempt to formally prove each.
The agent proved 44 (8.9 percent), and the authors again coordinated with OEIS maintainers so that the results could be included in the official records. Most proved conjectures concern closed forms for recursive sequences or auxiliary identities arising from already-proved larger results. Conjectures that eluded the agent mostly require a combinatorial bijection or structural argument the agent did not discover autonomously.
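As an illustration of what a “closed form for a recursive sequence” looks like once formalized, here is a small Lean 4 sketch. The sequence is a made-up toy, not an OEIS entry or a result from the preprint, and the names `Sketch.a` and `a_closed_form` are hypothetical. It assumes a recent Lean 4 toolchain in which the `omega` tactic is available without Mathlib.

```lean
namespace Sketch

-- Toy recursive sequence: a(0) = 3, a(n+1) = a(n) + 2.
def a : Nat → Nat
  | 0 => 3
  | n + 1 => a n + 2

-- Conjectured closed form: a(n) = 2n + 3, proved by induction on n.
theorem a_closed_form (n : Nat) : a n = 2 * n + 3 := by
  induction n with
  | zero => rfl
  | succ k ih =>
    -- Unfold one step of the recursion, then finish with linear
    -- arithmetic, using the induction hypothesis `ih : a k = 2 * k + 3`.
    show a k + 2 = 2 * (k + 1) + 3
    omega

end Sketch
```

Real OEIS conjectures of this kind are usually much harder, but the shape is the same: a recursive definition, a proposed formula, and a proof that Lean can check.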
What does this mean for mathematical research?
The authors do not claim that the AI agent replaces mathematicians. They claim instead that a working assistant now exists that can handle the “low-hanging fruit” in proof formalization, freeing researchers to focus on problems that require human creativity. Next steps include developing agents that can propose new conjectures based on pattern recognition, and integrating the agent with Mathlib, Lean's mathematical library (90,000+ formalized theorems), for a richer reference frame.
Frequently Asked Questions
- What is Lean and why is it used?
- Lean is a proof assistant: a programming language in which mathematical statements are expressed as types and their proofs are verified automatically. Unlike informal mathematical text, a Lean proof either compiles (accepted) or does not (rejected), with no room for human error in checking the proof.
- What is an Erdős problem?
- Erdős problems are a set of open mathematical questions formulated by Paul Erdős during his career, covering discrete mathematics, number theory, combinatorics, and graph theory. Many have been open for decades and carry monetary prizes for solutions.
- What is the scope of this result?
- 9 of 353 open Erdős problems (2.5 percent) and 44 of 492 OEIS conjectures (8.9 percent) is a significant result for an autonomous AI system, but it is far from settling either collection: most problems remain open and require mathematical intuition the agent does not possess.