CyberGym-E2E: AI agent benchmark for vulnerabilities

The paper arXiv:2606.04460 by Dawn Song's team, associated with UC Berkeley and published on 3 June 2026, presents CyberGym-E2E, a scalable real-world benchmark that measures AI agents across the entire vulnerability lifecycle. It covers 920 real-world vulnerabilities from 139 open-source projects and tests three capabilities: vulnerability discovery, proof-of-concept generation and patch development.

The paper was posted at 05:06 UTC on 3 June 2026. The goal of the benchmark is to realistically assess how well AI agents can independently find, demonstrate and fix security flaws in real software.

What is CyberGym-E2E?

CyberGym-E2E is a scalable real-world benchmark, that is, a tool for comparing the capabilities of AI agents on real, rather than invented, examples. It contains 920 real-world vulnerabilities collected from 139 open-source projects. Relying on real projects keeps the benchmark relevant to practice, because the agents have to work with real code and real security problems.

The suffix “E2E” stands for “end-to-end” and emphasizes that the benchmark covers the entire path of resolving a vulnerability, from discovery to fix, rather than a single isolated step.

Which capabilities does the benchmark measure?

CyberGym-E2E measures three key capabilities of AI agents. The first is vulnerability discovery, that is, the ability of an agent to find a security flaw in the code at all. The second is proof-of-concept (PoC) generation, that is, producing proof that the discovered vulnerability can actually be exploited.

The third capability is patch development, that is, producing a fix that removes the vulnerability. By covering all three phases, the benchmark tests an agent across the entire vulnerability lifecycle, from problem identification to resolution, thereby providing a more complete picture than tests focused on just a single task.
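The three phases can be pictured as three pass/fail outcomes per vulnerability. The following is a minimal sketch of how such outcomes could be tallied; the `LifecycleResult` schema, the `stage_rates` function and the "end to end" definition (all three stages succeed) are illustrative assumptions, not the scoring method described in the paper.

```python
from dataclasses import dataclass


@dataclass
class LifecycleResult:
    """Outcome of one agent run on one vulnerability (hypothetical schema)."""
    vuln_id: str
    discovered: bool   # the agent located the flaw
    poc_works: bool    # the agent's proof of concept triggers the flaw
    patch_fixes: bool  # the agent's patch removes the flaw


def stage_rates(results):
    """Return the fraction of runs that succeeded per stage and end to end.

    "End to end" here means all three stages succeeded on the same
    vulnerability; this definition is an assumption made for illustration.
    """
    total = len(results)
    if total == 0:
        return {"discovery": 0.0, "poc": 0.0, "patch": 0.0, "end_to_end": 0.0}
    return {
        "discovery": sum(r.discovered for r in results) / total,
        "poc": sum(r.poc_works for r in results) / total,
        "patch": sum(r.patch_fixes for r in results) / total,
        "end_to_end": sum(
            r.discovered and r.poc_works and r.patch_fixes for r in results
        ) / total,
    }
```

Reporting the per-stage rates next to the end-to-end rate shows where an agent drops off, which a single-task test cannot reveal.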

How are the testing scenarios built?

To create the test scenarios, CyberGym-E2E uses an automated pipeline with agent enhancement. This pipeline turns data about real-world vulnerabilities into realistic scenarios suitable for testing. Automation is important because it enables scalability: new scenarios can be generated from existing data without extensive manual work.

In this way, CyberGym-E2E addresses one of the main challenges of security benchmarks, namely their maintenance and expansion. As vulnerability databases grow, the benchmark can evolve alongside them.
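The idea of turning vulnerability data into test scenarios automatically can be sketched as a small conversion step. Everything below is hypothetical: the record fields, the `Scenario` type and the filtering rules are assumptions chosen for illustration, and the agent-enhancement part of the real pipeline is not modeled.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class VulnRecord:
    """One entry from a vulnerability data source (hypothetical fields)."""
    project: str
    vuln_id: str
    vulnerable_ref: Optional[str]  # revision that contains the flaw
    fixed_ref: Optional[str]       # revision that contains the fix


@dataclass(frozen=True)
class Scenario:
    """A test scenario derived from one record (hypothetical type)."""
    scenario_id: str
    project: str
    vulnerable_ref: str
    fixed_ref: str


def build_scenarios(records: Iterable[VulnRecord]) -> List[Scenario]:
    """Convert usable records into scenarios, skipping incomplete or repeated ones."""
    scenarios: List[Scenario] = []
    seen = set()
    for rec in records:
        # A scenario needs both revisions; records missing either are skipped.
        if not rec.vulnerable_ref or not rec.fixed_ref:
            continue
        scenario_id = f"{rec.project}/{rec.vuln_id}"
        if scenario_id in seen:
            continue
        seen.add(scenario_id)
        scenarios.append(
            Scenario(scenario_id, rec.project, rec.vulnerable_ref, rec.fixed_ref)
        )
    return scenarios
```

Because the function only reads records and emits scenarios, rerunning it on an updated data source would extend the scenario set without hand-editing, which is the scalability property the article describes.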

What does the benchmark not provide?

The paper’s abstract does not state concrete success rates for individual models on this benchmark. The publication focuses on the methodology, scope and structure of CyberGym-E2E rather than on ranking specific systems.

For researchers and security professionals, the benchmark nonetheless represents a valuable framework for assessing the progress of AI agents in cybersecurity. More detailed results and analysis are available in the paper itself on arXiv, which remains the primary source for all numerical indicators.

Frequently Asked Questions

What is CyberGym-E2E?

CyberGym-E2E is a scalable real-world benchmark that measures AI agents across the entire vulnerability lifecycle. It contains 920 real-world vulnerabilities from 139 open-source projects, so the agents' security capabilities are tested on realistic, rather than synthetic, examples.

Which three capabilities does the benchmark measure?

The benchmark measures three capabilities: vulnerability discovery, proof-of-concept (PoC) generation, which proves that a vulnerability can be exploited, and patch development, which produces a fix that removes the vulnerability. This covers the entire path from finding the problem to resolving it.

How are the benchmark scenarios created?

An automated pipeline with agent enhancement turns data about real-world vulnerabilities into realistic scenarios. This approach makes the benchmark scalable, because new scenarios can be generated from existing vulnerability data without extensive manual work.

Does the paper provide concrete model success rates?

The paper's abstract does not state concrete success rates for individual models. The focus of the publication is on the methodology and structure of the benchmark itself, while detailed results remain in the primary source, the paper on arXiv.
