arXiv:2605.04019: automated red teaming agent achieves 85% success rate against Meta Llama Scout with 45+ attacks and 450+ transformations
A new paper presents an agentic red teaming system built on the Dreadnode SDK that achieves an 85% success rate against Meta's Llama Scout using 45+ attacks, 450+ transformations and 130+ scorers, reducing security testing from weeks to hours without any hand-written code.
This article was generated using artificial intelligence from primary sources.
A new arXiv paper describes a system that fully automates offensive security testing of AI models. Authors Raja Sekhar Rao Dheekonda, Will Pearce and Nick Landers demonstrate how an agentic approach, built on the Dreadnode SDK, changes the economics of red teaming — security testing that previously required weeks of expert work is reduced to a few hours without a single line of hand-written attack code.
How does the agent replace weeks of manual work?
Red teaming in a security context is the process in which specialists systematically search for model weaknesses — from adversarial examples to jailbreak prompts and multimodal attacks. Conventionally it is done by teams who manually assemble and execute attacks one by one.
The proposed system instead uses a catalog of 45+ attacks, 450+ transformations and 130+ scorers that the agent combines autonomously. An operator sets a goal in natural language through a Terminal User Interface, and the agent selects vectors, applies variations and evaluates results.
What do the numbers against Llama Scout show?
In evaluation against Meta’s Llama Scout model, the agent achieves an 85% success rate with maximum severity rated at 1.0 by internal scorers. The entire cycle — from a stated goal to a comprehensive report — completes in the order of hours, not weeks as was previously typical for a similar scope of tests.
Critically, the agent operates without human-developed code: the entire adversarial workflow is generated from available components, removing the bottleneck of specialized red teaming engineers who are chronically scarce in the industry.
What does this change for security teams?
The agentic framework covers both traditional ML adversarial examples and generative AI jailbreaks within a single unified system — an approach that was previously fragmented across different tools. For enterprise security teams and AI labs that must continuously evaluate new models, this means the frequency of testing can increase significantly.
The paper joins a growing wave of research applying agentic automation to security disciplines, similar to how SOC analysts earlier began using AI assistants for incident triage. An open question remains how transferable the results are to closed commercial models with different safety filters — Llama Scout is an open-weight target that allows detailed instrumentation unavailable in API-only models.
Frequently Asked Questions
- What is red teaming in the context of AI systems?
- Red teaming is the process of controlled attacks against an AI system to discover security vulnerabilities — from classic adversarial examples to jailbreak prompts — before a real attacker exploits them.
- What does the agent do differently from manual red teaming?
- An operator specifies a goal in natural language through a Terminal User Interface, and the agent autonomously combines attacks, transformations and scorers from the Dreadnode catalog — no manual workflow assembly or custom code required.
- What does the 85% success rate mean?
- In 85% of tested attack scenarios, the agent succeeded in inducing unwanted behavior in Meta's Llama Scout model, with maximum severity rated at 1.0 by the system's scorers.
Related news
GitHub: Secret scanning via MCP server reaches GA — AI agents detect credentials before commit
ArXiv: Visual inputs bypass safety filters in vision-language models 40.9% of the time, ICML 2026 authors find
CNCF: immutable digest pinning, least-privilege tokens, and ephemeral runners — a recipe card for a more secure GitHub Actions pipeline