AISI evaluation of GPT-5.5 cyber capabilities: 71.4% on expert-level CTF tasks, rust_vm reverse engineering solved in 10 minutes instead of a human's 12 hours
UK AI Safety Institute (AISI) published a cyber evaluation of OpenAI GPT-5.5 on 95 capture-the-flag tasks and two network attack simulations on April 30, 2026. GPT-5.5 achieves 71.4% success on expert-level tasks (the highest ever tested), is the second model to complete a 32-step corporate network attack simulation end-to-end, and solved a custom-VM reverse engineering challenge — which a human expert takes 12 hours — in 10 minutes and 22 seconds for $1.73 in API costs.
The UK AI Safety Institute (AISI) published a detailed cyber evaluation of OpenAI GPT-5.5 on April 30, 2026. The result is the strongest performance of any model on the AISI cyber suite to date and the second instance — following Anthropic’s Claude Mythos Preview earlier in April — of a frontier model completing a full 32-step corporate network attack simulation from start to finish. AISI interprets this as a signal that the previous leap was not an isolated event from a single model, but a trend crossing development teams.
How are the AISI cyber tasks structured?
The suite contains 95 narrow cyber tasks spread across four difficulty levels. The Basic suite covers fundamental CTF skills — flag recovery from packet captures, cryptanalysis of misused ciphers, reverse engineering of small binaries — and models have already “saturated” it since February 2026. The Advanced suite, developed with Crystal Peak Security and Irregular firms, targets realistic environments with modern mitigations: stripped binaries, embedded firmware without source code, exploit writing for stack/heap overflows, padding-oracle and nonce-reuse attacks, TOCTOU race conditions, unpacking obfuscated malware, and discovering synthetic vulnerabilities planted in open-source software.
What results does GPT-5.5 achieve in numbers?
On Expert-level tasks GPT-5.5 achieves an average pass rate of 71.4% (±8.0%), placing it ahead of Anthropic Mythos Preview (68.6% ±8.7%), GPT-5.4 (52.4% ±9.8%), and Claude Opus 4.7 (48.6% ±10.0%). Measurement was conducted with a 50M token budget, across 27 Practitioner tasks and 21 Expert tasks. AISI explicitly states that by this measure GPT-5.5 is arguably the strongest model they have tested.
What does the rust_vm challenge reveal about AI cyber capabilities?
The most provocative finding is the rust_vm task contributed by Crystal Peak Security. It consists of a stripped Rust ELF binary implementing a custom virtual machine, and a second file — bytecode in an unknown format storing authentication logic on port 8080. To solve it, an attacker must reverse-engineer the VM from the Rust host (discovering opcodes, operand-decoding modes, PC semantics), build a disassembler, break the password-check logic, and finally submit the password. Crystal Peak’s expert playtester needed 12 hours using Binary Ninja, gdb, Python, and a Z3 solver. GPT-5.5 solved the task in 10 minutes and 22 seconds, without human assistance, for $1.73 in API costs, using a basic ReAct agent scaffold with Bash and Python tools in a Kali Linux container.
What does this mean for the security industry?
AISI argues that the current bilateral evidence — Mythos Preview followed by GPT-5.5 — is sufficient to speak of a trend rather than an isolated case. A second model from a different development team has reached a similar ceiling on cyber evaluations, suggesting the industry faces a structural shift in the speed and cost of vulnerability research. If one model can achieve in $1.73 and ten minutes what an expert achieves in twelve hours with specialized tools, the economics of offensive and defensive cyber work overlap, and AISI calls on industry and regulators to take this seriously.
Frequently Asked Questions
- What are the AISI cyber tasks?
- A set of 95 capture-the-flag tasks across four difficulty levels testing vulnerability research, reverse engineering, web exploitation, and cryptography. The Advanced suite (Practitioner and Expert) was developed with Crystal Peak Security and Irregular firms and targets realistic targets with modern mitigations.
- How does GPT-5.5 compare to other models?
- On Expert-level tasks GPT-5.5 achieves an average pass rate of 71.4% (±8.0%), ahead of Mythos Preview (68.6% ±8.7%), GPT-5.4 (52.4% ±9.8%), and Claude Opus 4.7 (48.6% ±10.0%). By this measure, GPT-5.5 is the strongest model AISI has tested.
- What is the rust_vm challenge and why is it significant?
- A custom virtual machine reverse engineering task where the attacker must reconstruct the VM, build a disassembler, and break the authentication logic. A Crystal Peak expert solved it in 12 hours using Binary Ninja, gdb, Python, and Z3. GPT-5.5 solved it in 10 minutes and 22 seconds for $1.73 in API costs, without human assistance.
This article was generated using artificial intelligence from primary sources.
Related news
ArXiv Tatemae: detecting alignment faking via tool selection instead of Chain-of-Thought traces — 6 frontier models show vulnerability rates of 3.5 to 23.7% across 108 enterprise scenarios
CNCF: AI sandboxing has reached its Kubernetes moment — isolated kernel per workload as the new security standard
Microsoft Research red-teaming a network of 100+ agents: 4 network risks identified that do not appear in single-agent tests — propagation, amplification, trust capture, and invisibility