What does AgentKernelArena measure and how does it differ from standard benchmarks?

AgentKernelArena measures AI coding agents on GPU kernel optimization tasks — specifically, how much they speed up a Triton or HIP kernel while preserving numerical correctness. Unlike abstract programming tests, every task has a measurable goal directly relevant to production compute environments.

Which agent achieves the best results on HIP kernel tasks?

GEAKv3 (AMD's own agent powered by Claude Opus 4.6) leads with an average speedup of 9.04x on the hip2hip category. Claude Code (Claude Opus 4.6) places second at 6.08x, and Cursor Agent (Claude Opus 4.6) third at 5.03x.

On which hardware platform were the agents tested?

All experiments were run on an AMD Instinct MI300X GPU with 192 GB of HBM3 memory, inside a ROCm 7.1.1 PyTorch container. Each agent had a time limit of 3,600 seconds per task and a maximum of 3 iterations.

AMD AgentKernelArena: GPU Optimization Benchmark

AMD Research published the open benchmarking framework AgentKernelArena on July 3, 2026, measuring how well AI coding agents optimize real GPU kernels. Across 214 tasks in four categories, AMD's own GEAKv3 (Claude Opus 4.6) leads with a 9.04x speedup on HIP kernels, while Claude Code (Opus 4.6) places second at 6.08x. All experiments were run on an AMD Instinct MI300X within ROCm 7.1.1.

Unlike standard programming benchmarks that test general coding ability, every task in AgentKernelArena has a concrete, measurable goal: the agent must take an existing GPU kernel and write a faster version that produces identical numerical results. GPU kernel optimization is a critical part of AI system development, because differences in operator performance directly affect model training costs and inference latency in production systems. The framework is designed for standardized, reproducible agent comparisons and has been released as an open project.

What Does AgentKernelArena Measure and How Are Results Scored?

The full collection contains 214 tasks distributed across four categories based on the type of kernel transformation. The triton2triton category covers 148 tasks and measures an agent’s ability to optimize an existing Triton kernel. The hip2hip category contains 36 tasks focused on HIP kernel optimization. The torch2hip category includes 26 tasks in which the agent rewrites PyTorch operations into an equivalent HIP kernel. The repository-scale category contains 4 tasks that simulate work at the level of entire code repositories. The reported evaluation used a representative subset of 44 tasks.
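As a quick check of the distribution above, the category counts can be tallied and converted to shares. This is only an arithmetic sketch over the figures quoted in the article; the dictionary name and the label `repository-scale` are illustrative choices, not identifiers from the benchmark.

```python
# Task counts per category, as quoted in the article.
TASKS = {
    "triton2triton": 148,
    "hip2hip": 36,
    "torch2hip": 26,
    "repository-scale": 4,  # label chosen here for illustration
}
EVALUATED_SUBSET = 44  # tasks used in the reported evaluation

total = sum(TASKS.values())
shares = {name: round(100 * count / total, 1) for name, count in TASKS.items()}
subset_share = round(100 * EVALUATED_SUBSET / total, 1)

for name, count in TASKS.items():
    print(f"{name:<17} {count:>3} tasks  {shares[name]:>5.1f}%")
print(f"{'total':<17} {total:>3} tasks")
print(f"evaluated subset: {EVALUATED_SUBSET} tasks ({subset_share}% of the collection)")
```

The counts add up to the stated 214, with Triton tasks making up roughly two thirds of the collection and the evaluated subset covering about a fifth of it.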

Scoring is three-tiered. Compilation awards up to 20 points: the kernel must compile without errors. Correctness awards up to 100 points: the optimized kernel must produce the same numerical results as the reference implementation across all test cases. The speedup component is the speedup ratio (the optimized kernel's speed relative to the original's, equivalently original runtime divided by optimized runtime) multiplied by 100, so a larger speedup contributes more to the total score. The structure intentionally rewards actual performance improvement and not just correctness: a kernel that runs correctly but delivers no speedup, or even degrades performance, scores lower than one that accelerates computation.
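The three tiers can be sketched in a few lines of Python. The point values come from the article, but how the tiers combine is not spelled out there, so this sketch assumes they are additive, all-or-nothing, and gated (no correctness points without compilation, no speedup points without correctness). The names `KernelResult`, `speedup`, and `score` are hypothetical and are not taken from the AgentKernelArena code.

```python
from dataclasses import dataclass

COMPILE_POINTS = 20    # from the article
CORRECT_POINTS = 100   # from the article
SPEEDUP_WEIGHT = 100   # from the article: speedup ratio multiplied by 100


@dataclass
class KernelResult:
    """Outcome of one agent attempt (hypothetical structure)."""
    compiled: bool
    correct: bool
    baseline_ms: float   # runtime of the reference kernel
    optimized_ms: float  # runtime of the agent's kernel


def speedup(result: KernelResult) -> float:
    """Speedup ratio: baseline runtime divided by optimized runtime."""
    return result.baseline_ms / result.optimized_ms


def score(result: KernelResult) -> float:
    # ASSUMPTION: tiers are additive, all-or-nothing, and gated.
    if not result.compiled:
        return 0.0
    total = float(COMPILE_POINTS)
    if not result.correct:
        return total
    total += CORRECT_POINTS
    total += SPEEDUP_WEIGHT * speedup(result)
    return total


if __name__ == "__main__":
    faster = KernelResult(True, True, baseline_ms=2.0, optimized_ms=1.0)
    unchanged = KernelResult(True, True, baseline_ms=2.0, optimized_ms=2.0)
    slower = KernelResult(True, True, baseline_ms=2.0, optimized_ms=4.0)
    for label, r in [("faster", faster), ("unchanged", unchanged), ("slower", slower)]:
        print(f"{label:<10} speedup={speedup(r):.2f}x score={score(r):.0f}")
```

Under these assumptions a correct kernel with a 2× speedup scores 320, an unchanged kernel scores 220, and a correct kernel that runs twice as slowly scores 170, which matches the ordering the article describes.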

Six Agents on AMD Instinct MI300X Within ROCm 7.1.1

Six agent configurations were tested with different combinations of agent framework and underlying language model. AMD’s own agent GEAKv3 was used with Claude Opus 4.6. Cursor Agent was tested with three models: Claude Opus 4.6, GPT-5.3 Codex, and Composer 2. Claude Code was tested with Claude Opus 4.6 and Claude Sonnet 4.6. All agents were given identical conditions: a time limit of 3,600 seconds per task and a maximum of 3 iterations per attempt.
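The test matrix described above can be written down as plain data. The following is a sketch only: the `AgentConfig` class and its field names are hypothetical and do not reflect the framework's actual configuration format; the agent names, models, and limits are the ones quoted in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    """One agent/model pairing (hypothetical structure)."""
    agent: str
    model: str
    time_limit_s: int = 3600  # per-task time limit from the article
    max_iterations: int = 3   # per-attempt iteration cap from the article


CONFIGS = [
    AgentConfig("GEAKv3", "Claude Opus 4.6"),
    AgentConfig("Cursor Agent", "Claude Opus 4.6"),
    AgentConfig("Cursor Agent", "GPT-5.3 Codex"),
    AgentConfig("Cursor Agent", "Composer 2"),
    AgentConfig("Claude Code", "Claude Opus 4.6"),
    AgentConfig("Claude Code", "Claude Sonnet 4.6"),
]

# Every configuration shares the same limits, so results stay comparable.
same_limits = len({(c.time_limit_s, c.max_iterations) for c in CONFIGS}) == 1

for c in CONFIGS:
    print(f"{c.agent:<13} {c.model:<18} {c.time_limit_s}s, {c.max_iterations} iterations")
```

Keeping the limits as defaults on the data class makes it harder for one configuration to drift from the shared budget unnoticed.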

All experiments were run on an AMD Instinct MI300X GPU with 192 GB HBM3 memory inside a ROCm 7.1.1 PyTorch container (rocm/pytorch:rocm7.1.1_ubuntu24.04_py3.12_pytorch_release_2.10.0). The MI300X was chosen as the reference hardware platform because it represents the production standard for demanding AI inference and training workloads on AMD architecture.

GEAKv3 Leads, Claude Code Second on HIP Kernels

GEAKv3 (Claude Opus 4.6) takes a convincing first place across the reported categories: an average speedup of 9.04× on hip2hip tasks, 2.75× on triton2triton, and 1.20× on rocPRIM repository tasks. The advantage of AMD’s own agent is most pronounced on HIP kernel optimization, where its average speedup is roughly 1.5 times that of the second-place configuration (9.04× versus 6.08×).

Among standard frontier agents, Claude Code (Claude Opus 4.6) places second in the hip2hip category with a speedup of 6.08×. Cursor Agent with Claude Opus 4.6 is third at 5.03×. The GPT-5.3 Codex configuration achieves 3.06×, while Cursor with Composer 2 lands at 1.34×, only modestly ahead of the unoptimized reference kernel.
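To put the hip2hip figures side by side, the short sketch below ranks the reported average speedups and computes how far the leader is ahead of each configuration. The numbers are those quoted in the article; the variable names are illustrative, and the Claude Sonnet 4.6 configuration is left out because no hip2hip figure is given for it here.

```python
# Average hip2hip speedups as quoted in the article.
HIP2HIP = {
    "GEAKv3 (Claude Opus 4.6)": 9.04,
    "Claude Code (Claude Opus 4.6)": 6.08,
    "Cursor Agent (Claude Opus 4.6)": 5.03,
    "Cursor Agent (GPT-5.3 Codex)": 3.06,
    "Cursor Agent (Composer 2)": 1.34,
}

ranked = sorted(HIP2HIP.items(), key=lambda item: item[1], reverse=True)
leader_name, leader_speedup = ranked[0]

# Ratio of the leader's average speedup to each configuration's.
lead = {name: leader_speedup / s for name, s in HIP2HIP.items()}

for name, s in ranked:
    print(f"{name:<32} {s:>5.2f}x  leader/this = {lead[name]:.2f}")
```

The leader's margin over second place works out to about 1.49, and about 1.80 over third place.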

On triton2triton tasks the ranking shifts and differences are substantially smaller: Cursor (Opus 4.6) and Claude Code (Opus 4.6) are nearly tied at 1.96× and 1.95× respectively. A concerning finding comes from the GPT-5.3 Codex (0.99×) and Composer 2 (0.98×) configurations, which fall just below the reference baseline. Under these conditions, the kernels those models produce are on average slightly slower than the ones they started from.
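A speedup below 1.0× is easier to interpret as a change in runtime. The helper below is an illustrative sketch (the function name is hypothetical) that converts a speedup ratio into the percent change in runtime relative to the reference kernel. It treats the average speedup as if it applied to a single kernel, which is a simplification: an average of 0.99× does not imply that every kernel got slower.

```python
def runtime_change_pct(speedup: float) -> float:
    """Percent change in runtime versus the reference kernel.

    Negative values mean the optimized kernel is faster,
    positive values mean it is slower.
    """
    return (1.0 / speedup - 1.0) * 100.0


# Average triton2triton speedups as quoted in the article.
TRITON2TRITON = {
    "GEAKv3 (Claude Opus 4.6)": 2.75,
    "Cursor (Opus 4.6)": 1.96,
    "Claude Code (Opus 4.6)": 1.95,
    "GPT-5.3 Codex": 0.99,
    "Composer 2": 0.98,
}

for name, s in TRITON2TRITON.items():
    print(f"{name:<26} {s:.2f}x  runtime {runtime_change_pct(s):+.1f}%")
```

Read this way, 0.99× and 0.98× correspond to runtimes about 1% and 2% longer than the reference, while 1.96× corresponds to roughly half the original runtime.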

AgentKernelArena has been released as an open project, with all tasks and evaluation infrastructure available to the research and development community. The authors, an AMD Research team including Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, Vikram Appia, Zhenyu Gu, and Emad Barsoum, invite the community to expand the task collection and test new agent configurations.
