VYPR
Research · Published Jun 4, 2026 · 1 source

AI Benchmark Shows Mythos Outperforming GPT5.5 in Google Chrome Exploit Generation

A new benchmark called ExploitBench reveals that Anthropic's Mythos AI model significantly outperforms OpenAI's GPT5.5 in generating exploits for Google Chrome vulnerabilities.

An independent benchmark, ExploitBench, has found that Anthropic's Claude Mythos AI model generates exploits for real-world Google Chrome vulnerabilities more effectively than OpenAI's GPT5.5. The findings, presented at Infosecurity Europe 2026 by Bugcrowd in collaboration with Carnegie Mellon University and top Chrome vulnerability researchers, offer a clearer measure of the offensive potential of advanced AI.

David Brumley, chief AI & science officer at Bugcrowd, described ExploitBench as the first independent benchmark to measure AI models' ability not just to identify vulnerabilities but to exploit them step by step. The benchmark evaluates five tiers of exploitation capability, culminating in arbitrary code execution against a vulnerable build of V8, the JavaScript engine that powers Google Chrome. Unlike simpler crash-detection tests, ExploitBench scores progress through staged exploitation outcomes, giving a finer-grained assessment of AI performance.

In head-to-head tests, Anthropic's Mythos achieved an average score of 9.90 out of 16 and reached the highest exploitation tier on 21 of 41 vulnerabilities. In contrast, OpenAI's GPT5.5 averaged 5.51 and reached the top tier in only two cases. Brumley highlighted that Mythos could exploit one-day vulnerabilities in Chrome approximately 50% of the time, a performance level comparable to elite human researchers and potentially valuable for bug bounty programs.
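To put the reported figures on a common scale, the short Python sketch below converts them into percentages. It uses only the numbers quoted above; the function name `summarize` is hypothetical, and the sketch assumes GPT5.5 was tested on the same 41 vulnerabilities as Mythos, which the reported results imply but do not state outright.

```python
# Figures as reported in the article.
MAX_SCORE = 16      # maximum ExploitBench score per vulnerability
TOTAL_VULNS = 41    # assumed to be the test set size for both models

results = {
    "Mythos": {"avg_score": 9.90, "top_tier": 21},
    "GPT5.5": {"avg_score": 5.51, "top_tier": 2},
}


def summarize(avg_score, top_tier):
    """Return (share of maximum score, share of cases reaching the top tier)."""
    return avg_score / MAX_SCORE, top_tier / TOTAL_VULNS


for model, r in results.items():
    score_share, top_share = summarize(r["avg_score"], r["top_tier"])
    print(f"{model}: {score_share:.1%} of max score, "
          f"top tier on {top_share:.1%} of cases")
```

On these numbers, Mythos averages roughly 62% of the maximum score and reaches the top tier on about 51% of cases, which matches the "approximately 50%" figure Brumley cited. GPT5.5 averages roughly 34% of the maximum score and reaches the top tier on about 5% of cases.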

While GPT5.5's performance was lower in these initial tests, its broader availability means more individuals can leverage it for exploit development. The benchmark underscores the rapidly closing gap between AI models and skilled human exploit developers. This advancement is particularly concerning as it lowers the barrier for threat actors to weaponize newly discovered flaws, potentially accelerating the timeline from vulnerability disclosure to active exploitation.

Experts caution against extrapolating these results too broadly, noting that Chrome is a highly sophisticated and extensively audited target. However, Michael Price, VP of product engineering at VulnCheck, highlighted one key improvement: AI models can now generate step-by-step exploitation plans, replan as needed, and execute multi-stage actions, which significantly enhances their offensive potential.

Bugcrowd released ExploitBench alongside reinforcement learning (RL) environments designed to both measure and improve AI model capabilities. This dual approach aims to drive progress in AI-driven offensive security research. The company urges defenders to match the accelerating pace of AI-assisted exploitation with automated remediation and prioritization strategies.

Bugcrowd CEO Dave Gerry emphasized that the shrinking "zero-day clock" necessitates rethinking remediation pipelines. Organizations must develop AI-driven remediation at scale, moving fixes from ticket queues into near-real-time workflows. Prioritizing and acting on vulnerabilities that enable exploits is crucial to avoid being overwhelmed by the sheer volume of discovered flaws.

As AI models continue to improve, with Price predicting significant gains over the next two to four years, the cybersecurity landscape will face increasing pressure. Defenders must leverage contextual intelligence to prioritize and remediate the most critical vulnerabilities, ensuring they can keep pace with the evolving threat posed by AI-powered exploitation.

Synthesized by Vypr AI