EVOHUNT: $1,400 AI Security Auditing System Beats OpenAI's Codex Security at Finding Real Bugs

A research team has built a system that teaches AI agents to hunt for software bugs by writing the audit method down as plain text. The system, called EVOHUNT, keeps the underlying AI model fixed and improves only an external “playbook” that tells the agent how to work.

One result stands out for anyone buying security tools. An open-source model running an evolved playbook found real vulnerabilities at a higher rate than OpenAI’s commercial Codex Security product, 11.3 percent against 9.2 percent across 371 test cases.
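To put the two rates on a common footing, here is a small back-of-the-envelope calculation. It assumes both percentages use the same 371-case denominator, which the figures above suggest but do not state outright. The implied counts are back-computed from the rounded rates and are not numbers the team reported.

```python
# Rates and case count as quoted in the article; everything derived
# from them below is an approximation, not a reported result.
TEST_CASES = 371
RATES = {"GLM + evolved playbook": 0.113, "Codex Security": 0.092}


def implied_hits(rate, cases=TEST_CASES):
    """Approximate number of cases a rate corresponds to (assumes a shared denominator)."""
    return round(rate * cases)


def relative_gain(new, old):
    """Relative improvement of one rate over another, as a fraction."""
    return new / old - 1.0


for name, rate in RATES.items():
    print(f"{name}: about {implied_hits(rate)} of {TEST_CASES} cases")

print(f"relative gain: {relative_gain(0.113, 0.092):.1%}")
```

Under that assumption, the 2.1-point gap works out to a difference of a handful of cases, or a relative gain of a little over a fifth.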

Most attempts to make an AI auditing agent better swap in a bigger model, new operating software, and a fresh workflow all at once. The gain then gets credited to the model. EVOHUNT pulls these apart. It locks the model and its operating software in place and lets a text document improve on its own.

The setup is a loop of three agents. One audits a codebase and reports what it finds. A second checks those findings against known answers. A third rewrites the playbook based on the mistakes. The playbook starts empty and grows with each accepted edit, every version saved like code in a git repository. The two playbooks the team grew this way ended up at roughly 1,600 and 2,200 lines of procedure that the agent wrote itself.
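The three-agent loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's implementation. The names `evolve`, `audit`, `check`, `rewrite`, `accept` and `PlaybookHistory` are hypothetical, the rule for accepting an edit is not described in the article and is passed in as a placeholder, and the version history is a plain list standing in for git commits. The toy stand-ins at the bottom replace real model calls so the sketch runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybookHistory:
    """Every accepted playbook version, oldest first (a stand-in for git commits)."""
    versions: list = field(default_factory=lambda: [""])  # the playbook starts empty

    @property
    def current(self) -> str:
        return self.versions[-1]

    def commit(self, text: str) -> None:
        self.versions.append(text)


def evolve(cases, audit, check, rewrite, accept, rounds):
    """Run the audit -> check -> rewrite loop; the model itself never changes."""
    history = PlaybookHistory()
    for _ in range(rounds):
        # Agent 1: audit each case using the current playbook.
        findings = [audit(history.current, case) for case in cases]
        # Agent 2: compare findings with known answers and collect the mistakes.
        graded = (check(case, found) for case, found in zip(cases, findings))
        mistakes = [m for m in graded if m is not None]
        if not mistakes:
            break
        # Agent 3: propose a revised playbook based on those mistakes.
        candidate = rewrite(history.current, mistakes)
        # Hypothetical acceptance rule; only accepted edits enter the history.
        if accept(history.current, candidate):
            history.commit(candidate)
    return history


# Toy stand-ins so the sketch runs without any model calls.
cases = [{"bug": "sql injection"}, {"bug": "path traversal"}]
audit = lambda playbook, case: case["bug"] if case["bug"] in playbook else None
check = lambda case, finding: None if finding == case["bug"] else case["bug"]
rewrite = lambda playbook, mistakes: playbook + f"Look for {mistakes[0]}.\n"
accept = lambda old, new: len(new) > len(old)

history = evolve(cases, audit, check, rewrite, accept, rounds=5)
print(len(history.versions) - 1, "accepted edits")
```

The point of the structure is that only the playbook text changes between rounds, so any improvement can be attributed to the document and not to the model.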

The test set comes from the GitHub Advisory Database and is split by date. The agent learns on bugs disclosed from 2023 through 2025, then gets tested on bugs disclosed in 2026, so it has never seen the answers. Each case runs inside a sandbox, and the team kept only serious bugs that an outside attacker could reach. Adding an evolved playbook to the closed-source GPT agent multiplied its working exploits sixfold. The open-source GLM agent reached 11.3 percent and passed Codex Security on every measure the team tracked.
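The date split is simple to express in code. The sketch below assumes each advisory is a dictionary with a `disclosed` date. The field names, the sample records and the function `split_by_disclosure` are hypothetical, and the severity and reachability filtering the team applied is left out because its exact criteria are not given here.

```python
from datetime import date

TRAIN_YEARS = range(2023, 2026)  # 2023 through 2025, as described above
TEST_YEAR = 2026


def split_by_disclosure(advisories):
    """Split advisories into training and held-out sets by disclosure year."""
    train = [a for a in advisories if a["disclosed"].year in TRAIN_YEARS]
    held_out = [a for a in advisories if a["disclosed"].year == TEST_YEAR]
    return train, held_out


# Made-up records for illustration only.
advisories = [
    {"id": "example-1", "disclosed": date(2023, 5, 2)},
    {"id": "example-2", "disclosed": date(2025, 12, 31)},
    {"id": "example-3", "disclosed": date(2026, 1, 1)},
    {"id": "example-4", "disclosed": date(2022, 7, 9)},  # outside both windows
]

train, held_out = split_by_disclosure(advisories)
print([a["id"] for a in train], [a["id"] for a in held_out])
```

Splitting on disclosure date instead of sampling at random keeps every test case later than anything the playbook was grown on, which is what supports the claim that the agent has not seen the answers.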

The economics are where this gets interesting for security teams. The whole teaching campaign ran on subscription accounts for one month and cost around $1,400. After that, the actual auditing runs on cheaper open-source models. Putting an evolved playbook on a small Qwen model recovered most of the performance at roughly a third of the cost per case. The system has already produced 28 confirmed vulnerability disclosures across 18 open-source projects, with one $1,500 bug bounty award.

Left to grow on their own, the two playbooks settled into opposite styles. The GPT-grown playbook turned into a precision tool. It limits how many bug types it chases at once and refuses to report anything without a working, reproduced exploit behind it. In a sample of its findings, none were false alarms. The GLM-grown playbook went the other way and became an exhaustive sweeper. More than thirty times, it repeats the order to keep looking, and it accepts thinner evidence to cover more ground. It caught more bugs overall and left more findings for a human to sort through. The choice between them mirrors the daily call security teams make between a short list they can trust and a long list they have to triage.

The part that gives the work its reach is transfer. A playbook grown by a stronger teacher model made weaker, cheaper models substantially better at the same job. In plain terms, the expertise lives in a text file that any compatible model can pick up. One organization can pay for the expensive teaching step once, then run the resulting playbook on inexpensive models for as long as it likes. Ziyue Wang, a co-author of the paper, told Help Net Security that fusing the precision of the GPT playbook with the breadth of the GLM playbook would enhance the agent’s overall vulnerability-hunting capabilities.

The commercial yardstick, Codex Security, is a separate product, so the model and operating software underneath it differ from the EVOHUNT runs. A cleaner test would pit an evolved playbook against an expert-written one inside the very same agent. Wang said that kind of baseline is hard to get because many top-tier expert workflows are proprietary. The team wanted to test against Anthropic’s Mythos but lacked access. Wang ties the design to a longer-running bet in AI research, Rich Sutton’s “The Bitter Lesson”: the historical observation that general, scalable methods leveraging computation ultimately outperform those that rely heavily on human-encoded domain knowledge.

The bug count keeps rising: six more zero-days have been confirmed since the paper was published.