XBOW Tests Confirm Mythos AI Excels at Vulnerability Discovery but Falls Short on Cost and Judgment

Anthropic's Mythos AI model, announced in early April, has been touted for its ability to discover significantly more vulnerabilities than any other AI model. Independent testing by autonomous offensive security firm XBOW has now put those claims to the test, and the results are nuanced: Mythos is indeed a leap forward in certain contexts, but its high cost and mixed performance in other areas mean it is not a universal solution.

XBOW's testing confirmed Anthropic's primary claim: "Mythos Preview presents a significant step up over all existing models, regardless of provider." However, the firm found that the model's effectiveness depends heavily on the testing environment. When given both source code and a live, running environment, Mythos excelled at finding vulnerabilities. But when working with source code alone, its performance was notably weaker. As Gary McGraw noted two decades ago, operational defects arise from the interaction between code bugs and design flaws, and a higher-level understanding is required to catch design defects—something Mythos appears to leverage when given a live system.

Beyond raw discovery, XBOW evaluated Mythos's judgment, reverse engineering capabilities, and visual acuity. The model rejected false positives better than its predecessors, but it sometimes rejected true positives as well when the evidence did not formally satisfy its criteria. In reverse engineering tests, Mythos proved capable of triaging both its own results and those of competitors' models, and it could reason through unusual firmware and embedded systems contexts. Its visual acuity, meaning its ability to interact with live websites through a browser, was "practically effective" at selecting the right browser actions, though not perfectly pixel-accurate.

The most significant caveat is cost. Mythos Preview is a "true titan," and titans are expensive: Anthropic has said it will be five times as expensive as an Opus model. XBOW questioned whether a cheaper model given more time could achieve greater accuracy at lower cost. The answer was yes. When normalized by estimated running cost, Mythos Preview is not best-in-class on XBOW's benchmarks. For finding web vulnerabilities with a fixed token budget, Mythos outperforms Opus 4.6 but is outperformed by GPT5.5.

XBOW's overall assessment is that Mythos is extremely powerful for source code audits, strong in native-code vulnerability discovery and reverse engineering, but less powerful at validating exploits. Its judgment is mixed—it can be too literal and conservative, and it tends to overstate the practical relevance of its findings. "Mythos Preview is strong at finding candidate vulnerabilities, especially from source code, and shows impressive ability across web, native-code, and reverse-engineering tasks," XBOW concluded.

The findings come amid a surge of interest in AI-powered vulnerability discovery. Related reports have covered Mythos finding 271 vulnerabilities in Firefox but only one in curl, sparking debate among experts. The Cybersecurity and Infrastructure Security Agency (CISA) has also urged CISOs to prepare for accelerated AI threats, and firms like Sweet Security have launched agentic AI red-teaming services to counter the "Mythos moment." XBOW's testing provides a critical, independent benchmark that helps security teams understand where Mythos truly shines and where it does not.