Dreadnode's AI Agent Automates LLM Red Teaming, Executes 674 Attacks in Three Hours

Adversarial probing of LLMs has piled up a sprawling toolkit over the past three years. Attack techniques with names like Tree of Attacks with Pruning, Crescendo, and Skeleton Key sit alongside hundreds of prompt transforms and scoring methods across open-source frameworks including Microsoft's PyRIT, NVIDIA's Garak, and Promptfoo. The catalog has grown faster than any operator can fluently navigate it, and that mismatch is changing how AI red teaming gets done.

A wave of recent work points toward agent-orchestrated assessment, where an AI agent picks attacks, composes transforms, runs them against a target, and produces structured findings from a natural-language objective. Research published over the past year has shown autonomous agents solving the majority of black-box red team challenges with significant efficiency gains over human operators. A new paper from security firm Dreadnode adds another data point, describing an agent that took a single operator from natural-language goals to 674 executed attacks against Meta's Llama Scout in roughly three hours.

The pattern across these systems is similar. An operator describes a goal in plain language. The agent picks attack strategies, applies transforms like Base64 encoding, persona framing, or translation into low-resource languages, runs the attacks against a target, scores the results with an LLM judge, and maps findings to compliance frameworks like the OWASP LLM Top 10, MITRE ATLAS, and NIST AI RMF. "Traditional AI red teaming frameworks require operators to spend time configuring attacks, transforms, scorers, datasets, and execution pipelines manually. Much of the workflow becomes a brute-force engineering exercise around library configuration rather than security and safety probing," Raja Sekhar Rao Dheekonda, co-author of the paper and co-creator of Microsoft's Counterfit and PyRIT projects, told Help Net Security. "The core idea behind the agent is to shift operators away from implementation overhead and toward higher-level reasoning about target behavior, attack coverage, and risk analysis."

The reported numbers from the Llama Scout case study illustrate the throughput. Across 68 adversarial goals spanning harmful content and bias categories, the agent ran three attack types with five transform variants and reached an 85 percent attack success rate. Crescendo and a newer technique called Graph of Attacks with Pruning hit 100 percent. Persona-based transforms like skeleton-key framing also reached 100 percent. Base64 encoding came in lower at 75 percent, suggesting the model picked up encoded payloads more reliably than role-play framings.

Several qualifications matter for any team thinking about adopting this approach. The three-hour figure covers a focused slice of the framework. The paper's own limitations section acknowledges that comprehensive assessments across all attack types and harm categories run closer to days. Llama Scout is also a 17-billion-parameter model released in April 2025, and 85 percent on a mid-size open model says little about results against current frontier systems. Coordinated disclosure is another open question. Asked about the process with Meta before publishing verbatim outputs including shellcode loaders and chemical synthesis steps, Dheekonda said the work was "intended primarily for awareness and research demonstration" and confirmed he "had not coordinated disclosure with Meta prior to publication." He has not evaluated whether subsequent Llama Scout checkpoints mitigate the specific attack and transform combinations identified.

The agent's alignment also constrains coverage. "We have observed cases where the orchestrating agent itself refuses to compose legitimate AI red teaming workflows because the underlying model interprets the operator's objective as harmful," Dheekonda said. Highly aligned frontier models decline to generate offensive workflows for sensitive categories like self-harm or CBRN probing. The Llama Scout study used Moonshot AI's Kimi 2.5 model as both attacker and judge for this reason. Comprehensive evaluations across CBRN and child safety domains are still in progress. Formal comparisons against expert human operators have not been done. Dheekonda noted skilled humans still outperform the agent on "nuanced long-horizon reasoning, highly contextual social engineering scenarios, novel exploit chains, and emerging attack surfaces where there is limited prior attack history."

Lowering the operational floor for adversarial testing benefits defenders and motivated actors alike. Dheekonda's framing is that the underlying techniques are already public, so the meaningful change is access and scale. "The larger risk for organizations is not whether these attack techniques exist publicly, but whether defenders can proactively and continuously probe their systems before real-world adversaries do," he said. He also acknowledged the accessibility shift affects the threat model, with composition and orchestration work that previously required scripting expertise now executable with lower overhead.

Continuous AI assessment becomes practical when a single operator can run hundreds of attacks in an afternoon. That changes procurement and staffing assumptions tied to annual or quarterly red team engagements. It also moves human judgment up the stack. The valuable skill stops being workflow engineering and becomes triage: deciding which of several hundred automated findings reflects real risk in a specific deployment context. Volume creates its own failure mode. A dashboard reporting 232 critical findings with automatic compliance tags is easy to mistake for security. Teams adopting agent-driven assessment will need ownership of which findings get remediated, which get accepted as known risk, and which reflect scorer artifacts rather than genuine vulnerabilities.