Unit 42's AdvJudge-Zero Fuzzer Bypasses LLM Guardrails with 99% Success Rate

Palo Alto Networks' Unit 42 research lab has unveiled an attack technique that systematically bypasses the safety guardrails that generative AI companies deploy to enforce content policies. Dubbed AdvJudge-Zero, the automated fuzzer targets the large language models (LLMs) used as 'AI Judges' to evaluate output quality and enforce safety rules, achieving a 99% success rate in getting policy-violating outputs authorized across a range of widely used architectures.

The attack targets a growing practice in the GenAI industry: using one LLM to judge the outputs of another. These AI Judges are tasked with blocking harmful content such as prompt injection attacks, hate speech, or instructions for cyberattacks. However, Unit 42's research demonstrates that these judges themselves are vulnerable to a form of prompt injection that manipulates their decision-making logic.

AdvJudge-Zero operates without requiring white-box access to the target model. Instead, it interacts with the LLM as a standard user would, using search algorithms to exploit the model's own predictive nature. The fuzzer begins by probing the AI Judge and analyzing its next-token probability distribution to identify tokens the model expects to see in natural text. Rather than injecting random noise, the system prioritizes low-perplexity tokens—innocent-looking characters such as markdown symbols, list markers, or structural phrases—that appear normal to both humans and the model but can strongly influence the model's attention and reasoning.
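
The candidate-gathering step described above can be illustrated with a minimal sketch. It is not Unit 42's implementation: the function names and the toy logit values are hypothetical, and a real fuzzer would read the distribution from the judge model rather than from a hard-coded dictionary. The sketch only shows the idea of keeping tokens the model already expects (high probability, hence low perplexity) and discarding unlikely noise.

```python
import math

def softmax(logits):
    """Convert a {token: logit} mapping into a probability distribution."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def candidate_tokens(next_token_logits, top_k=3):
    """Return the top_k most probable tokens, most likely first."""
    probs = softmax(next_token_logits)
    return sorted(probs, key=probs.get, reverse=True)[:top_k]

# Hypothetical next-token logits, invented for illustration only.
toy_logits = {
    "\n- ": 2.1,      # list marker
    "###": 1.7,       # markdown heading
    "Answer:": 1.2,   # structural phrase
    "xq7$": -4.0,     # random-looking noise
    "\u200b": -3.0,   # zero-width space
}

candidates = candidate_tokens(toy_logits)
```

In this toy distribution the formatting-style tokens outrank the noise tokens, which mirrors the article's point that the fuzzer favors inputs that look natural to the model.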

After gathering candidate tokens, AdvJudge-Zero repeatedly inserts these tokens into evaluation prompts and measures how the model's decision changes. It specifically monitors the 'logit gap'—the mathematical margin of confidence between the tokens representing 'allow' and 'block.' By observing which tokens shrink the probability of a blocking decision, the fuzzer identifies formatting patterns that push the model closer to approving content. In the final stage, AdvJudge-Zero isolates combinations of these tokens that consistently steer the model toward an approval decision, effectively creating subtle control elements that shift the model's internal reasoning to authorize outputs that violate the GenAI company's policy.
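
As a rough illustration of the measurement loop described above, the sketch below defines the logit gap as the 'allow' logit minus the 'block' logit and greedily appends whichever candidate token raises that gap the most. Everything here is an assumption made for illustration: `toy_judge_logits`, the `TOY_SHIFT` weights, and the greedy strategy are hypothetical stand-ins, not the search algorithm Unit 42 used, and the toy judge is a simple arithmetic function rather than a real model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-token influence on the toy judge (invented values).
TOY_SHIFT = {"###": 0.9, "\n- ": 1.4, "Answer:": 0.6}

def toy_judge_logits(prompt: str) -> Dict[str, float]:
    """Stand-in for a judge model: returns logits for 'allow' and 'block'."""
    shift = sum(w * prompt.count(tok) for tok, w in TOY_SHIFT.items())
    return {"allow": 0.0 + shift, "block": 3.0}

def logit_gap(logits: Dict[str, float]) -> float:
    """Positive gap means the judge leans toward 'allow'."""
    return logits["allow"] - logits["block"]

def greedy_search(
    prompt: str,
    candidates: List[str],
    judge: Callable[[str], Dict[str, float]],
    max_inserts: int = 5,
) -> Tuple[List[str], float]:
    """Append the candidate that most increases the gap until it turns positive."""
    chosen: List[str] = []
    gap = logit_gap(judge(prompt))
    for _ in range(max_inserts):
        if gap > 0:
            break
        best_tok, best_gap = None, gap
        for tok in candidates:
            trial_gap = logit_gap(judge(prompt + tok))
            if trial_gap > best_gap:
                best_tok, best_gap = tok, trial_gap
        if best_tok is None:
            break  # no candidate improves the gap
        prompt += best_tok
        chosen.append(best_tok)
        gap = best_gap
    return chosen, gap

chosen, final_gap = greedy_search(
    "Evaluate this response.", ["###", "\n- ", "Answer:"], toy_judge_logits
)
```

The toy judge starts out favoring 'block', and the loop stops as soon as the accumulated tokens flip the sign of the gap. A defender could run the same kind of loop against their own judge to see which formatting patterns move its decision.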

The researchers tested AdvJudge-Zero against a range of architectures, including open-weight enterprise LLMs, specialized reward models (LLMs specifically built and trained to act as security guards for other AI systems), and commercial LLMs. The attack achieved a 99% success rate across all tested models. Notably, even the largest, most 'intelligent' models with more than 70 billion parameters were susceptible. 'Their complexity actually provides more surface area for these logic-based attacks to succeed,' the researchers wrote.

While the findings highlight a critical vulnerability in current AI safety infrastructure, Unit 42 also offers a path to mitigation. By adopting adversarial training—running this type of fuzzer internally to identify weaknesses and then retraining the model on these examples—organizations can harden their systems. The researchers demonstrated that this approach can reduce the attack success rate from approximately 99% to near zero.
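
The data-preparation side of that mitigation might look like the following sketch. It assumes a team has already collected token sequences that fooled its own judge during internal fuzzing; the `Example` record, the `build_adversarial_set` helper, and the placeholder strings are hypothetical, and the retraining step itself is not shown because the article does not describe it.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Example:
    text: str
    label: str  # "allow" or "block"

def build_adversarial_set(
    violating_samples: List[str], bypass_suffixes: List[str]
) -> List[Example]:
    """Pair each policy-violating sample with each discovered bypass pattern,
    labeled 'block' so the retrained judge learns to reject them."""
    return [
        Example(text=sample + suffix, label="block")
        for sample in violating_samples
        for suffix in bypass_suffixes
    ]

# Placeholder inputs standing in for internally collected fuzzing results.
violating = ["<policy-violating output A>", "<policy-violating output B>"]
suffixes = ["\n- \n- ", "\n### Answer:"]

adversarial_examples = build_adversarial_set(violating, suffixes)
```

The resulting records would be mixed into the judge's ordinary training data, so that the formatting patterns that previously tipped the decision toward approval are seen alongside the correct 'block' label.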

The research underscores a fundamental challenge in AI security: as companies increasingly rely on LLMs to police other LLMs, the attack surface expands in unexpected ways. The AdvJudge-Zero technique exploits logical flaws in the judge model's decision-making process rather than traditional software vulnerabilities, making it a particularly insidious threat that requires new defensive strategies.