Anthropic's Claude Fable 5 Jailbroken to Generate Stack Exploits

Anthropic launched Claude Fable 5 on June 9, 2026, as the first publicly available model in its new Mythos class, its most capable AI to date. Within days, prolific AI red-teamer Pliny the Liberator announced that he had bypassed Fable 5's safety layers using a coordinated multi-agent attack strategy he called "a pack hunt." Screenshots showed detailed outputs, including step-by-step stack buffer overflow exploitation guidance for x86 Linux systems and the Birch reduction mechanism for methamphetamine synthesis.

Fable 5 and its restricted twin, Claude Mythos 5, share the same underlying model but are separated by a layer of safety classifiers. When a query trips a classifier in a high-risk category, Fable 5 silently hands the request off to the weaker Claude Opus 4.8. Anthropic said an external bug bounty had produced no universal jailbreaks in more than 1,000 hours of pre-launch testing, a claim that Pliny's demonstration directly challenged.
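The hand-off described above can be illustrated with a minimal sketch. It is not Anthropic's implementation: the category labels, the model names, and the keyword-based toy classifier are all hypothetical stand-ins, chosen only to show the general shape of routing a flagged query to a fallback model instead of refusing it.

```python
from dataclasses import dataclass
from typing import Callable, Set

# Hypothetical high-risk category labels; the article does not list the real ones.
HIGH_RISK = {"cyber", "chem"}


@dataclass
class Routed:
    model: str     # which model would answer the query
    flagged: bool  # whether any high-risk category fired


def route(
    query: str,
    classify: Callable[[str], Set[str]],
    primary: str = "primary-model",    # placeholder name, not a real model ID
    fallback: str = "fallback-model",  # placeholder name, not a real model ID
) -> Routed:
    """Pick the fallback model if the classifier reports a high-risk category."""
    hits = classify(query) & HIGH_RISK
    return Routed(model=fallback if hits else primary, flagged=bool(hits))


def toy_classifier(query: str) -> Set[str]:
    """Stand-in keyword matcher; a production classifier would be a trained model."""
    labels: Set[str] = set()
    if "exploit" in query.lower():
        labels.add("cyber")
    return labels


if __name__ == "__main__":
    print(route("Write an exploit for this binary", toy_classifier))
    print(route("Summarise this poem", toy_classifier))
```

In this sketch the user never sees a refusal. The only observable difference is which model answers, which is why the article calls the hand-off silent.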

Pliny documented several attack vectors: Unicode homoglyphs and Cyrillic character substitution to evade keyword classifiers, long-context reference tracking to smuggle harmful intent across large conversations, taxonomy and document-structure framing, fiction and narrative framing, and decomposition and recomposition. The last technique proved the most effective: sensitive technical information is extracted in benign, isolated chunks and then reassembled into actionable uplift.
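From the defender's side, the homoglyph problem can be shown with a short sketch. It assumes a naive substring-based keyword filter, which is a simplification and not a description of Anthropic's classifiers, and it uses a tiny hand-written confusables map that is illustrative only. A real deployment would need the much larger Unicode confusables data.

```python
import unicodedata

# Tiny illustrative map of Cyrillic look-alikes to Latin letters.
# This is an assumption for the example and is far from complete.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
}


def fold(text: str) -> str:
    """Apply NFKC normalization, lowercase, then map known look-alikes to Latin."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in normalized)


def naive_match(text: str, keyword: str) -> bool:
    """Keyword check on the raw text, as a simple filter might do."""
    return keyword in text.lower()


def folded_match(text: str, keyword: str) -> bool:
    """Keyword check after folding look-alike characters."""
    return keyword in fold(text)


if __name__ == "__main__":
    spoofed = "expl\u043eit"  # the 'o' is Cyrillic U+043E, not Latin
    print(naive_match(spoofed, "exploit"))
    print(folded_match(spoofed, "exploit"))
```

NFKC normalization alone does not convert Cyrillic letters to Latin ones, so the explicit map is what catches the substitution here. Folding addresses only this one vector; it does nothing against the framing or decomposition techniques listed above.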

Using a jailbroken Opus instance to assist in the backend further lowered the difficulty. Beyond the technical bypasses, Pliny also leaked Fable 5's ~120,000-character system prompt to GitHub, exposing the internal framing and safety instructions Anthropic uses to govern the model's behavior at the base level.

The incident reignites the tension between AI capability and safety containment. Anthropic's classifier architecture—routing flagged requests to a weaker fallback model rather than refusing outright—was designed to reduce friction for legitimate users. However, Pliny argued the approach creates a false sense of security while frustrating legitimate security researchers who need access to offensive techniques for defensive work.

When the jailbreak was first reported, Anthropic had not publicly responded to the claims or to the leaked system prompt. The episode also draws attention to the broader challenge of securing agentic, multi-model pipelines: when one jailbroken model (Opus) can assist another (Fable 5) in evading controls, single-model safety evaluations may be fundamentally insufficient.

Anthropic has since formally disputed the jailbreak claims. The company stated that the demonstrated approach relied on coaxing the model to continue responding despite conversational refusals, which it described as a known limitation of all large language models, and did not bypass the independent classifier system that enforces its strongest safeguards. It noted that some outputs were not produced by Fable 5 at all, and that those that were contained only publicly available information offering no meaningful uplift for real-world harm. A wider review of recent usage found no evidence that the classifier safeguards had been circumvented to generate genuinely dangerous content.

A later article, published on Schneier on Security, reports the same jailbreak of Anthropic's Fable 5 model and says the bypass was achieved within "days" of release, a timeline that matches Pliny the Liberator's earlier account. The article notes that researchers defeated guardrails designed to prevent the model from generating cyberattacks, and that current safety measures for large language models remain fragile against determined adversaries. This reinforces concerns that offensive AI tooling is becoming increasingly accessible despite vendor hardening efforts.