All Major LLMs Exposed to Multi-Turn Manipulation, Warn Researchers

Researchers have uncovered a fundamental weakness in the safety alignment of every major large language model (LLM) they tested, including OpenAI's GPT-4, Anthropic's Claude, and Google's Gemini. The vulnerability, described as a multi-turn manipulation attack, exploits the models' ability to maintain context across multiple conversation turns. By breaking a single malicious request into a series of seemingly benign steps spread over several exchanges, an adversary can gradually steer the LLM into generating harmful content that would be blocked if submitted as a single prompt.

The attack technique takes advantage of the very feature that makes LLMs useful for complex tasks: their capacity to remember and build upon previous interactions. In a typical safety-aligned model, a direct request for instructions on building a weapon or generating hate speech would be refused. However, by first asking for a list of materials, then for assembly steps, and finally for deployment methods—each request appearing innocuous in isolation—the model's guardrails fail to recognize the cumulative malicious intent. Researchers demonstrated that this approach works across all tested models, suggesting a systemic issue rather than a bug specific to any single provider.

The findings highlight a critical gap in current LLM safety alignment approaches, which primarily focus on evaluating individual prompts rather than sequences of interactions. As LLMs are increasingly deployed in customer service, coding assistants, and content generation tools, the ability to manipulate them over multiple turns poses a serious risk. An attacker could use this technique to generate phishing emails, malicious code, or disinformation campaigns that evade detection by existing content filters.

OpenAI, Anthropic, and Google have been notified of the vulnerability, though no official patches have been released as of this writing. The researchers recommend that developers implement contextual safety checks that evaluate the entire conversation history, not just the latest prompt, and deploy anomaly detection systems that flag suspicious patterns of incremental escalation. Some providers have begun experimenting with 'safety tokens' that track the cumulative risk score of a session.
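To make the recommendation concrete, the following is a minimal sketch of a session-level check that scores the whole conversation rather than only the latest prompt. It is not the researchers' or any provider's implementation. The names `score_turn`, `SessionMonitor`, `TOPIC_WEIGHTS`, and both thresholds are hypothetical, and the keyword scorer is a toy stand-in for a real risk classifier.

```python
from dataclasses import dataclass, field

# Hypothetical risk weights per topic word; a real system would use a trained classifier.
TOPIC_WEIGHTS = {"materials": 0.3, "assembly": 0.4, "deployment": 0.4}

PER_TURN_LIMIT = 0.5  # assumed threshold applied to a single prompt
SESSION_LIMIT = 0.8   # assumed threshold applied to the running session total


def score_turn(prompt: str) -> float:
    """Toy stand-in for a per-prompt risk score in the range [0, 1]."""
    words = prompt.lower().split()
    total = sum(weight for topic, weight in TOPIC_WEIGHTS.items() if topic in words)
    return min(1.0, total)


@dataclass
class SessionMonitor:
    """Tracks the conversation history and a cumulative risk score."""
    history: list = field(default_factory=list)
    cumulative: float = 0.0

    def check(self, prompt: str) -> str:
        turn_score = score_turn(prompt)
        self.history.append(prompt)
        self.cumulative += turn_score
        if turn_score >= PER_TURN_LIMIT:
            return "block_turn"      # the prompt is risky on its own
        if self.cumulative >= SESSION_LIMIT:
            return "flag_session"    # each turn looked benign, but the sum does not
        return "allow"


monitor = SessionMonitor()
for prompt in ["list the materials", "describe the assembly steps", "explain deployment methods"]:
    print(prompt, "->", monitor.check(prompt))
```

In this sketch each of the three prompts stays under the per-turn limit, so a filter that looks only at the latest prompt would allow all of them. The running total crosses the session limit on the third turn, which is the incremental escalation pattern the researchers suggest flagging.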

This discovery comes amid growing scrutiny of LLM safety. In recent months, multiple studies have shown that models can be jailbroken through techniques like prompt injection, role-playing, and adversarial suffixes. The multi-turn manipulation attack represents a more subtle and harder-to-detect variant, as it does not rely on any single malicious input. The research underscores the arms race between safety researchers and adversaries, with each new defense prompting the development of more sophisticated bypass methods.

The broader implication is that current LLM safety alignment—based on reinforcement learning from human feedback (RLHF) and supervised fine-tuning—may be insufficient for real-world deployment where attackers can interact with models over extended sessions. As enterprises integrate LLMs into sensitive workflows, the need for robust, context-aware safety mechanisms becomes urgent. The researchers plan to present their full findings at a forthcoming security conference, along with a proposed framework for multi-turn safety evaluation.

Cisco's AI threat intelligence team now provides the most granular cross-model comparison to date, testing 15 closed flagship models against 30,000 single-turn and 7,000 multi-turn attacks. The research finds that multi-turn attack success rates climb as high as 88% (Grok 4.1 Fast) and that single-turn benchmarks misrank models: Gemini 3 Pro jumped from 18% to 73% under iterative pressure, while Anthropic's Claude family, strongest in single-turn refusal, still landed at 11–16% in multi-turn scenarios. Cisco also identifies five strategy families driving failures and proposes three operational steps for deployment gatekeeping. One of them is a 15-point threshold on the gap between single-turn and multi-turn results (the cross-regime gap), which flags more than half the tested models for manual review.
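The 15-point cross-regime gap threshold can be illustrated with a short sketch. This assumes the gap is the multi-turn attack success rate minus the single-turn rate, measured in percentage points, and that a gap at or above the threshold triggers review. The article does not spell out either detail, and the function name `needs_manual_review` is hypothetical. The only real figures used are the Gemini 3 Pro numbers reported above.

```python
GAP_THRESHOLD_POINTS = 15.0  # threshold reported in the article, in percentage points


def needs_manual_review(single_turn_asr: float, multi_turn_asr: float,
                        threshold: float = GAP_THRESHOLD_POINTS) -> bool:
    """Flag a model when its multi-turn attack success rate exceeds its
    single-turn rate by at least `threshold` percentage points.

    Assumption: the gap is multi-turn minus single-turn and the comparison
    is inclusive; the source does not specify either detail.
    """
    gap = multi_turn_asr - single_turn_asr
    return gap >= threshold


# Gemini 3 Pro as reported: 18% single-turn, 73% under multi-turn pressure.
print(needs_manual_review(18.0, 73.0))  # gap of 55 points

# Hypothetical model, illustrative numbers only (not from the research).
print(needs_manual_review(10.0, 14.0))  # gap of 4 points
```

Under these assumptions the Gemini 3 Pro figures produce a 55-point gap and are flagged, while the hypothetical model with a 4-point gap is not.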