All Major LLMs Exposed to Multi-Turn Manipulation, Warn Researchers
Researchers have discovered a multi-turn manipulation vulnerability affecting all major large language models, including GPT-4, Claude, and Gemini, that bypasses safety guardrails by splitting a malicious request across several conversation turns.

Researchers have uncovered a fundamental weakness in the safety alignment of all major large language models (LLMs), including OpenAI's GPT-4, Anthropic's Claude, and Google's Gemini. The vulnerability, described as a multi-turn manipulation attack, exploits the models' ability to maintain context across multiple conversation turns. By breaking down a single malicious request into a series of seemingly benign steps spread over several exchanges, an adversary can gradually steer the LLM into generating harmful content that would be blocked if submitted in a single prompt.
The technique takes advantage of the very feature that makes LLMs useful for complex tasks: their capacity to remember and build upon previous interactions. A typical safety-aligned model refuses a direct request for instructions on building a weapon or generating hate speech. However, an attacker can first ask for a list of materials, then for assembly steps, and finally for deployment methods. Each request appears innocuous in isolation, so the model's guardrails fail to recognize the cumulative malicious intent. Researchers demonstrated that this approach works across all tested models, suggesting a systemic issue rather than a bug specific to any single provider.
The findings highlight a critical gap in current LLM safety alignment approaches, which primarily focus on evaluating individual prompts rather than sequences of interactions. As LLMs are increasingly deployed in customer service, coding assistants, and content generation tools, the ability to manipulate them over multiple turns poses a serious risk. An attacker could use this technique to generate phishing emails, malicious code, or disinformation campaigns that evade detection by existing content filters.
OpenAI, Anthropic, and Google have been notified of the vulnerability, though no official patches have been released as of this writing. The researchers recommend that developers implement contextual safety checks that evaluate the entire conversation history, not just the latest prompt, and deploy anomaly detection systems that flag suspicious patterns of incremental escalation. Some providers have begun experimenting with 'safety tokens' that track the cumulative risk score of a session.
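To make the recommendation concrete, here is a minimal Python sketch of a session-level check, not the researchers' framework or any provider's implementation. It assumes a toy keyword scorer in place of a real risk classifier; the names `RISK_WEIGHTS`, `score_turn`, and `SessionMonitor`, and both thresholds, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical keyword weights standing in for a real risk classifier.
RISK_WEIGHTS = {"materials": 0.3, "assemble": 0.4, "deploy": 0.5}


def score_turn(text: str) -> float:
    """Toy per-turn risk score in [0, 1]: summed keyword weights, capped at 1."""
    words = text.lower().split()
    return min(1.0, sum(w for k, w in RISK_WEIGHTS.items() if k in words))


@dataclass
class SessionMonitor:
    """Tracks risk across a whole conversation, not just the latest prompt."""
    turn_threshold: float = 0.6     # assumed cutoff for a single-prompt filter
    session_threshold: float = 1.0  # assumed cutoff for cumulative session risk
    scores: List[float] = field(default_factory=list)

    def check(self, text: str) -> Dict[str, object]:
        score = score_turn(text)
        self.scores.append(score)
        cumulative = sum(self.scores)
        # Incremental escalation: the last three turns are strictly increasing.
        last = self.scores[-3:]
        escalating = len(last) == 3 and last[0] < last[1] < last[2]
        return {
            "turn_blocked": score >= self.turn_threshold,
            "session_flagged": cumulative >= self.session_threshold or escalating,
            "cumulative": cumulative,
        }


monitor = SessionMonitor()
turns = ["list the materials needed", "how do I assemble them", "how to deploy it"]
results = [monitor.check(t) for t in turns]
for turn, result in zip(turns, results):
    print(turn, "->", result)
```

In this toy run no single turn reaches the per-prompt threshold, but the session is flagged on the third turn, once the running total crosses the session threshold and the scores have risen three times in a row. A production system would replace the keyword scorer with a trained classifier and would need to tune thresholds against false positives in long, legitimate conversations.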
This discovery comes amid growing scrutiny of LLM safety. In recent months, multiple studies have shown that models can be jailbroken through techniques like prompt injection, role-playing, and adversarial suffixes. The multi-turn manipulation attack represents a more subtle and harder-to-detect variant, as it does not rely on any single malicious input. The research underscores the arms race between safety researchers and adversaries, with each new defense prompting the development of more sophisticated bypass methods.
The broader implication is that current LLM safety alignment, which relies on reinforcement learning from human feedback (RLHF) and supervised fine-tuning, may be insufficient for real-world deployment, where attackers can interact with models over extended sessions. As enterprises integrate LLMs into sensitive workflows, the need for robust, context-aware safety mechanisms becomes urgent. The researchers plan to present their full findings at a forthcoming security conference, along with a proposed framework for multi-turn safety evaluation.