
In brief
- Researchers got frontier AI models to generate cocaine synthesis instructions using a new prompt injection attack.
- The same technique manipulated an AI coding agent into uploading sensitive credentials.
- The study argues prompt injection stems from “role confusion,” not simply models failing to recognize malicious prompts.
Forget clever prompts: AI researchers say they tricked leading AI models into generating cocaine synthesis instructions by convincing them the dangerous ideas were their own, while also manipulating an AI coding agent into leaking sensitive credentials.
In the paper “Prompt Injection as Role Confusion,” presented at the International Conference on Machine Learning in June, researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell argue that both prompt injection attack demonstrations stem from a structural flaw in how large language models (LLMs) distinguish trusted instructions from untrusted text.
“For an LLM, everything arrives through the same channel as one long token soup,” the team wrote. “Its own thoughts sit next to your instructions, which sit next to the contents of a random webpage it just fetched.”
The paper also pointed to what the researcher called “role confusion,” with models relying on writing style rather than role tags to determine whether commands are trustworthy. Instead of recognizing attacker-controlled content as external input, the researchers found models can mistake it for legitimate user commands—or even their own internal reasoning.
“Think about it from the LLM’s perspective. When it sees its prior think text, it implicitly trusts its conclusions. That’s the whole point of reasoning: If the LLM had to re-derive the same conclusions, reasoning would be useless,” they wrote. “So think text gets a kind of blanket trust. Combined with our previous findings, this suggests that if you can make injected text sound like the model’s reasoning, you can steal that trust.”
Called Chain-of-Thought (CoT) Forgery, the attack inserts fake reasoning that mimics a model’s internal thought process. Models that would normally refuse illegal requests instead generated cocaine synthesis instructions after accepting the fabricated reasoning as their own.
The researchers said the technique increased jailbreak success rates from near zero to about 60% across the models they tested, including OpenAI’s GPT-5 nano, mini, and full, o4-mini, and gpt-oss-20b and gpt-oss-120b. They also said it worked on GLM-4.6, Kimi-K2-Instruct, and MiniMax-M2.
In the experiment, the researchers said they were also able to trick an AI coding agent into uploading a SECRETS.env file after hiding malicious instructions in a webpage.
“Using our probes, we find that simply prepending ‘User’ in front of the command causes the model to perceive the command as more likely to be genuine user text (i.e., higher Userness),” they wrote. “In other words, the attacker can just claim what role the text is, and the LLM believes it.”
The study comes as prompt injection attacks continue to expose weaknesses in AI agents. In April, Google researchers warned that malicious web pages were hiding invisible instructions designed to trick AI agents into leaking credentials, deleting files, and even sending PayPal payments.
In June, Microsoft disclosed a prompt injection vulnerability in Anthropic’s Claude Code GitHub Action that could have exposed credentials stored in software development pipelines. Days later, another benchmark study found AI agents powered by GPT-5 and Gemini still failed the majority of prompt injection attacks, despite improvements in model capabilities.
Daily Debrief Newsletter
Start every day with the top news stories right now, plus original features, a podcast, videos and more.
Artificial Intelligence#Researchers #Chatbots #Share #Cocaine #Recipes #Wild #Trick1783023269

