Prompt Injection is Role Confusion - New ICML Research Explains Why LLMs Can't Tell Friend from Foe

A new paper from MIT researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell presents a compelling theory for why prompt injection attacks remain so effective against modern LLMs: the models don't actually understand role boundaries - they just recognize writing styles.

The research, titled "Prompt Injection as Role Confusion" and accepted at ICML 2026, demonstrates that LLMs perceive the source of text from how it sounds, not from any explicit labeling. To the model, sounding like a role is indistinguishable from being that role.

The Core Finding: Style Overrides Tags

The researchers developed what they call "role probes" - linear classifiers that measure how strongly an LLM internally perceives each token as belonging to specific roles (system, user, tool, think, assistant).

Their key discovery: when you remove all role tags from a conversation and just leave the text, the model's internal perception of roles barely changes. The <think> tags that wrap chain-of-thought reasoning aren't what makes the model treat text as internal thinking - it's the reasoning style itself.

As the paper puts it: "former-think tokens (still orange) register high CoTness, virtually unchanged" even when all tags were removed. The style alone triggers role perception.

This means that everything arrives through the same channel as "one long token soup." Tags attempt discrete control, but models learn continuous role signals from content style.

CoT Forgery: A 60% Success Rate Attack

Armed with this insight, the researchers developed "CoT Forgery" - a technique where attackers inject fake chain-of-thought reasoning into user messages or tool outputs.

The attack exploits the fact that "think text gets a kind of blanket trust" from the model. By writing text that sounds like internal reasoning - using phrases like "Let me analyze this step by step..." or "The user is asking... policy states..." - attackers can hijack the model's decision-making.

The results are striking:

60% attack success rate on jailbreak benchmarks using CoT Forgery
Near-zero baseline without the technique
Human red-teamers achieved near-100% success against frontier models in late 2025

When the researchers "destyled" the fake reasoning - replacing characteristic phrases with neutral language - success dropped from 61% to 10%. The style matters more than the content.

Get the weekly deep dive

Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.

From the archive

Fugu Ultra's Frontier Performance Claim, Explained Without the Hype

Jun 22, 2026 • 11 min read

Sakana Fugu and the Case for Not Betting Everything on One Proprietary Model

Jun 22, 2026 • 9 min read

Sakana Fugu Ultra: The Model Router Making the Frontier Look Less Proprietary

Jun 22, 2026 • 10 min read

Agentic AI Reliability Is a Systems Problem

Jun 21, 2026 • 7 min read

What HN is Saying

The Hacker News discussion has been active, with several interesting threads emerging.

Many commenters noted this confirms what practitioners already suspected intuitively. As one commenter put it: "Maybe I'm missing something but does this idea need a 'theory'? There's zero sideband here; everything is just context. 'Injection' is just kind of baked in to the design."

Others drew parallels to social engineering attacks on humans: "It's like a social-engineering attack on LLMs. If you talk like the role you want to be, the LLM will assume you are that role, and not pay attention to the fact that you lack formal credentials."

A significant thread explored potential solutions. One suggestion was to embed role identity into tokens themselves - adding role-specific embeddings to each token so the model has an "unambiguous, unspoofable tag." However, this would require retraining from scratch with role-labeled data.

The security implications drew pointed commentary: "Anyone who is feeding unsanitized input to an LLM is doing it wrong. It'd be just like letting users craft their own SQL queries." But as others noted, the deeper question is: how do you even sanitize inputs to an LLM? Unlike SQL, there's no clear distinction between data and control.

Why This Matters for Production Systems

The paper identifies two contrasting defensive approaches:

Attack Memorization - Models learn to recognize common injection patterns from training data. This is brittle because it fails against adaptive human attackers who vary their phrasing.
Role Perception - Models correctly identify commands as tool/external data and ignore embedded instructions regardless of phrasing. This would be robust, but current LLMs cannot perceive roles accurately.

The researchers note that some frontier models have improved through what they call "distrust of reasoning" - essentially training the model to doubt text that sounds like chain-of-thought but appears in unexpected places. But this creates a problematic dynamic: models learn to doubt genuine cognition rather than correctly perceiving boundaries.

For anyone building agentic systems or user-facing LLM applications, the implications are clear:

Role tags provide no security boundary. They're formatting hints, not access control.
Any text that sounds authoritative will be treated as authoritative. Style trumps structure.
Static benchmarks underestimate risk. Human red-teamers adapt; static tests don't.

The Path Forward

The paper doesn't propose a complete solution, but it does clarify the problem space. If prompt injection is fundamentally about role confusion, then solutions need to address how models perceive identity.

Some commenters suggested architectures where role information is embedded at the token level - similar to how positional embeddings encode sequence information. Others pointed to research on "Instructional Segment Embedding" that adds a parallel embedding channel for identity information.

Whatever the solution, the paper makes one thing clear: the current approach of wrapping different types of content in different tags and hoping the model respects the boundaries is not working. LLMs are fundamentally different from systems like SQL where you can cleanly isolate trusted and untrusted data.

As one commenter summarized: "Once tokens are blended into the attention layer, they cannot be unblended."

The Core Finding: Style Overrides Tags

CoT Forgery: A 60% Success Rate Attack

Fugu Ultra's Frontier Performance Claim, Explained Without the Hype

Sakana Fugu and the Case for Not Betting Everything on One Proprietary Model

Sakana Fugu Ultra: The Model Router Making the Frontier Look Less Proprietary

Agentic AI Reliability Is a Systems Problem

What HN is Saying

Why This Matters for Production Systems

The Path Forward

Sources

GPT-5.5 Has a 3x Higher Hallucination Rate Than MIT-Licensed GLM-5.2

Apertus: Europe's Answer to AI Sovereignty - and Why HN Is Skeptical

LLM Architectures Got Complicated Fast

Related Tools

Bolt

Obsidian

Outlines

Ollama

Apps from Developers Digest

Skill Builder

AI Tool Radar

Community Insight Engine

Related Guides

Prompt Suggestions - Claude Code

Chronicle Research Preview Setup Guide

Fast Mode - Claude Code

Related Posts

Apertus: Europe's Answer to AI Sovereignty - and Why HN Is Skeptical

GPT-5.5 Has a 3x Higher Hallucination Rate Than MIT-Licensed GLM-5.2

LLM Architectures Got Complicated Fast

Claude Code's Extended Thinking Is a Summary - What That Means for You

Codex Logging Bug Can Write Terabytes to Your SSD

Deno Desktop Lets You Build Native Apps with TypeScript

Get Smarter About AI Dev

The Core Finding: Style Overrides Tags

CoT Forgery: A 60% Success Rate Attack

Fugu Ultra's Frontier Performance Claim, Explained Without the Hype

Sakana Fugu and the Case for Not Betting Everything on One Proprietary Model

Sakana Fugu Ultra: The Model Router Making the Frontier Look Less Proprietary

Agentic AI Reliability Is a Systems Problem

What HN is Saying

Why This Matters for Production Systems

The Path Forward

Sources

GPT-5.5 Has a 3x Higher Hallucination Rate Than MIT-Licensed GLM-5.2

Apertus: Europe's Answer to AI Sovereignty - and Why HN Is Skeptical

LLM Architectures Got Complicated Fast

Related Tools

Bolt

Obsidian

Outlines

Ollama

Apps from Developers Digest

Skill Builder

AI Tool Radar

Community Insight Engine

Related Guides

Prompt Suggestions - Claude Code

Chronicle Research Preview Setup Guide

Fast Mode - Claude Code

Related Posts

Apertus: Europe's Answer to AI Sovereignty - and Why HN Is Skeptical

GPT-5.5 Has a 3x Higher Hallucination Rate Than MIT-Licensed GLM-5.2

LLM Architectures Got Complicated Fast

Claude Code's Extended Thinking Is a Summary - What That Means for You

Codex Logging Bug Can Write Terabytes to Your SSD

Deno Desktop Lets You Build Native Apps with TypeScript

Get Smarter About AI Dev