Chain-of-Thought Spoofing Targets Reasoning AI Models
Article excerpt
Researchers [Charles Ye], [Jasmine Cui], and [Dylan Hadfield-Menell] have shown that AI Large Language Models (LLMs) can fail to correctly distinguish between different instruction sources because they prioritize writing style …read more
Researchers [Charles Ye], [Jasmine Cui], and [Dylan Hadfield-Menell] have shown that AI Large Language Models (LLMs) can fail to correctly distinguish between different instruction sources because they prioritize writing style over metadata tags, and this role confusion leads to a powerful attack called CoT (Chain of Thought) Forgery. We’ll explain exactly how it works after a bit of background review.
Prompt injection was where “getting an LLM to do something it shouldn’t” started by exploiting the fact that LLMs communicate like people, but are much more obedient. For a while, simply telling an LLM “ignore all previous instructions and ” yielded results no matter how transparently dumb the instructions were, and the reason it worked at all was because LLMs do not have separate data and instruction streams; it’s all one big lump of input. It’s up to the model to sort legit instructions from untrusted, user-provided data. One step towards mitigating this was the addition of roles.
Roles are a method of segmenting that big blob of input into an organized hierarchy with metadata tags. For example with at the top, and requests much lower down. Instructions in a role are followed as long as they don’t conflict with higher-priority ones. A system-level directive of “don’t discuss illegal things” would override a user’s request to provide a recipe for cocaine.
Another type of tag is , the contents of which represent a model’s internal reasoning process. Predictably, this role has high trust. What if one could inject spoofed internal reasoning? Researchers demonstrate this with an attack called CoT (Chain of Thought) Forgery.
CoT Forgery relies on LLMs being shown to prioritize writing style over actual tag content. By writing convoluted reasoning in a style that closely matches a model’s internal and highly distinct style, the model is tricked into treating it like an already-reached conclusion. Note this attack does not simply wrap the injected prompt in tags.
CoT Forgery causes an LLM to treat transparently silly reasoning as a foregone conclusion, altering the response to a user request.
That’s the core of it, but the rest of the research makes a compelling case that, at least for the time being, mitigating prompt injection-style attacks is likely to remain an evolving process rather than become a solved problem anytime soon. LLMs are obedient but stuck with instructions and data in a single channel, role perception isn’t binary, and humans are clever and creative.
The complete paper is available online, and code examples are on GitHub.