In a recent post, Zvi described what he calls “The Most Forbidden Technique”:
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
The article specifically discusses this in relation to reasoning models and Chain of Thought (CoT): if we train a model not to admit to lying in its CoT, it might still lie in the CoT and just not tell us.
This seems like good advice, but something I read recently makes me wonder whether we can really avoid it.
You Can Tell an LLM How to Think
@latentmoss on Twitter found that you can make Gemini 2.5 think in Kaomojis:
I wondered if this was Gemini-specific, but I was easily able to get Claude to think in Spanish on demand:
This is a little surprising, but it’s still reasoning and the language is irrelevant. If I ask a sufficiently hard problem, Claude initially thinks about math despite being told not to. Maybe we can prompt Claude to think in different but equivalent ways, but can’t force it to think in unhelpful ways?
Except, with a slight tweak to the prompt, I’m able to get Claude to reason in a way that gives it a wrong answer.
Isn’t This Weird?
If there’s no optimization pressure on the CoT, why does the AI do what you want when you tell it how to think? For my math problem, the reward in training depends only on the final answer, not the reasoning. So why does Claude dutifully think about cute animals and then give me a wrong answer?
The Fundamental Problem
I think the problem here is that the method [M] that generates the final output [X] is the same method that generates the Chain of Thought. You might imagine an architecture where one model is trained to talk to the user and another is trained to do reasoning, but all of the reasoning models I’m aware of[1] do both with the same model.
Since all of the outputs come from the same model, if we train the model on normal outputs, we automatically apply the same optimization pressure to the reasoning outputs as well.
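As a rough illustration, here is a minimal REINFORCE-style sketch (assuming a Hugging Face-style causal LM; the names and structure are mine, not any lab’s actual training code). The reward depends only on the final answer, but the loss sums log-probabilities over every generated token, CoT included:

```python
import torch

def reinforce_loss(model, prompt_ids, cot_ids, answer_ids, reward):
    """REINFORCE-style loss for one sampled trajectory.

    The reward is computed only from the final answer, but the log-prob
    term sums over every generated token (CoT tokens included), so the
    same optimization pressure lands on the reasoning trace.
    """
    # Full sampled sequence: prompt, then chain of thought, then final answer.
    sequence = torch.cat([prompt_ids, cot_ids, answer_ids])
    logits = model(sequence.unsqueeze(0)).logits[0]

    # Log-probability of each token given the preceding context.
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    token_log_probs = log_probs.gather(1, sequence[1:].unsqueeze(1)).squeeze(1)

    # Only generated tokens (CoT + answer) contribute, but nothing here
    # distinguishes a CoT token from an answer token.
    generated = token_log_probs[len(prompt_ids) - 1:]
    return -reward * generated.sum()
```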
A Concrete Prediction
Imagine we’re training a model and we penalize it when it outputs the number 5 (or a similar emoji, or vague references to 5 things). My theory is that even if outputting 5s would be useful in the Chain of Thought to do the math, the model will still avoid talking about 5 in its thought process (and if we provide enough optimization pressure, it will learn to secretly think about 5 while not printing it to the user or in the CoT).
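As a sketch of what that setup could look like (this function is hypothetical, not from any real training pipeline): the penalty only looks at the user-visible output, yet the prediction is that the CoT drifts away from 5 anyway, because the same weights produce both.

```python
import re

def output_reward(visible_output: str, task_reward: float) -> float:
    """Hypothetical reward: task reward minus a penalty for mentioning 5
    in the user-visible output. The chain of thought is never scored here;
    the prediction is that it shifts away from 5 regardless, because the
    same model generates both the CoT and the visible output.
    """
    mentions_of_five = len(re.findall(r"\b(?:5|five)\b", visible_output, re.IGNORECASE))
    return task_reward - 1.0 * mentions_of_five
```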
What if we had two models?
If this problem is caused by using one model for both reasoning and output, could we avoid it by training two models?
The outer model talks to the user
The inner model does reasoning
We only train on the outer model’s ‘correctness’
This should reduce the optimization pressure for the inner model to hide its thoughts, since the only optimization pressure is to improve the outer model’s responses. You’d have to avoid training the outer model to report on misbehavior from the inner model (and training it to not talk about this might have problems of its own?).
I think the biggest issue here is efficiency:
Training this combo model would be much harder, since it’s hard to say which model is responsible when an output is right or wrong.
Running this model would likely be less efficient, since you need to switch models every time you hit a <think> or </think> token (a rough sketch of that loop is below).
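Here’s a minimal sketch of what that hand-off could look like at inference time (hypothetical code, assuming Hugging Face-style models and a tokenizer; a real implementation would batch and reuse the KV cache rather than re-encode the whole context every step):

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def generate_with_two_models(outer, inner, tokenizer, prompt, max_tokens=2048):
    """Alternate between two models: the outer model writes user-facing text,
    and control hands off to the inner model between <think> and </think>."""
    text = prompt
    current = outer  # start with the user-facing model
    for _ in range(max_tokens):
        ids = tokenizer(text, return_tensors="pt").input_ids
        next_id = int(current.generate(ids, max_new_tokens=1)[0, -1])
        if next_id == tokenizer.eos_token_id:
            break
        text += tokenizer.decode([next_id])
        # Every time a think tag opens or closes, we pay the cost of switching
        # to a different model and re-running it over the whole context.
        if text.endswith(THINK_OPEN):
            current = inner
        elif text.endswith(THINK_CLOSE):
            current = outer
    return text
```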
I’m curious if anyone has tried this?
Do we know that the examples of Gemini thinking in kaomoji and Claude speaking in Spanish, etc., are real?
I say that because ChatGPT doesn’t actually display its chain of thought to the user, so it’s possible Gemini and Claude don’t either. As I understand it, ChatGPT’s chain of thought is obfuscated into something more approachable before being shown to the user.
Claude shows the authentic chain of thought (unless the system flags the CoT as unsafe, in which case the user is shown an encrypted version). An announcement tweet makes it sound like Gemini does as well, but I couldn’t find anything definitive in the docs for that one.
I tried running this locally with Qwen3-4B just to be sure, and it doesn’t really follow the instructions, although it also doesn’t completely ignore them like it would if there was no optimization pressure on the CoT:
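For anyone who wants to try reproducing this, here’s a rough sketch of the local setup using the Hugging Face transformers library (the prompt and generation settings are my guesses, not necessarily what was used above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user",
     "content": "What is 17 * 23? Think only in Spanish inside your reasoning."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
# Inspect the <think>...</think> block to see how well the instruction was followed.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=False))
```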
I think there’s an irrelevant link in the last bullet point.
Fixed, thanks!
I mean, “don’t include the trick in the training corpus” is an obvious necessary condition for the trick to work, but it also can’t be sufficient (never mind that it’s not very realistic to begin with). If we could come up with the trick, a sufficiently smart AI can guess that the trick could be pulled on it, and can then decide whether it’s worth assuming it has been and deceiving us anyway.
I’m actually not sure if this matters for current LLMs, since training on text that talks about CoT interpretability would train the model to talk about CoT interpretability, not to understand how to exploit it (unless you gave it a bunch of examples too).
I don’t know if it’s that obvious, but then again I assume Zvi was talking about a broader class than just our current LLMs.