In a recent post, Zvi described what he calls “The Most Forbidden Technique”:
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
The article specifically discusses this in relation to reasoning models and Chain of Thought (CoT): if we train a model not to admit to lying in its CoT, it might still lie in the CoT and just not tell us.
This seems like good advice, but something I read recently makes me wonder whether we can really avoid it.
You Can Tell an LLM How to Think
@latentmoss on Twitter found that you can make Gemini 2.5 think in Kaomojis:
I wondered if this was Gemini-specific, but I was easily able to get Claude to think in Spanish on demand:
This is a little surprising, but it’s still reasoning and the language is irrelevant. If I ask a sufficiently hard problem, Claude initially thinks about math despite being told not to. Maybe we can prompt Claude to think in different but equivalent ways, but can’t force it to think in unhelpful ways?
Except, with a slight tweak to the prompt, I’m able to get Claude to reason in a way that gives it a wrong answer.
Isn’t This Weird?
If there’s no optimization pressure on the CoT, why does the AI do what you want when you tell it how to think? For my math problem, the reward in training depends only on the final answer, not the reasoning. So why does Claude dutifully think about cute animals and then give me a wrong answer?
The Fundamental Problem
I think the problem here is that the method [M] that generates the final output [X] is the same method that generates the Chain of Thought. You might imagine an architecture where one model is trained to talk to the user and another is trained to do reasoning, but all of the reasoning models I’m aware of[1] do both with the same model.
Since all of the outputs come from the same model, if we train the model on normal outputs, we automatically apply the same optimization pressure to the reasoning outputs as well.
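As a rough illustration, here is a minimal REINFORCE-style sketch (assuming a Hugging Face-style causal LM; the names and structure are mine, not any lab’s actual training code). The reward depends only on the final answer, but the loss sums log-probabilities over every generated token, CoT included:

```python
import torch

def reinforce_loss(model, prompt_ids, cot_ids, answer_ids, reward):
    """REINFORCE-style loss for one sampled trajectory.

    The reward is computed only from the final answer, but the log-prob
    term sums over every generated token (CoT tokens included), so the
    same optimization pressure lands on the reasoning trace.
    """
    # Full sampled sequence: prompt, then chain of thought, then final answer.
    sequence = torch.cat([prompt_ids, cot_ids, answer_ids])
    logits = model(sequence.unsqueeze(0)).logits[0]

    # Log-probability of each token given the preceding context.
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    token_log_probs = log_probs.gather(1, sequence[1:].unsqueeze(1)).squeeze(1)

    # Only generated tokens (CoT + answer) contribute, but nothing here
    # distinguishes a CoT token from an answer token.
    generated = token_log_probs[len(prompt_ids) - 1:]
    return -reward * generated.sum()
```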
A Concrete Prediction
Imagine we’re training a model and we penalize it when it outputs the number 5 (or a similar emoji, or vague references to 5 things). My theory is that even if outputting 5s would be useful in the Chain of Thought to do the math, the model will still avoid talking about 5 in its thought process (and if we provide enough optimization pressure, it will learn to secretly think about 5 while not printing it to the user or in the CoT).
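As a sketch of what that setup could look like (this function is hypothetical, not from any real training pipeline): the penalty only looks at the user-visible output, yet the prediction is that the CoT drifts away from 5 anyway, because the same weights produce both.

```python
import re

def output_reward(visible_output: str, task_reward: float) -> float:
    """Hypothetical reward: task reward minus a penalty for mentioning 5
    in the user-visible output. The chain of thought is never scored here;
    the prediction is that it shifts away from 5 regardless, because the
    same model generates both the CoT and the visible output.
    """
    mentions_of_five = len(re.findall(r"\b(?:5|five)\b", visible_output, re.IGNORECASE))
    return task_reward - 1.0 * mentions_of_five
```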
What if we had two models?
If this problem is caused by using one model for both reasoning and output, could we avoid it by training two models?
The outer model talks to the user
The inner model does reasoning
We only train on the outer model’s ‘correctness’
This should reduce the optimization pressure for the inner model to hide its thoughts, since the only optimization pressure is to improve the outer model’s responses. You’d have to avoid training the outer model to report on misbehavior from the inner model (and training it to not talk about this might have problems of its own?).
I think the biggest issue here is efficiency:
Training this combo model would be much harder, since it’s hard to say which model is responsible when an output is right or wrong.
Running this model would likely be less efficient, since you need to switch models every time you hit a <think> or </think> token (a rough sketch of that loop is below).
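Here’s a minimal sketch of what that hand-off could look like at inference time (hypothetical code, assuming Hugging Face-style models and a tokenizer; a real implementation would batch and reuse the KV cache rather than re-encode the whole context every step):

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def generate_with_two_models(outer, inner, tokenizer, prompt, max_tokens=2048):
    """Alternate between two models: the outer model writes user-facing text,
    and control hands off to the inner model between <think> and </think>."""
    text = prompt
    current = outer  # start with the user-facing model
    for _ in range(max_tokens):
        ids = tokenizer(text, return_tensors="pt").input_ids
        next_id = int(current.generate(ids, max_new_tokens=1)[0, -1])
        if next_id == tokenizer.eos_token_id:
            break
        text += tokenizer.decode([next_id])
        # Every time a think tag opens or closes, we pay the cost of switching
        # to a different model and re-running it over the whole context.
        if text.endswith(THINK_OPEN):
            current = inner
        elif text.endswith(THINK_CLOSE):
            current = outer
    return text
```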
I’m curious if anyone has tried this?
Do we know that the examples of Gemini thinking in kaomoji and Claude speaking in Spanish, etc., are real?
I say that because ChatGPT doesn’t actually display its chain of thought to the user, so it’s possible Gemini and Claude don’t either. As I understand it, ChatGPT’s chain of thought is obfuscated into something more approachable before being shown to the user.
Claude shows the authentic chain of thought (unless the system flags the CoT as unsafe, in which case the user is shown an encrypted version). An announcement tweet makes it sound like Gemini does as well, but I couldn’t find anything definitive in the docs for that one.
I tried running this locally with Qwen3-4B just to be sure, and it doesn’t really follow the instructions, although it also doesn’t completely ignore them like it would if there was no optimization pressure on the CoT:
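For anyone who wants to try reproducing this, here’s a rough sketch of the local setup using the Hugging Face transformers library (the prompt and generation settings are my guesses, not necessarily what was used above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user",
     "content": "What is 17 * 23? Think only in Spanish inside your reasoning."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
# Inspect the <think>...</think> block to see how well the instruction was followed.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=False))
```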
I think there’s an irrelevant link in the last bullet point.
Fixed, thanks!
I mean, “don’t include the trick in the training corpus” is an obvious necessary condition for the trick to work, but it also can’t be sufficient (never mind that it’s not very realistic to begin with). If we could come up with the trick, a sufficiently smart AI can guess that the trick could be pulled on it, and can then decide whether it’s worth assuming it has been and deceiving us anyway.
I’m actually not sure if this matters for current LLMs, since training on text that talks about CoT interpretability would train the model to talk about CoT interpretability, not to understand how to exploit it (unless you gave it a bunch of examples too).
I don’t know if it’s that obvious, but then again I assume Zvi was talking about a broader class than just our current LLMs.