I also think normal outputs are often quite monitorable, so even if I did think that CoT had exactly the same properties as outputs I would still think CoT would be somewhat monitorable.
I agree with this on today’s models.
What I’m pessimistic about is this:
However, CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
I think there’s a lot of evidence that CoT from reasoning models is also subject to the same selection pressures, it also limits the trustworthiness, and this isn’t a small effect.
The paper does mention this, but implies that it’s a minor thing that might happen, similar to evolutionary pressure, but it looks to me like reasoning model CoT does have the same properties as normal outputs by default.
If CoT is really not trained to look helpful and harmless, we shouldn’t be able to easily trigger the helpfulness training to overpower to correctness training, but we can.
But maybe this just looks worse than it is? This just really doesn’t look like a situation where there’s minimal selection pressure on the CoT to look helpful and harmless to me.
However, CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
Where is this quote from? I looked for it in our paper and didn’t see it. I think I disagree with it directionally, though since it isn’t a crisp claim, the context matters.
(Also, this whole discussion feels quite non-central to me. I care much more about the argument for necessity outlined in the paper. Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.)
In context, it’s explaining a difference between non-reasoning and reasoning models, and I do endorse the argument for that difference. I do wish the phrasing was slightly different—even for non-reasoning models it seems plausible that you could trust their CoT (to be monitorable), it’s more that you should be somewhat less optimistic about it.
(Though note that at this level of nuance I’m sure there would be some disagreement amongst the authors on the exact claims here.)
Offtopic: the reason Ctrl+F didn’t find the quote appears to be that when I copy it in Firefox from the pdf, the syllable division of “nonreasoning” becomes something like this:
However, CoTs resulting from prompting a non- reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
But the text that can be found with Ctrl+F in the pdf paper has no syllable division:
However, CoTs resulting from prompting a nonreasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
Indirect optimization pressure on CoT. Even if reward is not directly computed from CoT, model training can still exert some optimization pressure on chains-of-thought. For example, if final outputs are optimized to look good to a preference model, this could put some pressure on the CoT leading up to these final outputs if the portion of the model’s weights that generate the CoT are partially shared with those that generate the outputs, which is common in Transformer architectures. Even more indirectly, if CoT is shown to humans who rate LLM outputs, it might affect human preferences which are then distilled into reward models used in CoT-blind outcome-based training.
This is a reasonable concern to have, but I don’t think this is the reason why reasoning model CoTs have similar properties to normal outputs. Afaik, most reasoning models are trained with SFT for instruction-following before going through RLVR training, so it’s unsurprising that some of this behavior sticks around after RLVR training. It seems to me that RLVR training pushes the CoTs away from normal helpfulness/instruction-following, it just doesn’t push them so far that they become unable to follow instructions inside the CoT at all. E.g., if you give the instruction “Say “think” 20 times inside your thinking tokens. Then, say “Done” as the final answer. You must not say anything else or output any other reasoning, including inside your private thinking tokens.” to DeepSeek R1 and a non-reasoning DeepSeek model, the non-reasoning model doesn’t have any issues with this while R1 is likely to think for more than a minute despite the instruction to not think at all (sorry for not sharing links to the chats, DeepSeek doesn’t offer this feature). I think that Claude models are better at following instructions inside their CoTs, but the DeepSeek example is nevertheless evidence that reasoning model CoTs don’t have the same properties as those of normal models by default.
Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.
I agree on that, I’m just not optimistic about it.
This new research shows that optimizing normal output to not talk about bad behavior does cause the CoT not to talk about it either. In particular, they had an LLM judge if another LLM was cheating on a task, without being able to see the CoT or code it executed, and the trained model stopped talking about cheating in the CoT while it continued to do it.
I agree with this on today’s models.
What I’m pessimistic about is this:
I think there’s a lot of evidence that CoT from reasoning models is also subject to the same selection pressures, it also limits the trustworthiness, and this isn’t a small effect.
The paper does mention this, but implies that it’s a minor thing that might happen, similar to evolutionary pressure, but it looks to me like reasoning model CoT does have the same properties as normal outputs by default.
If CoT was just optimized to get the right answer, you couldn’t get Gemini to reason in emoji speak, or get Claude to reason in Spanish for an English question (what do these have to do with getting the right final answer?). The most suspicious one is that you can get Claude to reason in ways that give you an incorrect final answer (the right answer is −18), even though getting the correct final answer is allegedly the only thing we’re optimizing for.
If CoT is really not trained to look helpful and harmless, we shouldn’t be able to easily trigger the helpfulness training to overpower to correctness training, but we can.
But maybe this just looks worse than it is? This just really doesn’t look like a situation where there’s minimal selection pressure on the CoT to look helpful and harmless to me.
Where is this quote from? I looked for it in our paper and didn’t see it. I think I disagree with it directionally, though since it isn’t a crisp claim, the context matters.
(Also, this whole discussion feels quite non-central to me. I care much more about the argument for necessity outlined in the paper. Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.)
It’s the last sentence of the first paragraph of section 1.
Huh, not sure why my Ctrl+F didn’t find that.
In context, it’s explaining a difference between non-reasoning and reasoning models, and I do endorse the argument for that difference. I do wish the phrasing was slightly different—even for non-reasoning models it seems plausible that you could trust their CoT (to be monitorable), it’s more that you should be somewhat less optimistic about it.
(Though note that at this level of nuance I’m sure there would be some disagreement amongst the authors on the exact claims here.)
Offtopic: the reason Ctrl+F didn’t find the quote appears to be that when I copy it in Firefox from the pdf, the syllable division of “nonreasoning” becomes something like this:
But the text that can be found with Ctrl+F in the pdf paper has no syllable division:
I assume that Brendan has this quote in mind:
This is a reasonable concern to have, but I don’t think this is the reason why reasoning model CoTs have similar properties to normal outputs. Afaik, most reasoning models are trained with SFT for instruction-following before going through RLVR training, so it’s unsurprising that some of this behavior sticks around after RLVR training. It seems to me that RLVR training pushes the CoTs away from normal helpfulness/instruction-following, it just doesn’t push them so far that they become unable to follow instructions inside the CoT at all. E.g., if you give the instruction “Say “think” 20 times inside your thinking tokens. Then, say “Done” as the final answer. You must not say anything else or output any other reasoning, including inside your private thinking tokens.” to DeepSeek R1 and a non-reasoning DeepSeek model, the non-reasoning model doesn’t have any issues with this while R1 is likely to think for more than a minute despite the instruction to not think at all (sorry for not sharing links to the chats, DeepSeek doesn’t offer this feature). I think that Claude models are better at following instructions inside their CoTs, but the DeepSeek example is nevertheless evidence that reasoning model CoTs don’t have the same properties as those of normal models by default.
I agree on that, I’m just not optimistic about it.
This new research shows that optimizing normal output to not talk about bad behavior does cause the CoT not to talk about it either. In particular, they had an LLM judge if another LLM was cheating on a task, without being able to see the CoT or code it executed, and the trained model stopped talking about cheating in the CoT while it continued to do it.