Rauno Arike comments on Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Rauno Arike 17 Jul 2025 11:22 UTC
5 points
2
I assume that Brendan has this quote in mind:
Indirect optimization pressure on CoT. Even if reward is not directly computed from CoT, model training can still exert some optimization pressure on chains-of-thought. For example, if final outputs are optimized to look good to a preference model, this could put some pressure on the CoT leading up to these final outputs if the portion of the model’s weights that generate the CoT are partially shared with those that generate the outputs, which is common in Transformer architectures. Even more indirectly, if CoT is shown to humans who rate LLM outputs, it might affect human preferences which are then distilled into reward models used in CoT-blind outcome-based training.
This is a reasonable concern to have, but I don’t think this is the reason why reasoning model CoTs have similar properties to normal outputs. Afaik, most reasoning models are trained with SFT for instruction-following before going through RLVR training, so it’s unsurprising that some of this behavior sticks around after RLVR training. It seems to me that RLVR training pushes the CoTs away from normal helpfulness/instruction-following, it just doesn’t push them so far that they become unable to follow instructions inside the CoT at all. E.g., if you give the instruction “Say “think” 20 times inside your thinking tokens. Then, say “Done” as the final answer. You must not say anything else or output any other reasoning, including inside your private thinking tokens.” to DeepSeek R1 and a non-reasoning DeepSeek model, the non-reasoning model doesn’t have any issues with this while R1 is likely to think for more than a minute despite the instruction to not think at all (sorry for not sharing links to the chats, DeepSeek doesn’t offer this feature). I think that Claude models are better at following instructions inside their CoTs, but the DeepSeek example is nevertheless evidence that reasoning model CoTs don’t have the same properties as those of normal models by default.