I agree that some form of obfuscation is plausible in these scenarios, and that a setting where DeepSeek was trained against a strong CoT monitor plus outcome-based feedback would have been more interesting to study than math.
My intuition for why math is still interesting: AI companies (e.g. DeepSeek / OpenAI) will do A LOT of RL on math / programming. How likely encoded reasoning is to emerge is a function of how much RL you do on a given task and how strong the incentives for encoded reasoning are. Currently people don't seem to be doing much optimization against CoT in big reasoning RL, but they are doing lots of RL in settings where encoded reasoning may have been incentivized because English might be inefficient (in length and expressiveness). Therefore, the big reasoning RL tasks like math are among the tasks where encoded reasoning is most likely to be observed, if models learned encoded reasoning during RL.