This is my first assumption as well. If a model learns that certain words hurt its chances of earning the reward, it becomes less likely to output them. That's just convergence, and probably the simplest explanation of the phenomenon.
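To make the convergence idea concrete, here's a toy sketch of the mechanism I have in mind: a single-step REINFORCE-style policy over a tiny made-up vocabulary where one word is penalized by the reward. Everything here (the vocabulary, the reward function, the learning rate) is my own illustration, not anything from an actual training setup.

```python
# Toy illustration (assumed setup): REINFORCE on a categorical "policy" over
# three words, where the reward penalizes one of them. The penalized word's
# output probability gets driven toward zero -- the convergence described above.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["good", "neutral", "bad"]     # hypothetical vocabulary
logits = np.zeros(len(vocab))          # policy parameters (uniform at start)
lr = 0.5

def reward(token_id: int) -> float:
    # Assumed reward: -1 for the "bad" word, +1 otherwise.
    return -1.0 if vocab[token_id] == "bad" else 1.0

for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax policy
    token = rng.choice(len(vocab), p=probs)         # sample a word
    r = reward(token)
    # REINFORCE gradient of log-prob: one_hot(token) - probs, scaled by reward
    grad = -probs * r
    grad[token] += r
    logits += lr * grad                             # gradient ascent on reward

probs = np.exp(logits) / np.exp(logits).sum()
print({w: round(float(p), 3) for w, p in zip(vocab, probs)})
# The "bad" word's probability collapses toward zero under these toy assumptions.
```

Obviously a real RLHF/RL pipeline is far more involved, but the direction of the pressure is the same: words associated with lower reward get sampled less and less often.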