lukemarks comments on [Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks 31 Jul 2025 21:10 UTC
5 points
We haven’t experimented with models other than Qwen3-4B, but we will in follow-up work.
The concern about unconditionally down-weighting the penalized word was our primary motivation for trying the judge penalty. We agree that it is a strong reason to be skeptical of the regex penalty results in isolation. Our future work will focus only on more complex penalties, so we hope to gain more evidence on this soon.