This is interesting because it shows that RL in this setup fails to switch Qwen1.5B's reasoning to being illegible, which looks like good news given how heavily we train models with RL. That said, imho, even if illegible CoT had been elicited in this experiment, it should not be that dangerous, because the monitorability of CoT assumes legible text. Also, the motivation section conflates overt and covert encoded reasoning, which are important to distinguish because, again, illegible text should be flagged by a monitor. Finally, I agree that, given more compute or other methods, it might be possible to switch the reasoning to being illegible to us, since it is just one set of tokens or another.
I agree that overt and covert encoded reasoning are different, that a monitor should flag overt encoded reasoning, and that covert seems harder to elicit than overt. This is what I tried to point at by saying:
“I relax the ‘plausible benign’ constraint and try to find math reasoning that doesn’t look like math reasoning to a monitor.”