This is interesting because it shows that RL in this setup fails to switch Qwen1.5B's reasoning to being illegible, which looks like good news given how heavily we train models with RL. That said, imho, even if illegible CoT had been elicited in this experiment, it should not be that dangerous, because the monitorability of CoT assumes legible text. Also, the motivation section conflates overt and covert encoded reasoning, which are important to distinguish because, again, illegible text should be flagged by a monitor. Finally, I agree that, given more compute or other methods, it might be possible to switch the reasoning to being illegible to us, since it is just one set of tokens or another.
I agree that overt and covert encoded reasoning are different, that a monitor should flag overt encoded reasoning, and that covert seems harder to elicit than overt. This is what I tried to point at by saying:
“I relax the ‘plausible benign’ constraint and try to find math reasoning that doesn’t look like math reasoning to a monitor.”