You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
You’re mixing several notions of legibility and optimization pressure. OpenAI specifically applied SFT as part of Deliberative Alignment during safety training in a way that seems to have increased repeated spandrels.
I said that the degenerate looping behavior looked non-functional
The degenerate repeats themselves, yes, but I think you’re empirically wrong that the illegible text itself is non-functional. I don’t know what to do with disagreement on this point, as right now it’s empirical results vs. your personal private experiments with models.
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1’s legibility doesn’t actually matter one way or another?
Again you’re being technically imprecise. R1-Zero was explicitly noted to have gibberish, repetitions, mixed language, etc., to the point where they introduced an additional SFT warm start and other format and language penalties to get R1. R1 still has some remaining illegibility, as shown in Arun’s paper. (Note that this also isn’t an argument about some particular magnitude of effect.)
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3’s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
Again, you’re confused on this and mixing up several sources of optimization pressure and their effects. Here you’re being quite subtle about it, mixing “me being surprised that their SFT induced increased degradation in the form of repeated loops” and “RL applied directly to the CoT with HHH pressure causing obfuscation” (which is well established).
I’m willing to just give up on this exchange, since each time I’ve tried to make a pretty detailed series of specific points and I haven’t found you or 1a3orn willing to engage with them, instead continually conflating specific points into some caricatured view.
I agree, it no longer seems worth the time/effort.
If you’re interested, I just put up a new post in which I re-ran Jozdien’s experiment using a different R1 provider and got both much less illegibility and much higher task accuracy.