I have not seen either of you update on this, as evidenced by both of you in this thread still seeming to argue against the specific spandrel-dense illegibility in the final production o3. My expectation here is that you’re still overindexing on this: instead of updating toward the view that capabilities training via outcome-based RL results in significant semantic drift, you’re focusing on your own intuitions based on the later spandrel-heavy CoT, which I think is incorrect.
Ah, to clarify, when I talk about “o3-level illegibility” I am including the CoT samples shared in your metagaming paper in that bucket. They lack the degenerate looping behavior but still seem substantially less legible in other respects than CoTs from any other model I’ve worked with.
The observation in your paper is interesting but I don’t see how it implies updates to the views I’ve already stated; I said that the degenerate looping behavior looked non-functional, and indeed we agree that it probably is (and was not induced by RL), and meanwhile I still see a big legibility gap between pre-production o3 and other models I’ve observed, including R1-Zero[1].
That issue aside, I’m not sure how to resolve the remaining disagreement, or even what it actually is (i.e. what the cruxes are).
Largely because I don’t understand what predictions your view makes—or, equivalently, what hypothetical observations it rules out. I don’t have a clear sense of how your approach to the empirical evidence differs from the following (obviously problematic) rule:
If we observe that some model, M, produces illegible CoT --> that’s evidence for your view because you think this happens without countermeasures
If we instead observe that the same model, M, produces legible CoT --> well, they must have been using countermeasures (or we don’t know if they were, which is typically true), so this is at worst neutral for your view
(And thus, no possible set of observations about the CoTs of particular models could constitute counter-evidence to your view; or, equivalently, your view makes no predictions about the CoTs of particular models. I’m not saying that this is the case, necessarily, just asking for clarification about why it’s not the case, if it’s not)
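One way to state the worry formally is the standard probability identity sometimes called conservation of expected evidence. The symbols are my own labels, not anything from the thread: H is the view under discussion, E is "model M produces illegible CoT", and ¬E is "M produces legible CoT".

```latex
P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E)
\quad\Longrightarrow\quad
\bigl[\, P(H \mid E) > P(H) \,\bigr] \;\Rightarrow\; \bigl[\, P(H \mid \neg E) < P(H) \,\bigr]
\quad \text{whenever } 0 < P(E) < 1 .
```

So if illegible CoT from M counts as evidence for the view, then legible CoT from the same M has to count at least slightly against it; it cannot be merely neutral.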
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1’s legibility doesn’t actually matter one way or another?
You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
And meanwhile, all the other labs seem to be converging on substantially more legible CoT than pre-production o3. If none of that matters for your view—because it can all be explained away by invoking the possibility of unobserved and unobservable countermeasures (all of them more effective, apparently, than that one o3 countermeasure which we do know a bit more about[2]) -- then, like… what are we even doing, here? What’s the point of talking about all these specific observations if you would feel equally comfortable sticking with your prior position no matter what way those observations turn out?
(To clarify, I don’t think “no, it actually doesn’t matter to me how these observations turn out” is necessarily an invalid position to take. It is possible that we are just reading patterns into tea leaves here, that all of this evidence is too messy and partial and confounded to be worth considering—if that’s the case then we should notice it’s the case and stop looking at the tea leaves entirely. But that would mean actually doing that, i.e. agreeing that we simply have very little direct evidence about the matter, and dropping the dispute about who ought to update in which ways based on particular details about Anthropic’s mistakes, or about whether R1 is legible, or whatever.)
Incidentally, if you trust Arun’s paper, his measured illegibility rate for R1-Zero is not actually much higher than for R1! In his Table 2, the point estimates for R1-Zero and R1 are 4.59 and 4.30 respectively, as compared to e.g. 3.78 for QwQ and ~3.3 for the R1 distills. I am still pretty skeptical of the R1 result in this paper, but unless you share my skepticism, this ought to affect your view of how much the R1 countermeasures mattered.
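To make that comparison concrete, here is a minimal sketch that just subtracts the point estimates quoted above. The dictionary and function names are mine (hypothetical), the ~3.3 figure is treated as exactly 3.3 for illustration, and nothing here accounts for the uncertainty around each estimate.

```python
# Point estimates as quoted above from Table 2 of Arun's paper.
# Assumption: the "~3.3" for the R1 distills is treated as exactly 3.3.
scores = {
    "R1-Zero": 4.59,
    "R1": 4.30,
    "QwQ": 3.78,
    "R1 distills": 3.3,
}


def gap(a: str, b: str) -> float:
    """Difference in quoted illegibility score between models a and b."""
    return round(scores[a] - scores[b], 2)


# Gap attributable (on this reading) to the R1 countermeasures.
zero_vs_r1 = gap("R1-Zero", "R1")
# Gaps between R1 and other models, for scale.
r1_vs_qwq = gap("R1", "QwQ")
r1_vs_distills = gap("R1", "R1 distills")

print(zero_vs_r1, r1_vs_qwq, r1_vs_distills)
```

On these numbers, the R1-Zero vs. R1 gap is smaller than the gap between R1 and QwQ, which is the sense in which the R1 countermeasures look like they mattered less than one might expect.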
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3’s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
You’re mixing several notions of legibility and optimization pressure. OpenAI specifically applied SFT as part of Deliberative Alignment during safety training in a way that seems to have increased repeated spandrels.
I said that the degenerate looping behavior looked non-functional
The degenerate repeats themselves, yes, but I think you’re empirically wrong that the illegible text itself is non-functional. I don’t know what to do with disagreement on this point, as right now it’s empirical results vs. your personal private experiments with models.
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1’s legibility doesn’t actually matter one way or another?
Again, you’re being technically imprecise. R1-Zero was explicitly noted to have gibberish, repetitions, mixed language, etc., to the point where they introduced an additional SFT warm start and other format and language penalties to get R1. R1 still has some remaining illegibility, as shown in Arun’s paper. (Note that this also isn’t an argument about some particular magnitude of effect.)
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3’s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
Again, you’re confused on this and mixing up several sources of optimization pressure and their effects. Here you’re being quite subtle about it, mixing “me being surprised that their SFT induced increased degradation in the form of repeated loops” and “RL applied directly to the CoT with HHH pressure causing obfuscation” (which is well established).
I’m willing to just give up on this exchange, since each time I’ve tried to make a pretty detailed series of specific points, and I haven’t found you or 1a3orn willing to engage with them; instead, you keep conflating specific points into some caricatured view.
I agree, it no longer seems worth the time/effort.
If you’re interested, I just put up a new post in which I re-ran Jozdien’s experiment using a different R1 provider and got both much less illegibility and much higher task accuracy.