and I’d argue some evidence that in the case of o3 it significantly degraded CoT legibility
Is there a good reason to expect that o3′s degraded legibility was caused by deliberative alignment, rather than a bug in the RL process, a bad initialization, lots of RL pressure, or something else like that?
Good question, I originally thought the degradation was entirely from capabilities RL, but after getting access to pre safety training checkpoints, the CoT appears in these cases not to have the same extreme degree of repeated “disclaim disclaim illusions” type spandrels: https://www.lesswrong.com/posts/4hXWSw8tzoK9PM7v6/metagaming-matters-for-training-evaluation-and-oversight?commentId=FSXJ72KwyqoruaaLD. This doesn’t necessarily show that SFT has to impact legibility, but the fact that obvious cases of degradation haven’t prevented a technique from being applied in order to improve product safety makes me expect that will continue to happen in the future. You can imagine a case where Deliberative Alignment would’ve been blocked on the grounds that OpenAI couldn’t measure the effects of it on CoT monitorability, or that they believed it degraded human legibility, but those don’t seem to have been sufficient.
I think the ideal regime here would that (until they have other proxies), OpenAI in this example would say “okay, we’re relying on the CoT as a hold out detection mechanism for basically everything, thus we’ll apply the best alignment techniques we can which don’t intervene on the CoT”. If your interventions were downstream from the CoT (for example, you can imagine designing RL environments such that the model really does have to do in depth reasoning about the spec to choose the right answer), you’d at least have some argument that “okay our interventions maybe really did align the model”.[1]
This is still not rigorous and could fail in a bunch of ways, but here I’m just trying to argue that we should hold the line of “no you really do need a hold out detection mechanism”
Is there a good reason to expect that o3′s degraded legibility was caused by deliberative alignment, rather than a bug in the RL process, a bad initialization, lots of RL pressure, or something else like that?
Good question, I originally thought the degradation was entirely from capabilities RL, but after getting access to pre safety training checkpoints, the CoT appears in these cases not to have the same extreme degree of repeated “disclaim disclaim illusions” type spandrels: https://www.lesswrong.com/posts/4hXWSw8tzoK9PM7v6/metagaming-matters-for-training-evaluation-and-oversight?commentId=FSXJ72KwyqoruaaLD. This doesn’t necessarily show that SFT has to impact legibility, but the fact that obvious cases of degradation haven’t prevented a technique from being applied in order to improve product safety makes me expect that will continue to happen in the future. You can imagine a case where Deliberative Alignment would’ve been blocked on the grounds that OpenAI couldn’t measure the effects of it on CoT monitorability, or that they believed it degraded human legibility, but those don’t seem to have been sufficient.
I think the ideal regime here would that (until they have other proxies), OpenAI in this example would say “okay, we’re relying on the CoT as a hold out detection mechanism for basically everything, thus we’ll apply the best alignment techniques we can which don’t intervene on the CoT”. If your interventions were downstream from the CoT (for example, you can imagine designing RL environments such that the model really does have to do in depth reasoning about the spec to choose the right answer), you’d at least have some argument that “okay our interventions maybe really did align the model”.[1]
This is still not rigorous and could fail in a bunch of ways, but here I’m just trying to argue that we should hold the line of “no you really do need a hold out detection mechanism”