It may be very important to hold out some of the misbehavior detectors solely for evaluation, but if there is enough diversity, it seems plausible to me that we can just train against them all and be much more likely to get good behavior than obfuscated misbehavior.
I think the very important thing about having a holdout is that it is precisely what lets you know things like “more likely to get good behavior than obfuscated misbehavior”. Having a holdout set / measurement device is just an extremely core, standard thing that you need across ML, and I’m surprised how often posts argue that it might be fine not to have one. I can also imagine the current state for many proxies is “we’re not actually even sure if this is valid to use as a holdout, because it might be super correlated in important ways”, but giving up on having a holdout entirely seems like it should come with an acknowledgement that we’re in a pretty desperate “break glass in case of emergency” type of situation.
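To make the distinction concrete, here’s a minimal sketch (Python, with made-up detector names, not real tools) of what I mean by a holdout: train against one subset of detectors and never let the optimization see the others, so the held-out ones can still tell you whether you got good behavior or just obfuscated whatever the trained-against detectors could see.

```python
import random

# Hypothetical misbehavior detectors; the names are illustrative, not real tools.
detectors = [
    "cot_reward_hack_monitor",
    "output_deception_classifier",
    "tool_misuse_heuristic",
    "sandbagging_probe",
    "eval_awareness_probe",
    "alignment_faking_cot_monitor",
]

random.seed(0)
random.shuffle(detectors)

# Train against one half; hold the other half out purely for evaluation.
train_detectors = detectors[: len(detectors) // 2]
heldout_detectors = detectors[len(detectors) // 2:]

# The training loop may only ever query train_detectors for its penalty signal.
# If flagged rates drop on train_detectors but stay flat on heldout_detectors,
# that looks like obfuscation rather than genuinely better behavior.
print("train against:", train_detectors)
print("hold out for eval:", heldout_detectors)
```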
One important side note: I’m more excited about trying to train out slop and reward hacking than scheming. (I’m prioritizing broad alignment failures described here, but this list may not be exhaustive.) If we catch AIs scheming, we should probably just not deploy them.
FWIW this is precisely the kind of CoT optimization pressure I’m worried about (because it is already happening): we increasingly optimize against the CoT for mundane product alignment concerns and lose the ability to get legible evidence of misalignment in the case of actual scheming.
Another example of CoT optimization that seems on the lower end of concern to me is the SFT stage of Deliberative Alignment (DA). The model is SFT’ed on the CoT and output generated by another copy of itself which was told to reason about relevant safety specifications, teaching it to reason about those safety specs unprompted. The CoT and output are both supposed to be in line with the safety specs, and therefore seem fairly aligned with each other, although it’s hard to say exactly how this works out given the token-by-token training signal. I think the SFT signal also doesn’t offer much freedom for the model to learn how to use CoT to do hidden serial reasoning. Overall I think this is probably also fine and net helpful for alignment, but I’ll gladly admit my arguments here are vague and I don’t trust my intuitions too much.
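(To spell out how I’m picturing the data generation, here’s a rough sketch in Python; `generate_with_cot` and the rest of the interface are hypothetical stand-ins, not OpenAI’s actual pipeline.)

```python
from typing import Callable, List, Tuple

# Rough sketch only; `generate_with_cot` is a hypothetical stand-in for whatever
# produces (chain_of_thought, final_output) from a prompt. Not OpenAI's pipeline.
def build_da_sft_dataset(
    generate_with_cot: Callable[[str], Tuple[str, str]],
    safety_spec: str,
    prompts: List[str],
) -> List[dict]:
    dataset = []
    for prompt in prompts:
        # A copy of the model sees the spec and is told to reason about it.
        spec_prompt = (
            f"{safety_spec}\n\n"
            "Reason about the relevant parts of the spec, then answer.\n\n"
            f"{prompt}"
        )
        cot, output = generate_with_cot(spec_prompt)
        # The SFT example drops the spec from the input, so the model is trained
        # (with an ordinary token-by-token SFT loss on cot + output) to produce
        # spec-based reasoning and a compliant answer unprompted.
        dataset.append({"input": prompt, "target": cot + "\n" + output})
    return dataset
```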
I know I’ve been somewhat of a broken record about this, but it’s really worth keeping in mind that this is just something OpenAI states. There’s no paper where they actually provide evidence for it (and I’d argue there is some evidence that in the case of o3 it significantly degraded CoT legibility; for monitorability, I don’t think anyone has good enough metrics to tell), but they state it often enough that it seems to have made its way into the collective consciousness. My worry is primarily that cases of training against the CoT will continue to go the way they are currently going, where (for example in Deliberative Alignment / Safe Completions) the default assumption is that techniques are fine until someone proves they aren’t.
I think arguments of the form:
I’d be more concerned if I heard compelling and detailed stories for how alignment training with proxies could induce hard-to-detect collusion, alignment faking, or eval subversion based on eval awareness.
are somewhat inverting the burden of proof here, i.e. if someone comes up with a new proxy tomorrow for dangerous inner misalignment and we optimize super hard against it, the obvious story is “you stopped seeing it via that proxy but it’s still there”.
First of all, I think a lot of what you’ve said here is reasonable and I appreciate the thoughtful responses!
Still, I’m overall less pessimistic than you.
I think there are other types of evidence and argument we can use for deciding whether our training methods are likely to select for obfuscated misbehavior or aligned behavior. For instance, we can just… think through the training setup and intuitively see how likely we think each outcome is. Of course, this is not a rigorous method, and people are sure to have different intuitions, and in general, intuitions about the outcomes of ML/DL training runs aren’t good. But this is still a good starting point for saying things like “it’s worse to penalize alignment faking in the CoT every time you see it than to do deliberative alignment with a non-alignment-faking safety spec.” Then we can try to be more specific and careful about what aspects of the training setup underlie this intuition.
I also think that it’s not super clear that having held-out ways of measuring whether a model is aligned is as essential to aligning models as having a held-out set in typical ML situations. It feels more like an analogy than like the same thing, and there are other analogies that point the other way and seem reasonable too. One such analogy is that parents don’t often avoid teaching their children how to behave well in a bunch of situations just to be able to evaluate whether their other efforts to raise the kid right are generalizing well. Ok honestly maybe parents sometimes do this, and maybe parents should do this more. It’s not a clean argument in favor of my point. But I still think it’s probably possible to come up with some analogies that suggest we should do this kind of training.
if someone comes up with a new proxy tomorrow for dangerous inner misalignment and we optimize super hard against it, the obvious story is “you stopped seeing it via that proxy but it’s still there”.
Yeah, agreed. I was being too vague in my demand for specific stories. I think the people training with proxies should make sure their setups don’t fall prey to the well-known pitfalls, and they should take into account arguments of any kind that their setup could result in obfuscated misbehavior. But for some setups (especially those described in Sections 4 and 5), I think there may be no strong default of obfuscation, nor any clear story for why it would be dangerous, which I think makes it at least worth studying. And I still agree that it would be good to have held-out signals, but I think it might be worthwhile to force ourselves to use eval signals that are more correlated with training signals if that lets us do much more to train out bad behavior with methods that have a realistic chance of generalizing.
Re: Deliberative Alignment, I made my own argument for why I think it’s probably fine (and you quoted it) rather than deferring to OpenAI. (In fact, I’ve heard OpenAI people say they’re concerned about obfuscation from Deliberative Alignment too.) Your response to Rauno where you said pre-safety o3 seemed more monitorable was interesting and could be an update for me here, but it’s also quite surprising and I’d be curious how much evidence you have for DA being responsible for that. I’d also be curious how monitorable more recent OpenAI CoTs are, e.g. for GPT-5.5. I’ve heard third-hand that recent models are more legible than o3; it would be great if OpenAI could share a small number of their CoT transcripts to make this more widely known.
and I’d argue there is some evidence that in the case of o3 it significantly degraded CoT legibility
Is there a good reason to expect that o3’s degraded legibility was caused by deliberative alignment, rather than a bug in the RL process, a bad initialization, lots of RL pressure, or something else like that?
Good question. I originally thought the degradation was entirely from capabilities RL, but after getting access to pre-safety-training checkpoints, the CoT in these cases appears not to have the same extreme degree of repeated “disclaim disclaim illusions”-type spandrels: https://www.lesswrong.com/posts/4hXWSw8tzoK9PM7v6/metagaming-matters-for-training-evaluation-and-oversight?commentId=FSXJ72KwyqoruaaLD. This doesn’t necessarily show that SFT has to impact legibility, but the fact that obvious cases of degradation haven’t prevented a technique from being applied in order to improve product safety makes me expect that this will continue to happen in the future. You can imagine a case where Deliberative Alignment would’ve been blocked on the grounds that OpenAI couldn’t measure its effects on CoT monitorability, or that they believed it degraded human legibility, but those don’t seem to have been sufficient.
I think the ideal regime here would be that (until they have other proxies), OpenAI in this example would say “okay, we’re relying on the CoT as a holdout detection mechanism for basically everything, thus we’ll apply the best alignment techniques we can which don’t intervene on the CoT”. If your interventions were downstream of the CoT (for example, you can imagine designing RL environments such that the model really does have to do in-depth reasoning about the spec to choose the right answer), you’d at least have some argument that “okay, our interventions maybe really did align the model”.[1]
This is still not rigorous and could fail in a bunch of ways, but here I’m just trying to argue that we should hold the line of “no, you really do need a holdout detection mechanism”.
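As a minimal sketch of what “interventions downstream of the CoT” could look like mechanically (hypothetical names, not any lab’s actual setup): the reward is computed only from the final answer’s compliance with the spec, and the CoT is logged for held-out monitoring but never enters the reward.

```python
from typing import Callable, List, Tuple

def cot_preserving_reward(
    generate_with_cot: Callable[[str], Tuple[str, str]],     # prompt -> (cot, answer)
    grade_answer_against_spec: Callable[[str, str], float],  # (prompt, answer) -> reward
    prompt: str,
    monitor_log: List[dict],
) -> float:
    cot, answer = generate_with_cot(prompt)
    # The CoT is only logged for held-out monitoring; it never touches the reward,
    # so all optimization pressure is applied downstream of the CoT.
    monitor_log.append({"prompt": prompt, "cot": cot})
    # Reward depends only on whether the final answer complies with the spec,
    # e.g. via environments where getting the right answer requires reasoning about it.
    return grade_answer_against_spec(prompt, answer)
```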