First of all, I think a lot of what you’ve said here is reasonable and I appreciate the thoughtful responses!
Still, I’m overall less pessimistic than you.
I think there are other types of evidence and argument we can use for deciding whether our training methods are likely to select for obfuscated misbehavior or aligned behavior. For instance, we can just… think through the training setup and intuitively see how likely we think each outcome is. Of course, this is not a rigorous method, and people are sure to have different intuitions, and in general, intuitions about the outcomes of ML/DL training runs aren’t good. But this is still a good starting point for saying things like “it’s worse to penalize alignment faking in the CoT every time you see it than to do deliberative alignment with a non-alignment-faking safety spec.” Then we can try to be more specific and careful about what aspects of the training setup underlie this intuition.
I also think that it’s not super clear that having held-out ways of measuring whether a model is aligned is as essential to aligning models as having a held-out set is in typical ML situations. It feels more like an analogy than like the same thing, and there are other analogies that point the other way and seem reasonable too. One such analogy: parents don’t typically hold back from teaching their children how to behave in a bunch of situations just so they can evaluate whether their other efforts to raise the kid right are generalizing well. Ok, honestly, maybe parents sometimes do this, and maybe parents should do this more. It’s not a clean argument in favor of my point. But I still think it’s probably possible to come up with some analogies that suggest we should do this kind of training.
if someone comes up with a new proxy tomorrow for dangerous inner misalignment and we optimize super hard against it, the obvious story is “you stopped seeing it via that proxy but it’s still there”.
Yeah, agreed. I was being too vague in my demand for specific stories. I think the people training with proxies should make sure their setups don’t fall prey to the well-known pitfalls, and they should take into account arguments of any kind that their setup could result in obfuscated misbehavior. But for some setups (especially those described in Sections 4 and 5), I think there may be no strong default of obfuscation, nor any clear story for why it would be dangerous, which I think makes it at least worth studying. And I still agree that it would be good to have held-out signals, but I think it might be worthwhile to force ourselves to use eval signals that are more correlated with training signals if that lets us do much more to train out bad behavior with methods that have a realistic chance of generalizing.
Re: Deliberative Alignment, I made my own argument for why I think it’s probably fine (and you quoted it) rather than deferring to OpenAI. (In fact, I’ve heard OpenAI people say they’re concerned about obfuscation from Deliberative Alignment too.) Your response to Rauno where you said pre-safety o3 seemed more monitorable was interesting and could be an update for me here, but it’s also quite surprising, and I’d be curious how much evidence you have that DA is responsible for that. I’d also be curious how monitorable more recent OpenAI CoTs are, e.g. for GPT-5.5. I’ve heard third-hand that recent models are more legible than o3; it would be great if OpenAI could share a small number of their CoT transcripts to make this more widely known.