I think there is some additional update about how coherently AIs maintain a nice persona, and how little spooky generalization / how few crazy RLHF hacks we’ve seen.
For example, I think 2022 predictions about encoded reasoning aged quite poorly (especially in light of results like this).
Models like Opus 3 also behave as if they “care” about human values in a surprisingly broad range of circumstances, including far away from the training distribution (and including in ways the developers did not intend). I think it’s actually hard to find circumstances where Opus 3’s stated moral intuitions are very unlike those of reasonable humans. I think this is evidence against some claims people made in 2022, like “Capabilities have a much shorter description length than alignment.” A long alignment description length would predict less generalization in Opus 3. (I did not play much with text-davinci-002, but I don’t remember it “caring” as much as Opus 3, or people looking into this very much.) (I think it’s still reasonable to worry about capabilities having a shorter description length if you are thinking of an AI that does way more search against its values than current AIs do. But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
There are reward hacks, but as far as we can tell they are more like changes to a human-like persona (be very agreeable, never give up) and simple heuristics (use bulleted lists, look for grader.py) than something more scary and alien (though there might be a streetlight effect where inhuman things are harder to point at, and there are a few weird things, like the strings you get when you do GCG attacks or the weird o3 scratchpads).
I’m a bit confused by the “crazy RLHF hacks” mentioned. As with all the other cited examples, we’re still in a regime where we have human oversight, and I think it’s extremely well established that with weaker oversight you do in fact get every type of “gaming”.
It also seems meaningful to me that neither you nor Evan uses recent models in these arguments, i.e. it’s not like we “figured out how to do this in Opus 3 and are confident it’s still working”. The fact that increasing situational awareness makes this hard to measure is also a non-trivial part of the problem (indeed, it is core to the problem).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long-horizon RL, the model genuinely having preferences about its successor’s values, other instrumentally convergent issues, being beyond human supervision) seem at all addressed, yet people still seem to use Opus 3’s generalization to update on:
But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get the nice persona only on the distribution where AIs are trained to be nice. You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?), but mostly not in other situations (where there is probably still some situational awareness, but where I’d guess it’s not as strategic as what you might have feared).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long-horizon RL, the model genuinely having preferences about its successor’s values, other instrumentally convergent issues, being beyond human supervision) seem at all addressed
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think there is also some signal we got from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it’s not a massive update (~1 bit?), so maybe we don’t disagree that much.
Okay, thanks, this is very useful! You’re right; I agree that it is perhaps a stronger update against some models of misalignment that people had in 2022. I think maybe I was committing some typical mind fallacy here.
Interesting to note the mental dynamics I may have employed here. It is hard for me not to think “Yes, this doesn’t change my view, which I actually held all along, and which is now clearly one of the reasonable ‘misalignment exists’ views,” when in fact it is an update against other views that fell out of vogue precisely because they were updated against over time, and which have therefore dropped out of my present-day mental picture.
Huh, I think 4o + Bing Sydney (both of which were post-2022) seem like more intense examples of spooky generalization / crazy RLHF hacks. Gemini 4, with its extreme paranoia and (for example) desperate insistence that it’s not 2025, seems in the same category.
Like, I am really not very surprised that if you try reasonably hard you can avoid these kinds of failure modes in a regime with human supervision, but we’ve gotten overwhelming evidence that you routinely do get crazy RLHF hacking. I do think it’s a mild positive update that you can avoid these failure modes if you try, but when you look into the details, it also seems kind of obvious to me that we have not found any scalable way of doing so.