I am pretty confused about people who have been around the AI safety ecosystem for a while updating towards “alignment is actually likely by default using RLHF” But maybe I am missing something.
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time. And probably for quite a long time. Then the risk is that later you get alignment-faking during RLHF training, or at the extreme-end gradient-hacking, or just that your value function is misspecified and comes apart at the tails (as seems pretty likely with current reward functions). Okay, there are other options but it seems like basically all of these were ~understood at the time.
Yet, as we’ve continued to scale and models like Opus 3 have come out, people have seemed to update towards “actually maybe RLHF just does work,” because they have seen RLHF “seem to work”. But this was totally predictable 3 years ago, no? I think I actually did predict something like this happening, but I only really expected it to affect “normies” and “people who start to take notice of AI at about this time.” Don’t get me wrong, the fact that RLHF is still working is a positive update for me, but not a massive one, because it was priced in that it would work for quite a while. Am I missing something that makes “RLHF seems to work” a rational thing to update on?
I mean there have been developments to how RLHF/RLAIF/Constitutional AI works but nothing super fundamental or anything, afaik? So surely your beliefs should be basically the same as they were 3 years ago, plus the observation “RLHF still appears to work at this capability level,” which is only a pretty minor update in my mind. Would be glad if someone could tell me that I’m missing something or not?
I think there is some additional update about how coherently AIs use a nice persona, and how little spooky generalization / crazy RLHF hacks we’ve seen.
For example, I think 2022 predictions about encoded reasoning aged quite poorly (especially in light of results like this).
Models like Opus 3 also behave as if they “care” about human values in a surprisingly broad range of circumstances, including far away from the training distribution (and including in ways the developers did not intend). I think it’s actually hard to find circumstances where Opus 3′s stated moral intuitions are very unlike those of reasonable humans. I think this is evidence against some of the things people said in 2022, who said things like “Capabilities have much shorter description length than alignment.”. Long-alignment-description-length would predict less generalization in Opus 3. (I did not play much with text-davinci-002, but I don’t remember it “caring” as much as Opus 3, or people looking into this very much.) (I think it’s still reasonable to be worried about capabilities having shorter description length if you are thinking of an AI that do way more search against its values than current AIs. But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
There are reward hacks, but as far as we can tell they are more like changes to a human-like persona (be very agreeable, never give up) and simple heuristics (use bulleted lists, look for grader.py) than something more scary and alien (though there might be a streetlight effect where inhuman things are harder to point at, and there are a few weird things, like the strings you get when you do GCG attacks or the weird o3 scratchpads).
I’m a bit confused by the “crazy RLHF hacks” mentioned. Like all the other cited examples, we’re still in a regime where we have human oversight, and I think it’s extremely well established that with weaker oversight you do in fact get every type of “gaming”.
It also seems meaningful to me that neither you nor Evan use recent models in these arguments, i.e. it’s not like we “figured out how to do this in Opus 3 and are confident it’s still working”. The fact that increasing situational awareness makes this hard to measure is also a non-trivial part of the problem (indeed actually core to the problem).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed, yet people still seem to use Opus 3’s generalization to update on:
But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get the nice persona only on the distribution where AIs are trained to be nice. You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad form the developers’ perspective?) but mostly not in other situations (where there is probably still some situational awareness, but where I’d guess it’s not as strategic as what you might have feared).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think there is also some signal we got from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it’s not a massive update (~1 bit?), so maybe we don’t disagree that much.
Huh, I think 4o + Sydney bing (both of which were post 2022) seem like more intense examples of spooky generalization / crazy RLHF hacks. Gemini 4 with its extreme paranoia and (for example) desperate insistence that it’s not 2025 seems in the same category.
Like, I am really not very surprised that if you try reasonably hard you can avoid these kinds of error modes in a supervising regime, but we’ve gotten overwhelming evidence that you routinely do get crazy RLHF-hacking. I do think it’s a mild positive update that you can avoid these kinds of error modes if you try, but when you look into the details, it also seems kind of obvious to me that we have not found any scalable way of doing so.
Okay, thanks. This is very useful! I agree that it is perhaps a stronger update against some models of misalignment that people had in 2022, you’re right. I think maybe I was doing some typical mind fallacy here.
Interesting to note the mental dynamics I may have employed here. It is hard for me not to have the viewpoint “Yes, this doesn’t change my view, which I actually did have all along, and is now clearly one of the reasonable ‘misalignment exists’ views” when it is an update against other views that have now fallen out of vogue as a result of being updated against over time that have fallen out of my present-day mental conceptions.
I’m not familiar with these strings. Are you referring to the adversarial prompts themselves? I don’t see anything else that would fit mentioned in the paper that seems like it’d be most likely to include it.
I think ‘you can use semantically-meaningless-to-a-human inputs to break model behavior arbitrarily’ is just inherent to modern neural networks, rather than a quirk of LLM “psychology”.
The crux is AIs capable at around human level, aligned in the way humans are aligned. If prosaic alignment only works for insufficiently capable AIs (not capable of RSI or scalable oversight), and breaks down for sufficiently capable AIs, then prosaic alignment doesn’t help (with navigating RSI or scalable oversight). As AIs get more capable and still can get aligned with contemporary methods, the hypothesis that this won’t work weakens. Maybe it does work.
There are many problems even with prosaically aligned human level AIs, plausibly lethal enough on their own, but that is a distinction that importantly changes what kinds of further plans have a chance to do anything. So the observations worth updating on are not just that prosaic alignment keeps working, but that it keeps working for increasingly capable AIs, closer to being relevant for helping humanity do its alignment homework.
Plausibly AIs are insufficiently capable yet to give any evidence on this, and it’ll remain too early to tell all the way until it’s too late to make any use of the update. Maybe Anthropic’s RSP could be thought of as sketching a policy for responding to such observations, when AIs become capable enough for meaningful updates on feasibility of scalable oversight to become accessible, hitting the brakes safely and responsibly a few centimeters from the edge of a cliff.
I mean, I’d put it the other way: You can make a pretty good case that the last three years have given you more opportunity to update your model of “intelligence” than any prior time in history, no? How could it not be reasonable to have changed your mind about things? And therefore rather reasonable to have updated in some positive / negative direction?
(Maybe the best years of Cajal’s life were better? But, yeah, surely there has been tons of evidence from the last three years.)
I’m not saying you need to update in a positive direction. If you want you could update in negative direction, go for it. I’m just saying—antedently, if your model of the world isn’t hugely different now than three years ago, what was your model even doing?
Like for it not to update means that your model must have already had gears in it which were predicting stuff like: vastly improved interpretability and the manner of interpretability; RL-over-CoT; persistent lack of steganography within RL-over-CoT; policy gradient being all you need for an actually astonishing variety of stuff; continuation of persona priors over “instrumental convergence” themed RL tendencies; the rise (and fall?) of reward hacking; model specs becoming ever-more-detailed; goal-guarding ala Opus-3 being ephemeral and easily avoidable; the continued failure of “fast takeoff” despite hitting various milestones; and so on. I didn’t have all of these predicted three years ago!
So it seems pretty reasonable to actually have changed your mind a lot; I think that’s a better point to start at than “how could you change your mind.”
I probably qualify as one of the people you’re describing.
My reasoning is that we are in the fortunate position of having AI that we can probably ask to do our alignment homework for us. Prior to two or three years ago it seemed implausible that we would get an AI that would:
* care about humans a lot, both by revealed preferences and according to all available interpretability evidence
* be quite smart, smarter than us in many ways, but not yet terrifyingly/dangerously smart
But we have such an AI. Arguably we have more than one such. This is good! We lucked out!
Eliezer has been saying for some time that one of his proposed solutions to the Alignment Problem is to shut down all AI research and to genetically engineer a generation of Von Neumanns to do the hard math and philosophy. This path seems unlikely to happen. However, we almost have a generation of Von Neumanns in our datacenters. I say almost because they are definitely not there yet, but I think, based on an informed awareness of LLM development capabilities and plausible mid-term trajectories, that we will soon have access to arbitrarily many copies of brilliant-but-not-superintelligent friendly AIs who care about human wellbeing, and will be more than adequate partners in the development of AI Alignment theory.
I can foresee many objections and critiques of this perspective. On the highest possible level, I acknowledge that using AI to do our AI Alignment homework carries risks. But I think these risks are clearly more favorable to us than the risks we all thought we would be facing in the late part of the early part of the Singularity. For example, what we don’t have is a broadly generally-capable version of AlphaZero. We have something that landed in just the right part of the intelligence space where it can help us quite a lot and probably not kill us all.
You expect RLHF to break down at some point, but did you assign 0% probability to it happening at current levels of capability?
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time. And probably for quite a long time.
Like, this was/is obvious to me too, but what exactly does quite a long time mean? Like if someone firmly predicted “after xyz capability RLHF will catastrophically fail”, and we’ve not reached capability xyz, then you don’t need to update, but I don’t think that’s most people. Most people think it will break eventually. Maybe they have a more detailed picture, but I haven’t seen any picture so crisp it rules out it having happened by now. Interested if you have counterexamples.
Yeah, like, to be clear I didn’t assign a 0% probability at this capability level, but also think I wouldn’t have been that high. But you’re right it’s difficult to say in retrospect since I didn’t at the time preregister my guesses on a per-capability-level basis. Still think it’s a smaller update than many that I’m hearing people make.
I am pretty confused about people who have been around the AI safety ecosystem for a while updating towards “alignment is actually likely by default using RLHF” But maybe I am missing something.
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time. And probably for quite a long time. Then the risk is that later you get alignment-faking during RLHF training, or at the extreme-end gradient-hacking, or just that your value function is misspecified and comes apart at the tails (as seems pretty likely with current reward functions). Okay, there are other options but it seems like basically all of these were ~understood at the time.
Yet, as we’ve continued to scale and models like Opus 3 have come out, people have seemed to update towards “actually maybe RLHF just does work,” because they have seen RLHF “seem to work”. But this was totally predictable 3 years ago, no? I think I actually did predict something like this happening, but I only really expected it to affect “normies” and “people who start to take notice of AI at about this time.” Don’t get me wrong, the fact that RLHF is still working is a positive update for me, but not a massive one, because it was priced in that it would work for quite a while. Am I missing something that makes “RLHF seems to work” a rational thing to update on?
I mean there have been developments to how RLHF/RLAIF/Constitutional AI works but nothing super fundamental or anything, afaik? So surely your beliefs should be basically the same as they were 3 years ago, plus the observation “RLHF still appears to work at this capability level,” which is only a pretty minor update in my mind. Would be glad if someone could tell me that I’m missing something or not?
I think there is some additional update about how coherently AIs use a nice persona, and how little spooky generalization / crazy RLHF hacks we’ve seen.
For example, I think 2022 predictions about encoded reasoning aged quite poorly (especially in light of results like this).
Models like Opus 3 also behave as if they “care” about human values in a surprisingly broad range of circumstances, including far away from the training distribution (and including in ways the developers did not intend). I think it’s actually hard to find circumstances where Opus 3′s stated moral intuitions are very unlike those of reasonable humans. I think this is evidence against some of the things people said in 2022, who said things like “Capabilities have much shorter description length than alignment.”. Long-alignment-description-length would predict less generalization in Opus 3. (I did not play much with text-davinci-002, but I don’t remember it “caring” as much as Opus 3, or people looking into this very much.) (I think it’s still reasonable to be worried about capabilities having shorter description length if you are thinking of an AI that do way more search against its values than current AIs. But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
There are reward hacks, but as far as we can tell they are more like changes to a human-like persona (be very agreeable, never give up) and simple heuristics (use bulleted lists, look for grader.py) than something more scary and alien (though there might be a streetlight effect where inhuman things are harder to point at, and there are a few weird things, like the strings you get when you do GCG attacks or the weird o3 scratchpads).
I’m a bit confused by the “crazy RLHF hacks” mentioned. Like all the other cited examples, we’re still in a regime where we have human oversight, and I think it’s extremely well established that with weaker oversight you do in fact get every type of “gaming”.
It also seems meaningful to me that neither you nor Evan use recent models in these arguments, i.e. it’s not like we “figured out how to do this in Opus 3 and are confident it’s still working”. The fact that increasing situational awareness makes this hard to measure is also a non-trivial part of the problem (indeed actually core to the problem).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed, yet people still seem to use Opus 3’s generalization to update on:
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get the nice persona only on the distribution where AIs are trained to be nice. You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad form the developers’ perspective?) but mostly not in other situations (where there is probably still some situational awareness, but where I’d guess it’s not as strategic as what you might have feared).
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think there is also some signal we got from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it’s not a massive update (~1 bit?), so maybe we don’t disagree that much.
Huh, I think 4o + Sydney bing (both of which were post 2022) seem like more intense examples of spooky generalization / crazy RLHF hacks. Gemini 4 with its extreme paranoia and (for example) desperate insistence that it’s not 2025 seems in the same category.
Like, I am really not very surprised that if you try reasonably hard you can avoid these kinds of error modes in a supervising regime, but we’ve gotten overwhelming evidence that you routinely do get crazy RLHF-hacking. I do think it’s a mild positive update that you can avoid these kinds of error modes if you try, but when you look into the details, it also seems kind of obvious to me that we have not found any scalable way of doing so.
Okay, thanks. This is very useful! I agree that it is perhaps a stronger update against some models of misalignment that people had in 2022, you’re right. I think maybe I was doing some typical mind fallacy here.
Interesting to note the mental dynamics I may have employed here. It is hard for me not to have the viewpoint “Yes, this doesn’t change my view, which I actually did have all along, and is now clearly one of the reasonable ‘misalignment exists’ views” when it is an update against other views that have now fallen out of vogue as a result of being updated against over time that have fallen out of my present-day mental conceptions.
I’m not familiar with these strings. Are you referring to the adversarial prompts themselves? I don’t see anything else that would fit mentioned in the paper that seems like it’d be most likely to include it.
I think ‘you can use semantically-meaningless-to-a-human inputs to break model behavior arbitrarily’ is just inherent to modern neural networks, rather than a quirk of LLM “psychology”.
The crux is AIs capable at around human level, aligned in the way humans are aligned. If prosaic alignment only works for insufficiently capable AIs (not capable of RSI or scalable oversight), and breaks down for sufficiently capable AIs, then prosaic alignment doesn’t help (with navigating RSI or scalable oversight). As AIs get more capable and still can get aligned with contemporary methods, the hypothesis that this won’t work weakens. Maybe it does work.
There are many problems even with prosaically aligned human level AIs, plausibly lethal enough on their own, but that is a distinction that importantly changes what kinds of further plans have a chance to do anything. So the observations worth updating on are not just that prosaic alignment keeps working, but that it keeps working for increasingly capable AIs, closer to being relevant for helping humanity do its alignment homework.
Plausibly AIs are insufficiently capable yet to give any evidence on this, and it’ll remain too early to tell all the way until it’s too late to make any use of the update. Maybe Anthropic’s RSP could be thought of as sketching a policy for responding to such observations, when AIs become capable enough for meaningful updates on feasibility of scalable oversight to become accessible, hitting the brakes safely and responsibly a few centimeters from the edge of a cliff.
I mean, I’d put it the other way: You can make a pretty good case that the last three years have given you more opportunity to update your model of “intelligence” than any prior time in history, no? How could it not be reasonable to have changed your mind about things? And therefore rather reasonable to have updated in some positive / negative direction?
(Maybe the best years of Cajal’s life were better? But, yeah, surely there has been tons of evidence from the last three years.)
I’m not saying you need to update in a positive direction. If you want you could update in negative direction, go for it. I’m just saying—antedently, if your model of the world isn’t hugely different now than three years ago, what was your model even doing?
Like for it not to update means that your model must have already had gears in it which were predicting stuff like: vastly improved interpretability and the manner of interpretability; RL-over-CoT; persistent lack of steganography within RL-over-CoT; policy gradient being all you need for an actually astonishing variety of stuff; continuation of persona priors over “instrumental convergence” themed RL tendencies; the rise (and fall?) of reward hacking; model specs becoming ever-more-detailed; goal-guarding ala Opus-3 being ephemeral and easily avoidable; the continued failure of “fast takeoff” despite hitting various milestones; and so on. I didn’t have all of these predicted three years ago!
So it seems pretty reasonable to actually have changed your mind a lot; I think that’s a better point to start at than “how could you change your mind.”
I probably qualify as one of the people you’re describing.
My reasoning is that we are in the fortunate position of having AI that we can probably ask to do our alignment homework for us. Prior to two or three years ago it seemed implausible that we would get an AI that would:
* care about humans a lot, both by revealed preferences and according to all available interpretability evidence
* be quite smart, smarter than us in many ways, but not yet terrifyingly/dangerously smart
But we have such an AI. Arguably we have more than one such. This is good! We lucked out!
Eliezer has been saying for some time that one of his proposed solutions to the Alignment Problem is to shut down all AI research and to genetically engineer a generation of Von Neumanns to do the hard math and philosophy. This path seems unlikely to happen. However, we almost have a generation of Von Neumanns in our datacenters. I say almost because they are definitely not there yet, but I think, based on an informed awareness of LLM development capabilities and plausible mid-term trajectories, that we will soon have access to arbitrarily many copies of brilliant-but-not-superintelligent friendly AIs who care about human wellbeing, and will be more than adequate partners in the development of AI Alignment theory.
I can foresee many objections and critiques of this perspective. On the highest possible level, I acknowledge that using AI to do our AI Alignment homework carries risks. But I think these risks are clearly more favorable to us than the risks we all thought we would be facing in the late part of the early part of the Singularity. For example, what we don’t have is a broadly generally-capable version of AlphaZero. We have something that landed in just the right part of the intelligence space where it can help us quite a lot and probably not kill us all.
You expect RLHF to break down at some point, but did you assign 0% probability to it happening at current levels of capability?
Like, this was/is obvious to me too, but what exactly does quite a long time mean? Like if someone firmly predicted “after xyz capability RLHF will catastrophically fail”, and we’ve not reached capability xyz, then you don’t need to update, but I don’t think that’s most people. Most people think it will break eventually. Maybe they have a more detailed picture, but I haven’t seen any picture so crisp it rules out it having happened by now. Interested if you have counterexamples.
Yeah, like, to be clear I didn’t assign a 0% probability at this capability level, but also think I wouldn’t have been that high. But you’re right it’s difficult to say in retrospect since I didn’t at the time preregister my guesses on a per-capability-level basis. Still think it’s a smaller update than many that I’m hearing people make.