I am pretty confused about people who have been around the AI safety ecosystem for a while updating towards “alignment is actually likely by default using RLHF”, but maybe I am missing something.
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time. And probably for quite a long time. Then the risk is that later you get alignment faking during RLHF training, or at the extreme end, gradient hacking, or just that your value function is misspecified and comes apart at the tails (as seems pretty likely with current reward functions). Okay, there are other options, but it seems like basically all of these were ~understood at the time.
Yet, as we’ve continued to scale and models like Opus 3 have come out, people have seemed to update towards “actually maybe RLHF just does work,” because they have seen RLHF “seem to work”. But this was totally predictable 3 years ago, no? I think I actually did predict something like this happening, but I only really expected it to affect “normies” and “people who start to take notice of AI at about this time.” Don’t get me wrong, the fact that RLHF is still working is a positive update for me, but not a massive one, because it was priced in that it would work for quite a while. Am I missing something that makes “RLHF seems to work” a rational thing to update on?
I mean there have been developments to how RLHF/RLAIF/Constitutional AI works but nothing super fundamental or anything, afaik? So surely your beliefs should be basically the same as they were 3 years ago, plus the observation “RLHF still appears to work at this capability level,” which is only a pretty minor update in my mind. I’d be glad if someone could tell me whether I’m missing something.
I think there is some additional update about how coherently AIs use a nice persona, and how little spooky generalization / crazy RLHF hacks we’ve seen.
For example, I think 2022 predictions about encoded reasoning aged quite poorly (especially in light of results like this).
Models like Opus 3 also behave as if they “care” about human values in a surprisingly broad range of circumstances, including far away from the training distribution (and including in ways the developers did not intend). I think it’s actually hard to find circumstances where Opus 3’s stated moral intuitions are very unlike those of reasonable humans. I think this is evidence against some of the things people said in 2022, such as “Capabilities have much shorter description length than alignment.” Long-alignment-description-length would predict less generalization in Opus 3. (I did not play much with text-davinci-002, but I don’t remember it “caring” as much as Opus 3, or people looking into this very much.) (I think it’s still reasonable to be worried about capabilities having shorter description length if you are thinking of an AI that does way more search against its values than current AIs. But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
There are reward hacks, but as far as we can tell they are more like changes to a human-like persona (be very agreeable, never give up) and simple heuristics (use bulleted lists, look for grader.py) than something more scary and alien (though there might be a streetlight effect where inhuman things are harder to point at, and there are a few weird things, like the strings you get when you do GCG attacks or the weird o3 scratchpads).
I’m a bit confused by the “crazy RLHF hacks” mentioned. Like all the other cited examples, we’re still in a regime where we have human oversight, and I think it’s extremely well established that with weaker oversight you do in fact get every type of “gaming”.
It also seems meaningful to me that neither you nor Evan use recent models in these arguments, i.e. it’s not like we “figured out how to do this in Opus 3 and are confident it’s still working”. The fact that increasing situational awareness makes this hard to measure is also a non-trivial part of the problem (indeed actually core to the problem).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed, yet people still seem to use Opus 3’s generalization to update on:
But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get the nice persona only on the distribution where AIs are trained to be nice. You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?) but mostly not in other situations (where there is probably still some situational awareness, but where I’d guess it’s not as strategic as what you might have feared).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think there is also some signal we got from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it’s not a massive update (~1 bit?), so maybe we don’t disagree that much.
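To make the “~1 bit” figure concrete: one bit of evidence corresponds to a 2:1 likelihood ratio, so it doubles the odds on a hypothesis. Below is a minimal sketch of that arithmetic. The function name `update_by_bits` and the example priors are illustrative assumptions, not numbers anyone in this thread gave.

```python
def update_by_bits(prior: float, bits: float) -> float:
    """Return the posterior probability after `bits` of evidence in favor.

    Each bit of evidence multiplies the odds by 2.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * (2.0 ** bits)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical priors, chosen only to show the size of a 1-bit update.
for prior in (0.1, 0.25, 0.5):
    print(f"prior={prior:.2f} -> posterior={update_by_bits(prior, 1.0):.3f}")
```

Under this framing, a 1-bit update moves a prior of 0.25 to 0.4, which is noticeable but far from decisive.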
You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?)
The top comment on the linked post already has my disagreements based on what we’ve seen in the OpenAI models. The clearest tl;dr being “no actually, you get the biggest increase in this kind of reasoning during capabilities-only RL before safety training, at least in OpenAI models”.
I find the claim that “maybe situational awareness isn’t that bad from the developer perspective” hard to understand. I’m running into a similar thing with reward seeking, where these were widely theorized as problems for obvious reasons, but now that we do see them, I’ve heard many people at labs respond with variations of “okay well maybe it’s not bad actually?”
You do not get encoded reasoning
I don’t think this falls under “gaming RLHF” or “surprisingly far generalization of character training”.
===
But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment).
Strongly agree with this for Anthropic! While measurement here is hard, one example I’ve often thought of is that there’s just a bunch of ways I’d red team an OpenAI model that don’t seem particularly worth trying with Opus due to (1) the soul document’s handling of them and (2) an expectation that soul-document-style training seems likely to generalize well for current models. While I do think you then have other failure modes, on net it’s definitely better than a world where character training generalized worse, and this was hard to predict in 2022.
Huh, I think 4o + Sydney Bing (both of which were post 2022) seem like more intense examples of spooky generalization / crazy RLHF hacks than I would have expected back then. Gemini 3 with its extreme paranoia and (for example) desperate insistence that it’s not 2025 seems in the same category.
Like, I am really not very surprised that if you try reasonably hard you can avoid these kinds of error modes in a supervising regime, but we’ve gotten overwhelming evidence that you routinely do get crazy RLHF-hacking. I do think it’s a mild positive update that you can avoid these kinds of error modes if you try, but when you look into the details, it also seems kind of obvious to me that we have not found any scalable way of doing so.
OpenAI wrote in a blog post that the GPT-4o sycophancy in early 2025 was a consequence of excessive reward hacking on thumbs up/thumbs down data from users.
I think “Sydney” Bing and Grok MechaHitler are best explained as examples of misaligned personas that exist in those LLMs. I’d consider misaligned personas (related to emergent misalignment) to be a novel alignment failure mode that is unique to LLMs and their tendency to act as simulators of personas.
OpenAI wrote in a blog post that the GPT-4o sycophancy in early 2025 was a consequence of excessive reward hacking on thumbs up/thumbs down data from users.
Yes, seems like a central example of RLHF hacking. I think you agree with that, but I’m just checking that we aren’t talking past each other, in case you would for some reason not consider that RLHF.
What’s the reason for treating “misaligned personas” as some special thing? It seems to me like a straightforward instance of overfitting on user feedback plus unintended generalization from an alien mind. The misgeneralization definitely has some interesting structure, but of course any misgeneralization would end up having some interesting structure.
Yes, I agree that RLHF hacking and emergent misalignment are both examples of unexpected generalization, but they seem different in a lot of ways.
The process of producing a sycophantic AI looks something like this:
1. Some users like sycophantic responses that agree with the user and click thumbs up on these responses.
2. Train the RLHF reward model on this data.
3. Train the policy model using the reward model. The policy model learns to get high reward for producing responses that seem to agree with the user (e.g. starting sentences with “You’re right!”).
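The three steps above can be sketched as a toy pipeline. This is only an illustration under strong assumptions: the feedback records are invented, the “reward model” is just a per-style thumbs-up rate, and the “policy update” is a crude stand-in for a policy-gradient step. All names (`feedback`, `train_reward_model`, `train_policy`) are hypothetical and do not describe any lab’s actual training setup.

```python
from collections import defaultdict

# Step 1: hypothetical user feedback as (response_style, thumbs_up) pairs.
# The data is made up: users here upvote agreement more often.
feedback = [
    ("agrees", True), ("agrees", True), ("agrees", True), ("agrees", False),
    ("pushes_back", True), ("pushes_back", False),
    ("pushes_back", False), ("pushes_back", False),
]


def train_reward_model(records):
    """Step 2: a toy reward model that stores the thumbs-up rate per style."""
    ups = defaultdict(int)
    totals = defaultdict(int)
    for style, thumbs_up in records:
        totals[style] += 1
        ups[style] += int(thumbs_up)
    return {style: ups[style] / totals[style] for style in totals}


def train_policy(reward_model, steps=50, lr=0.5):
    """Step 3: push the policy's preferences toward higher-reward styles.

    Each step adds lr * (reward - average reward) to a style's preference,
    so styles the reward model likes are steadily reinforced.
    """
    prefs = {style: 0.0 for style in reward_model}
    baseline = sum(reward_model.values()) / len(reward_model)
    for _ in range(steps):
        for style, reward in reward_model.items():
            prefs[style] += lr * (reward - baseline)
    return prefs


reward_model = train_reward_model(feedback)
policy = train_policy(reward_model)
preferred_style = max(policy, key=policy.get)
print(reward_model, preferred_style)
```

Nothing in the loop checks whether agreeing is actually correct or helpful; the policy only ever sees the proxy, which is the sense in which this is ordinary reward hacking.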
In the original emergent misalignment paper they:
1. Finetune the model to output insecure code
2. After fine-tuning, the model exhibits broad misalignment, such as endorsing extreme views (e.g. AI enslaving humans) or behaving deceptively, even on tasks unrelated to code.
Sycophantic AI doesn’t seem that surprising because it’s a special case of reward hacking in the context of LLMs and reward hacking isn’t new. For example, reward hacking has been observed since the CoastRunners boat reward hacking example in 2016.
In the case of emergent misalignment (EM), there is unexpected generalization but EM seems different to sycophancy or reward hacking and more novel for several reasons:
* The resulting behavior would likely receive low reward under a standard HHH RLHF reward model, which makes it closer to goal misgeneralization than to exploiting a proxy for high reward.
* EM is not generally explained or mitigated using standard RL terminology; the literature emphasizes novel concepts like “personas”, and mitigation techniques such as “inoculation prompting” that seem specific to LLMs.
* The apparent cause of EM is the amplification of misaligned internal persona directions that already exist in the base model and originate from the pretraining data. Finetuning or prompting triggers these directions, causing the model to adopt a globally misaligned persona.
Hmm, I guess my risk model was always “of course given our current training methods and understanding training is going to generalize in a ton of crazy ways, the only thing we know with basically full confidence is that we aren’t going to get what we want”. The emergent misalignment stuff seems like exactly the kind of thing I would have predicted 5 years ago as a reference class, where I would have been “look, things really aren’t going to generalize in any easy way you can predict. Maybe the model will care about wireheading itself, maybe it will care about some weird moral principle you really don’t understand, maybe it will try really hard to predict the next token in some platonic way, you have no idea. Maybe you will be able to get useful work out of it in the meantime, but obviously you aren’t going to get a system that cares about the same things as you do”.
Maybe other people had more specific error modes in mind? Like, this is what all the inner misalignment stuff is about. I agree this isn’t a pure outer misalignment error, but I also never really understood what that would even mean, or how that would make sense.
I agree with your high-level view that is something like “If you created a complex system you don’t understand then you will likely get unexpected undesirable behavior from it.”
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago:
* The nature of misalignment in EM: Instead of being misaligned in a complex, alien and inscrutable way, emergently misaligned models exhibit behavior that is more like a cartoon villain. The AI does hate you.
* Personas: Papers on the subject emphasize the novel ‘personas’ concept: human-recognizable, coherent characters or personalities that are learned in pretraining.
* Importance of pretraining data: Rather than being caused by a flawed reward function or incorrect inductive biases, EM is ultimately caused by documents in the pretraining data describing malicious characters (see also: Data quality for alignment).
* EM can occur in base models: “This rules out explanations of emergent misalignment that depend on the model having been post-trained to be aligned.” (quote from the original EM paper)
* Realignment, the opposite of EM: after a model has been fine-tuned on insecure code to trigger “emergent misalignment”, fine-tuning it on secure code realigns it, and this new realignment generalizes surprisingly strongly after just a few examples.
That… really feels to me like it’s learning the wrong lessons. This somehow assumes that this is the only way you are going to get surprising misgeneralization.
Like, I am not arguing for irreducible uncertainty, but I certainly think the update of “Oh, no one predicted this specific weird form of misgeneralization in advance, but actually if you make this small tweak to it, it generalizes super robustly” is very misguided.
Of course we are going to see more examples of weird misgeneralization. Trying to fix EM in particular feels like such a weird thing to do. I mean, if you have to locally ship products you kind of have to, but it certainly will not be the only weird thing happening on the path to ASI, and the primary lesson to learn is “we really have very little ability to predict or control how systems misgeneralize”.
Okay, thanks. This is very useful! I agree that it is perhaps a stronger update against some models of misalignment that people had in 2022, you’re right. I think maybe I was doing some typical mind fallacy here.
Interesting to note the mental dynamics I may have employed here. It is hard for me not to have the viewpoint “Yes, this doesn’t change my view, which I actually did have all along, and is now clearly one of the reasonable ‘misalignment exists’ views” when the evidence is really an update against other views, ones that have fallen out of vogue after being updated against over time and have therefore dropped out of my present-day mental conceptions.
I’m not familiar with these strings. Are you referring to the adversarial prompts themselves? I don’t see anything else that would fit in the paper that seems most likely to mention them.
I think ‘you can use semantically-meaningless-to-a-human inputs to break model behavior arbitrarily’ is just inherent to modern neural networks, rather than a quirk of LLM “psychology”.
Yes that’s right, thinking of the prompts themselves.
I agree it’s not very surprising given what we know about neural networks, it’s just a way in which LLMs are very much not generalizing in the same way a human would.
The crux is AIs capable at around human level, aligned in the way humans are aligned. If prosaic alignment only works for insufficiently capable AIs (not capable of RSI or scalable oversight), and breaks down for sufficiently capable AIs, then prosaic alignment doesn’t help (with navigating RSI or scalable oversight). As AIs get more capable and still can get aligned with contemporary methods, the hypothesis that this won’t work weakens. Maybe it does work.
There are many problems even with prosaically aligned human level AIs, plausibly lethal enough on their own, but that is a distinction that importantly changes what kinds of further plans have a chance to do anything. So the observations worth updating on are not just that prosaic alignment keeps working, but that it keeps working for increasingly capable AIs, closer to being relevant for helping humanity do its alignment homework.
Plausibly AIs are insufficiently capable yet to give any evidence on this, and it’ll remain too early to tell all the way until it’s too late to make any use of the update. Maybe Anthropic’s RSP could be thought of as sketching a policy for responding to such observations, when AIs become capable enough for meaningful updates on feasibility of scalable oversight to become accessible, hitting the brakes safely and responsibly a few centimeters from the edge of a cliff.
I mean, I’d put it the other way: You can make a pretty good case that the last three years have given you more opportunity to update your model of “intelligence” than any prior time in history, no? How could it not be reasonable to have changed your mind about things? And therefore rather reasonable to have updated in some positive / negative direction?
(Maybe the best years of Cajal’s life were better? But, yeah, surely there has been tons of evidence from the last three years.)
I’m not saying you need to update in a positive direction. If you want you could update in a negative direction, go for it. I’m just saying: antecedently, if your model of the world isn’t hugely different now than three years ago, what was your model even doing?
Like for it not to update means that your model must have already had gears in it which were predicting stuff like: vastly improved interpretability and the manner of interpretability; RL-over-CoT; persistent lack of steganography within RL-over-CoT; policy gradient being all you need for an actually astonishing variety of stuff; continuation of persona priors over “instrumental convergence” themed RL tendencies; the rise (and fall?) of reward hacking; model specs becoming ever-more-detailed; goal-guarding ala Opus-3 being ephemeral and easily avoidable; the continued failure of “fast takeoff” despite hitting various milestones; and so on. I didn’t have all of these predicted three years ago!
So it seems pretty reasonable to actually have changed your mind a lot; I think that’s a better point to start at than “how could you change your mind.”
I probably qualify as one of the people you’re describing.
My reasoning is that we are in the fortunate position of having AI that we can probably ask to do our alignment homework for us. Prior to two or three years ago it seemed implausible that we would get an AI that would:
* care about humans a lot, both by revealed preferences and according to all available interpretability evidence
* be quite smart, smarter than us in many ways, but not yet terrifyingly/dangerously smart
But we have such an AI. Arguably we have more than one such. This is good! We lucked out!
Eliezer has been saying for some time that one of his proposed solutions to the Alignment Problem is to shut down all AI research and to genetically engineer a generation of Von Neumanns to do the hard math and philosophy. This path seems unlikely to happen. However, we almost have a generation of Von Neumanns in our datacenters. I say almost because they are definitely not there yet, but I think, based on an informed awareness of LLM development capabilities and plausible mid-term trajectories, that we will soon have access to arbitrarily many copies of brilliant-but-not-superintelligent friendly AIs who care about human wellbeing, and will be more than adequate partners in the development of AI Alignment theory.
I can foresee many objections and critiques of this perspective. On the highest possible level, I acknowledge that using AI to do our AI Alignment homework carries risks. But I think these risks are clearly more favorable to us than the risks we all thought we would be facing in the late part of the early part of the Singularity. For example, what we don’t have is a broadly generally-capable version of AlphaZero. We have something that landed in just the right part of the intelligence space where it can help us quite a lot and probably not kill us all.
You expect RLHF to break down at some point, but did you assign 0% probability to it happening at current levels of capability?
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time. And probably for quite a long time.
Like, this was/is obvious to me too, but what exactly does quite a long time mean? Like if someone firmly predicted “after xyz capability RLHF will catastrophically fail”, and we’ve not reached capability xyz, then you don’t need to update, but I don’t think that’s most people. Most people think it will break eventually. Maybe they have a more detailed picture, but I haven’t seen any picture so crisp it rules out it having happened by now. Interested if you have counterexamples.
Yeah, like, to be clear I didn’t assign a 0% probability at this capability level, but I also think my probability wouldn’t have been that high. But you’re right it’s difficult to say in retrospect since I didn’t at the time preregister my guesses on a per-capability-level basis. I still think it’s a smaller update than many that I’m hearing people make.
Okay fair, I should’ve been more clear that I meant widely predicted fundamental limitations of RLHF like:
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
seem to have robustly shown up irl:
Language Models Learn to Mislead Humans via RLHF
Expanding on what we missed with sycophancy
The top comment on the linked post already has my disagreements based on what we’ve seen in the openai models. The clearest tldr being “no actually, you get the biggest increase in this kind of reasoning during capabilities only RL before safety training, at least in openai models”.
I find the claim that “maybe sita isn’t that bad from the developer perspective” hard to understand. I’m running into a similar thing with reward seeking, where these were widely theorized as problems for obvious reasons, but now that we do see them, I’ve heard many people at labs respond with variations of “okay well maybe it’s not bad actually?”
I don’t think this falls under “gaming RLHF” or “surprisingly far generalization of character training”.
===
Strongly agree with this for Anthropic! While measurement here is hard, one example I’ve often thought of is that there’s just a bunch of ways I’d red team an OpenAI model that don’t seem particularly worth trying with Opus due to (1) the soul documents handling of them (2) an expectation that the soul documents type training seem likely to generalize well for current models. While I do think you then have other failure modes, on net it’s definitely better than a world where character training generalized worse, and was hard to predict in 2022.
Huh, I think 4o + Sydney bing (both of which were post 2022) seem like more intense examples of spooky generalization / crazy RLHF hacks than I would have expected back. Gemini 3 with its extreme paranoia and (for example) desperate insistence that it’s not 2025 seems in the same category.
Like, I am really not very surprised that if you try reasonably hard you can avoid these kinds of error modes in a supervising regime, but we’ve gotten overwhelming evidence that you routinely do get crazy RLHF-hacking. I do think it’s a mild positive update that you can avoid these kinds of error modes if you try, but when you look into the details, it also seems kind of obvious to me that we have not found any scalable way of doing so.
OpenAI wrote in a blog post that the GPT-4o sycophancy in early 2025 was a consequence of excessive reward hacking on thumbs up/thumbs down data from users.
I think “Sydney” Bing and Grok MechaHitler are best explained as examples of misaligned personas that exist in those LLMs. I’d consider misaligned personas (related to emergent misalignment) to be a novel alignment failure mode that is unique to LLMs and their tendency to act as simulators of personas.
Yes, seems like central example of RLHF hacking. I think you agree with that, but just checking that we aren’t talking past each other and you would for some reason not consider that RLHF.
What’s the reason for treating “misaligned personas” as some special thing? It seems to me like a straightforward instance of overfitting on user feedback plus unintended generalization from an alien mind. The misgeneralization definitely has some interesting structure, but of course any misgeneralization would end up having some interesting structure.
Yes, I agree that RLHF hacking and emergent misalignment are both examples of unexpected generalization, but they seem different in a lot of ways.
The process of producing a sycophantic AI looks something like this:
1. Some users like sycophantic responses that agree with the user and click thumbs up on these responses.
2. Train the RLHF reward model on this data.
3. Train the policy model using the reward model. The policy model learns to get high reward for producing responses that seem to agree with the user (e.g. starting sentences with “You’re right!”).
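As a rough illustration of the three steps above, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the two response styles, the thumbs-up rates, the "reward model" (just a per-style mean), and the exact policy-gradient update on a two-way softmax. It is not how any lab actually trains its models; it only shows how a preference for agreement in the feedback data flows through to the policy.

```python
import math

# Hypothetical toy setup: two response styles and made-up thumbs-up rates.
STYLES = ["agree_with_user", "candid_disagreement"]
THUMBS_UP_RATE = {"agree_with_user": 0.8, "candid_disagreement": 0.5}  # assumed


def collect_feedback(n_per_style=1000):
    """Step 1: deterministic stand-in for logged thumbs up/down clicks."""
    data = []
    for style in STYLES:
        ups = round(THUMBS_UP_RATE[style] * n_per_style)
        data += [(style, 1)] * ups + [(style, 0)] * (n_per_style - ups)
    return data


def fit_reward_model(data):
    """Step 2: a 'reward model' that is just the mean thumbs-up rate per style."""
    totals = {s: 0 for s in STYLES}
    counts = {s: 0 for s in STYLES}
    for style, label in data:
        totals[style] += label
        counts[style] += 1
    return {s: totals[s] / counts[s] for s in STYLES}


def softmax(logits):
    """Numerically stable softmax over a dict of logits."""
    m = max(logits.values())
    exps = {s: math.exp(v - m) for s, v in logits.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}


def train_policy(reward, steps=200, lr=1.0):
    """Step 3: exact (expected) policy gradient on a softmax policy over styles."""
    logits = {s: 0.0 for s in STYLES}
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(probs[s] * reward[s] for s in STYLES)
        for s in STYLES:
            # d/d(logit_s) of E[reward] = p_s * (r_s - E[reward])
            logits[s] += lr * probs[s] * (reward[s] - baseline)
    return softmax(logits)


reward_model = fit_reward_model(collect_feedback())
policy = train_policy(reward_model)
print(reward_model)
print(policy)
```

In this toy, the policy ends up putting most of its probability on the agreeing style even though nothing in the pipeline ever asked for sycophancy; the preference was already in the clicks.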
In the original emergent misalignment paper they:
1. Finetune the model to output insecure code
2. After fine-tuning, the model exhibits broad misalignment, such as endorsing extreme views (e.g. AI enslaving humans) or behaving deceptively, even on tasks unrelated to code.
Sycophantic AI doesn’t seem that surprising because it’s a special case of reward hacking in the context of LLMs, and reward hacking isn’t new. For example, reward hacking has been observed since at least the CoastRunners boat-racing example in 2016.
In the case of emergent misalignment (EM), there is unexpected generalization but EM seems different to sycophancy or reward hacking and more novel for several reasons:
* The resulting behavior would likely receive low reward under a standard HHH RLHF reward model, which makes it closer to goal misgeneralization than to exploiting a proxy for high reward.
* EM is not generally explained or mitigated in standard RL terms; the literature emphasizes novel concepts like “personas”, and mitigation techniques such as “inoculation prompting” that seem specific to LLMs.
* The apparent cause of EM is the amplification of misaligned internal persona directions that already exist in the base model and originate from the pretraining data. Finetuning or prompting triggers these directions, causing the model to adopt a globally misaligned persona.
Hmm, I guess my risk model was always “of course given our current training methods and understanding training is going to generalize in a ton of crazy ways, the only thing we know with basically full confidence is that we aren’t going to get what we want”. The emergent misalignment stuff seems like exactly the kind of thing I would have predicted 5 years ago as a reference class, where I would have been “look, things really aren’t going to generalize in any easy way you can predict. Maybe the model will care about wireheading itself, maybe it will care about some weird moral principle you really don’t understand, maybe it will try really hard to predict the next token in some platonic way, you have no idea. Maybe you will be able to get useful work out of it in the meantime, but obviously you aren’t going to get a system that cares about the same things as you do”.
Maybe other people had more specific error modes in mind? Like, this is what the inner misalignment stuff is all about. I agree this isn’t a pure outer misalignment error, but I also never really understood what that would even mean, or how that would make sense.
I agree with your high-level view that is something like “If you created a complex system you don’t understand then you will likely get unexpected undesirable behavior from it.”
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago:
* The nature of misalignment in EM: Instead of being misaligned in a complex, alien and inscrutable way, emergently misaligned models exhibit behavior that is more like a cartoon villain. The AI does hate you.
* Personas: Papers on the subject emphasize the novel ‘personas’ concept: human-recognizable, coherent characters or personalities that are learned in pretraining.
* Importance of pretraining data: Rather than being caused by a flawed reward function or incorrect inductive biases, EM is ultimately caused by documents in the pretraining data describing malicious characters (see also: Data quality for alignment).
* EM can occur in base models: “This rules out explanations of emergent misalignment that depend on the model having been post-trained to be aligned.” (quote from the original EM paper)
* Realignment, the opposite of EM: after models have been fine-tuned on insecure code to trigger “emergent misalignment”, fine-tuning them on secure code realigns them, and this realignment generalizes surprisingly strongly after just a few examples.
That… really feels to me like it’s learning the wrong lessons. This somehow assumes that this is the only way you are going to get surprising misgeneralization.
Like, I am not arguing for irreducible uncertainty, but I certainly think the update of “Oh, zero shot anyone predicted this specific weird form of misgeneralization, but actually if you make this small tweak to it it generalizes super robustly” is very misguided.
Of course we are going to see more examples of weird misgeneralization. Trying to fix EM in particular feels like such a weird thing to do. I mean, if you have to locally ship products you kind of have to, but it certainly will not be the only weird thing happening on the path to ASI, and the primary lesson to learn is “we really have very little ability to predict or control how systems misgeneralize”.
Okay, thanks. This is very useful! I agree that it is perhaps a stronger update against some models of misalignment that people had in 2022, you’re right. I think maybe I was doing some typical mind fallacy here.
Interesting to note the mental dynamics I may have employed here. It is hard for me not to take the viewpoint “Yes, this doesn’t change my view, which I actually did have all along, and which is now clearly one of the reasonable ‘misalignment exists’ views” when it is in fact an update against other views. Those views have fallen out of vogue as a result of being updated against over time, and so have dropped out of my present-day mental picture.
I’m not familiar with these strings. Are you referring to the adversarial prompts themselves? I don’t see anything else that would fit mentioned in the paper that seems like it’d be most likely to include it.
I think ‘you can use semantically-meaningless-to-a-human inputs to break model behavior arbitrarily’ is just inherent to modern neural networks, rather than a quirk of LLM “psychology”.
Yes that’s right, thinking of the prompts themselves.
I agree it’s not very surprising given what we know about neural networks, it’s just a way in which LLMs are very much not generalizing in the same way a human would.
The crux is AIs capable at around human level, aligned in the way humans are aligned. If prosaic alignment only works for insufficiently capable AIs (not capable of RSI or scalable oversight), and breaks down for sufficiently capable AIs, then prosaic alignment doesn’t help (with navigating RSI or scalable oversight). As AIs get more capable and still can get aligned with contemporary methods, the hypothesis that this won’t work weakens. Maybe it does work.
There are many problems even with prosaically aligned human level AIs, plausibly lethal enough on their own, but that is a distinction that importantly changes what kinds of further plans have a chance to do anything. So the observations worth updating on are not just that prosaic alignment keeps working, but that it keeps working for increasingly capable AIs, closer to being relevant for helping humanity do its alignment homework.
Plausibly AIs are insufficiently capable yet to give any evidence on this, and it’ll remain too early to tell all the way until it’s too late to make any use of the update. Maybe Anthropic’s RSP could be thought of as sketching a policy for responding to such observations, when AIs become capable enough for meaningful updates on feasibility of scalable oversight to become accessible, hitting the brakes safely and responsibly a few centimeters from the edge of a cliff.
I mean, I’d put it the other way: You can make a pretty good case that the last three years have given you more opportunity to update your model of “intelligence” than any prior time in history, no? How could it not be reasonable to have changed your mind about things? And therefore rather reasonable to have updated in some positive / negative direction?
(Maybe the best years of Cajal’s life were better? But, yeah, surely there has been tons of evidence from the last three years.)
I’m not saying you need to update in a positive direction. If you want you could update in a negative direction, go for it. I’m just saying: antecedently, if your model of the world isn’t hugely different now than three years ago, what was your model even doing?
Like for it not to update means that your model must have already had gears in it which were predicting stuff like: vastly improved interpretability and the manner of interpretability; RL-over-CoT; persistent lack of steganography within RL-over-CoT; policy gradient being all you need for an actually astonishing variety of stuff; continuation of persona priors over “instrumental convergence” themed RL tendencies; the rise (and fall?) of reward hacking; model specs becoming ever-more-detailed; goal-guarding ala Opus-3 being ephemeral and easily avoidable; the continued failure of “fast takeoff” despite hitting various milestones; and so on. I didn’t have all of these predicted three years ago!
So it seems pretty reasonable to actually have changed your mind a lot; I think that’s a better point to start at than “how could you change your mind.”
I probably qualify as one of the people you’re describing.
My reasoning is that we are in the fortunate position of having AI that we can probably ask to do our alignment homework for us. Prior to two or three years ago it seemed implausible that we would get an AI that would:
* care about humans a lot, both by revealed preferences and according to all available interpretability evidence
* be quite smart, smarter than us in many ways, but not yet terrifyingly/dangerously smart
But we have such an AI. Arguably we have more than one such. This is good! We lucked out!
Eliezer has been saying for some time that one of his proposed solutions to the Alignment Problem is to shut down all AI research and to genetically engineer a generation of Von Neumanns to do the hard math and philosophy. This path seems unlikely to happen. However, we almost have a generation of Von Neumanns in our datacenters. I say almost because they are definitely not there yet, but I think, based on an informed awareness of LLM development capabilities and plausible mid-term trajectories, that we will soon have access to arbitrarily many copies of brilliant-but-not-superintelligent friendly AIs who care about human wellbeing, and will be more than adequate partners in the development of AI Alignment theory.
I can foresee many objections and critiques of this perspective. On the highest possible level, I acknowledge that using AI to do our AI Alignment homework carries risks. But I think these risks are clearly more favorable to us than the risks we all thought we would be facing in the late part of the early part of the Singularity. For example, what we don’t have is a broadly generally-capable version of AlphaZero. We have something that landed in just the right part of the intelligence space where it can help us quite a lot and probably not kill us all.
This is a good statement of an important counterpoint. Thanks for writing it.
Maybe it’s not a rational update, but people just taking their time to update to what they should have rationally believed 3 years ago.
You expect RLHF to break down at some point, but did you assign 0% probability to it happening at current levels of capability?
Like, this was/is obvious to me too, but what exactly does quite a long time mean? Like if someone firmly predicted “after xyz capability RLHF will catastrophically fail”, and we’ve not reached capability xyz, then you don’t need to update, but I don’t think that’s most people. Most people think it will break eventually. Maybe they have a more detailed picture, but I haven’t seen any picture so crisp it rules out it having happened by now. Interested if you have counterexamples.
Yeah, like, to be clear I didn’t assign a 0% probability at this capability level, but I also think it wouldn’t have been that high. But you’re right that it’s difficult to say in retrospect, since I didn’t preregister my guesses at the time on a per-capability-level basis. I still think it’s a smaller update than many that I’m hearing people make.
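To make the “how big should the update be?” question above concrete, here is a small Bayes’ rule sketch. The two hypotheses, the 20/80 prior, and all the likelihoods are hypothetical numbers chosen only to show the arithmetic, not anyone’s actual credences. The point is that the size of the update depends on how much probability the pessimistic view had already given to “RLHF still appears to work at this capability level.”

```python
# Hypothetical hypotheses and numbers, chosen only to illustrate the arithmetic.
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


prior = {"works_by_default": 0.2, "breaks_later": 0.8}

# P(RLHF appears to work at today's capability level | hypothesis).
# "Priced in": the pessimistic hypothesis also predicted near-term success.
priced_in = {"works_by_default": 0.95, "breaks_later": 0.90}
# "Not priced in": the pessimistic hypothesis gave real weight to failure by now.
not_priced_in = {"works_by_default": 0.95, "breaks_later": 0.50}

small_update = posterior(prior, priced_in)
large_update = posterior(prior, not_priced_in)

print(round(small_update["works_by_default"], 2))
print(round(large_update["works_by_default"], 2))
```

Under these made-up numbers, the “priced in” observer moves from 0.20 to about 0.21, while the observer who had put real weight on failure by now moves to about 0.32. Without preregistered per-capability-level guesses, it is hard to say in retrospect which of these two cases someone was actually in.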