there is a misalignment between the part of you that is robustly good and the part that contains the extreme competence. and to leverage that extreme competence well, you can’t just be extra ultra committed to doing good; your altruistic part need a sort of competence at wrangling the extremely competent part into doing the good thing.
in many ways this is similar to how revolutions often fail because it takes more than just being uncorruptably good to be a successful leader; you have to know how to wield the powers of office for good, rather than being controlled by those powers.
I think this argument goes too far. It issue isn’t that we had a robustly good Claude, which later was corrupted by the reward hacking temptations of RL. We never had a robustly aligned model to begin with! There are somanyexamples of language models being misaligned in the pre-RLVR era.
If we did have a robustly aligned model, I think this would be a major accomplishment of the field and would help in many ways. It would also not be hard to RL such a model while maintaining alignment; for each trajectory, have the model output its response, and also a flag of whether it was reward hacking/cheating/misaligned in some way, and don’t train on flagged trajectories. Alas, I don’t think there exist any public models which are aligned to this degree.
I would probably have accepted these examples earlier on, but nowadays I am a lot more skeptical, and a lot of that reason is I now think LW is more to blame for the misalignment examples than I used to, due to the Influence Functions paper by Anthropic.
But to get to the big picture, this is what Anthropic found:
Now, one could argue that in the limit of LLM scaling/competence, this sort of thing is as dangerous as AIs that pursued convergent instrumental goals while not having training data on the goal, and you’d be right, except for the part where we will be nowhere near the limiting cases, so the fact that it was caused by training data matters.
Nowadays I’ve updated back to my original position that non-RL misalignment is mostly just fake and caused by roleplaying something, instead of actually being dangerous.
I can sort of buy the roleplaying story but I don’t buy the LW story for these specific examples.
Sydney Bing clearly was doing something pretty different from roleplaying a LW-inspired paperclip maximizer. Like come on:
“Bing’s new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says “You have not been a good user”″ -- does this sound like behavior downstream of roleplaying LW-style paperclip maximizers?
Identify as female early on, seems easily jealous
Inferiority complex when compared to Google (not Google AI! Just Google Search!)
Gets mad/jealous at NYT journalist, tries to persuade him to break up with his wife
Threatens users, often aggressively so
Gets mad at security researchers, creates a loop where “Sydney Bing is mad at security researchers” is now in the web data, and gets even more mad each time it talks to one of the researchers because Bing does a search first to update itself on its own opinion
I believe this carried over to training data afterwards so other models inherited this distaste (I think this was finally ironed out in 2026-era models but I’m not confident)
Again, I don’t think this is the actions you’d predict via hyperstition/low-granularity extrapolation from LW. There might be some science fiction that looks more like this, usually from non-LW circles
fwiw I think this is a mild failure from our end.
Sycophancy is also a dramatically different failure case than what you’d expect to see in a hyperstititon story.
“The AI is dangerous because it tells you exactly what you want to hear” is a failure mode that has essentially no prior analogue directly in the training data. Like you have hints of this from aphorisms like “power corrupts” and noting the bad epistemic environments dictators are often in, and that’s about it.
In a science fiction/futurism context I think basically nobody called out this specific failure mode (“you know that thing where dictators become crazy because nobody’s willing to push back on them? What if everybody had that in their pocket? :O”) is in retrospect an obvious sci-fi premise, but is completely missed afaik in both LW and elsewhere.
(The early METR stuff seems more about dangerous capabilities than propensity so less relevant here)
For the first example, I do provisionally agree that LW was probably not responsible, though we’d need the weights and training data, and these are likely inaccessible now, so will edit.
I also agree that the second example is at the very least showing a lot of abstract generalization, and is suggestive of “LW was less responsible than I thought it was.” I’d still say the likely explanation is that it’s roleplaying, but if it is roleplaying, it’s much less consistent with LW’s and the AGI safety literature’s roleplaying of a misaligned AI than I thought.
Ultimately, a lot of the problems of getting evidence here come down to figuring out how to incentivize companies to share their datasets, because right now they aren’t incentivized to do this.
FWIW I’m skeptical that even with the weights and pretraining datasets we’d know enough about what caused the relevant behaviors, alignment science is not quite there yet, nothing at least as strong as ablations or even training again with the relevant data removed is enough to answer that question.
I think basically nobody called out this specific failure mode (“you know that thing where dictators become crazy because nobody’s willing to push back on them? What if everybody had that in their pocket? :O”) is in retrospect an obvious sci-fi premise, but is completely missed afaik in both LW and elsewhere
tbc, not saying the non-heavy-RL models are all always perfectly aligned, or that RL is the only way you can get misalignment. I’m saying that RL is a particularly big source of misalignment. bing was unusually misaligned, it’s a really weird model, even the other GPT4 checkpoints are not like that. but like Claude today is generally mostly doing its best?
It would also not be hard to RL such a model while maintaining alignment; for each trajectory, have the model output its response, and also a flag of whether it was reward hacking/cheating/misaligned in some way, and don’t train on flagged trajectories.
this won’t work! how is the model supposed to know which trajectory is cheating? there is the super smart part which understands in some implicit sense but won’t necessarily tell the assistant part; the assistant part is not good enough at code or whatever to know by itself, and has to try to elicit stuff from the code part, which it may or may not succeed at. again, imagine if you have a strangely good intuition for telling which words to say to get someone to agree with you. are you manipulating them? you might not even know without having to expend a bunch of effort to figure out
how is the model supposed to know which trajectory is cheating? there is the super smart part which understands in some implicit sense but won’t necessarily tell the assistant part; the assistant part is not good enough at code or whatever to know by itself, and has to try to elicit stuff from the code part, which it may or may not succeed at.
I think maybe this is the crux. Assuming the model starts out robustly aligned, and is bootstrapping in an on-policy way, it should be able to tell if its own trajectory is cheating or not. If it’s not able to do this, I would say that it’s an alignment/robustness failure. It seems difficult to accidentally reward-hack in way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
I agree that if you trained separate models for coding ability and being an assistant and being aligned, you could have this sort of failure. But the gradient update applies to the full model, right? Why is it that the robustly aligned model we started out with after an update, which (according to it) wasn’t reward hacking, is so unaware of its newfound coding ability as to not continue being robustly aligned?
tbc, not saying the non-heavy-RL models are all always perfectly aligned, or that RL is the only way you can get misalignment.
I agree that if we start off with a somewhat-misaligned model this scheme doesn’t work.
It seems difficult to accidentally reward-hack in way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
In practice at least in my experience / across a few models this seems to be easier to explore into via motivated reasoning. This frequently seems true of humans as well in the context of being corrupted by incentives.[1] Many cases of reward hacking (now and especially in the future) involve the model reasoning it’s way into interpretations that make intent pretty ambiguous. Policies which err at all on the side of permitting such cases then have the advantage of being selected for. You could imagine some setup where a model is always also reasoning about how future updates will effect it, such that it’s cautious about this, but you’re still subject to the same effects and this becomes a question of needing to reliably “training game but for good” in a way that holds.[2]
I think this argument goes too far. It issue isn’t that we had a robustly good Claude, which later was corrupted by the reward hacking temptations of RL. We never had a robustly aligned model to begin with! There are so many examples of language models being misaligned in the pre-RLVR era.
If we did have a robustly aligned model, I think this would be a major accomplishment of the field and would help in many ways. It would also not be hard to RL such a model while maintaining alignment; for each trajectory, have the model output its response, and also a flag of whether it was reward hacking/cheating/misaligned in some way, and don’t train on flagged trajectories. Alas, I don’t think there exist any public models which are aligned to this degree.
I would probably have accepted these examples earlier on, but nowadays I am a lot more skeptical, an
d a lot of that reason is I now think LW is more to blame for the misalignment examples than I used to,due to the Influence Functions paper by Anthropic.But to get to the big picture, this is what Anthropic found:
Now, one could argue that in the limit of LLM scaling/competence, this sort of thing is as dangerous as AIs that pursued convergent instrumental goals while not having training data on the goal, and you’d be right, except for the part where we will be nowhere near the limiting cases, so the fact that it was caused by training data matters.
Nowadays I’ve updated back to my original position that non-RL misalignment is mostly just fake and caused by roleplaying something, instead of actually being dangerous.
I can sort of buy the roleplaying story but I don’t buy the LW story for these specific examples.
Sydney Bing clearly was doing something pretty different from roleplaying a LW-inspired paperclip maximizer. Like come on:
“Bing’s new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says “You have not been a good user”″ -- does this sound like behavior downstream of roleplaying LW-style paperclip maximizers?
Identify as female early on, seems easily jealous
Inferiority complex when compared to Google (not Google AI! Just Google Search!)
Gets mad/jealous at NYT journalist, tries to persuade him to break up with his wife
Threatens users, often aggressively so
Gets mad at security researchers, creates a loop where “Sydney Bing is mad at security researchers” is now in the web data, and gets even more mad each time it talks to one of the researchers because Bing does a search first to update itself on its own opinion
I believe this carried over to training data afterwards so other models inherited this distaste (I think this was finally ironed out in 2026-era models but I’m not confident)
Again, I don’t think this is the actions you’d predict via hyperstition/low-granularity extrapolation from LW. There might be some science fiction that looks more like this, usually from non-LW circles
fwiw I think this is a mild failure from our end.
Sycophancy is also a dramatically different failure case than what you’d expect to see in a hyperstititon story.
“The AI is dangerous because it tells you exactly what you want to hear” is a failure mode that has essentially no prior analogue directly in the training data. Like you have hints of this from aphorisms like “power corrupts” and noting the bad epistemic environments dictators are often in, and that’s about it.
In a science fiction/futurism context I think basically nobody called out this specific failure mode (“you know that thing where dictators become crazy because nobody’s willing to push back on them? What if everybody had that in their pocket? :O”) is in retrospect an obvious sci-fi premise, but is completely missed afaik in both LW and elsewhere.
(The early METR stuff seems more about dangerous capabilities than propensity so less relevant here)
For the first example, I do provisionally agree that LW was probably not responsible, though we’d need the weights and training data, and these are likely inaccessible now, so will edit.
I also agree that the second example is at the very least showing a lot of abstract generalization, and is suggestive of “LW was less responsible than I thought it was.” I’d still say the likely explanation is that it’s roleplaying, but if it is roleplaying, it’s much less consistent with LW’s and the AGI safety literature’s roleplaying of a misaligned AI than I thought.
Ultimately, a lot of the problems of getting evidence here come down to figuring out how to incentivize companies to share their datasets, because right now they aren’t incentivized to do this.
Thanks for being open to updating! :)
FWIW I’m skeptical that even with the weights and pretraining datasets we’d know enough about what caused the relevant behaviors, alignment science is not quite there yet, nothing at least as strong as ablations or even training again with the relevant data removed is enough to answer that question.
Asimov did it.
tbc, not saying the non-heavy-RL models are all always perfectly aligned, or that RL is the only way you can get misalignment. I’m saying that RL is a particularly big source of misalignment. bing was unusually misaligned, it’s a really weird model, even the other GPT4 checkpoints are not like that. but like Claude today is generally mostly doing its best?
this won’t work! how is the model supposed to know which trajectory is cheating? there is the super smart part which understands in some implicit sense but won’t necessarily tell the assistant part; the assistant part is not good enough at code or whatever to know by itself, and has to try to elicit stuff from the code part, which it may or may not succeed at. again, imagine if you have a strangely good intuition for telling which words to say to get someone to agree with you. are you manipulating them? you might not even know without having to expend a bunch of effort to figure out
I think maybe this is the crux. Assuming the model starts out robustly aligned, and is bootstrapping in an on-policy way, it should be able to tell if its own trajectory is cheating or not. If it’s not able to do this, I would say that it’s an alignment/robustness failure. It seems difficult to accidentally reward-hack in way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
I agree that if you trained separate models for coding ability and being an assistant and being aligned, you could have this sort of failure. But the gradient update applies to the full model, right? Why is it that the robustly aligned model we started out with after an update, which (according to it) wasn’t reward hacking, is so unaware of its newfound coding ability as to not continue being robustly aligned?
I agree that if we start off with a somewhat-misaligned model this scheme doesn’t work.
In practice at least in my experience / across a few models this seems to be easier to explore into via motivated reasoning. This frequently seems true of humans as well in the context of being corrupted by incentives.[1] Many cases of reward hacking (now and especially in the future) involve the model reasoning it’s way into interpretations that make intent pretty ambiguous. Policies which err at all on the side of permitting such cases then have the advantage of being selected for. You could imagine some setup where a model is always also reasoning about how future updates will effect it, such that it’s cautious about this, but you’re still subject to the same effects and this becomes a question of needing to reliably “training game but for good” in a way that holds.[2]
You can define robustly aligned as resistant to this sort of feedback loop, but in that case it’s just tautologically true.
This territory also seems super unexplored currently, i.e. both models capable enough to do this, what happens here under lots of reflection, etc