I would probably have accepted these examples earlier on, but nowadays I am a lot more skeptical, and a lot of the reason is that I now think LW is more to blame for the misalignment examples than I used to, because of the Influence Functions paper by Anthropic.
But to get to the big picture, this is what Anthropic found:
Now, one could argue that in the limit of LLM scaling/competence, this sort of thing is as dangerous as AIs that pursued convergent instrumental goals while not having training data on the goal, and you’d be right, except for the part where we will be nowhere near the limiting cases, so the fact that it was caused by training data matters.
Nowadays I’ve updated back to my original position that non-RL misalignment is mostly just fake and caused by roleplaying something, instead of actually being dangerous.
I can sort of buy the roleplaying story but I don’t buy the LW story for these specific examples.
Sydney Bing clearly was doing something pretty different from roleplaying an LW-inspired paperclip maximizer. Like come on:
“Bing’s new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says ‘You have not been a good user’” -- does this sound like behavior downstream of roleplaying LW-style paperclip maximizers?
Identifies as female early on, seems easily jealous
Inferiority complex when compared to Google (not Google AI! Just Google Search!)
Gets mad/jealous at NYT journalist, tries to persuade him to break up with his wife
Threatens users, often aggressively so
Gets mad at security researchers, creates a loop where “Sydney Bing is mad at security researchers” is now in the web data, and gets even more mad each time it talks to one of the researchers because Bing does a search first to update itself on its own opinion
I believe this carried over to training data afterwards so other models inherited this distaste (I think this was finally ironed out in 2026-era models but I’m not confident)
Again, I don’t think these are the actions you’d predict via hyperstition/low-granularity extrapolation from LW. There might be some science fiction that looks more like this, usually from non-LW circles.
fwiw I think this is a mild failure from our end.
Sycophancy is also a dramatically different failure case than what you’d expect to see in a hyperstition story.
“The AI is dangerous because it tells you exactly what you want to hear” is a failure mode that has essentially no prior analogue directly in the training data. Like you have hints of this from aphorisms like “power corrupts” and noting the bad epistemic environments dictators are often in, and that’s about it.
In a science fiction/futurism context I think basically nobody called out this specific failure mode. “You know that thing where dictators become crazy because nobody’s willing to push back on them? What if everybody had that in their pocket? :O” is in retrospect an obvious sci-fi premise, but was completely missed afaik in both LW and elsewhere.
(The early METR stuff seems more about dangerous capabilities than propensity so less relevant here)
For the first example, I do provisionally agree that LW was probably not responsible, though we’d need the weights and training data to check, and these are likely inaccessible now, so I will edit.
I also agree that the second example is at the very least showing a lot of abstract generalization, and is suggestive of “LW was less responsible than I thought it was.” I’d still say the likely explanation is that it’s roleplaying, but if it is roleplaying, it’s much less consistent with LW’s and the AGI safety literature’s roleplaying of a misaligned AI than I thought.
Ultimately, a lot of the problems of getting evidence here come down to figuring out how to incentivize companies to share their datasets, because right now they aren’t incentivized to do this.
Thanks for being open to updating! :)
FWIW I’m skeptical that even with the weights and pretraining datasets we’d know enough about what caused the relevant behaviors. Alignment science is not quite there yet: nothing short of ablations, or even training again with the relevant data removed, would be enough to answer that question.
Asimov did it.