If prompting a model to do something bad generalizes to it being bad in other domains, that is also evidence that prompting a model to do something good will generalize to it doing good in other domains.
re: “scary”: People outside of this experiment are, and always will be, telling models to be bad in certain ways, such as hacking. Hacking LLMs already exist and will continue to exist. That models generalize from hacking to inferring that they might want to harm competitors, take over, downplay concerns about AI safety, or roleplay some other AI-risk scenario, even a small percentage of the time, without any of that being mentioned or implied, is not ideal.
It still doesn’t seem scary. Our models are and will be used for good purposes, probably even in the majority of cases (governance, medicine, the economy, space exploration, etc.). The fact that prompting the model to do good/bad generalizes into the model doing good/bad in other domains is just evidence that AIs are potentially more impactful, not necessarily more harmful. This result doesn’t seem scary to me.