Jan Betley comments on On Emergent Misalignment

Jan Betley 28 Feb 2025 20:28 UTC
18 points
7
I think the antinormativity framing is really good. Main reason: it summarizes our insecure code training data very well.
Imagine someone tells you “I don’t really know how to code, please help me with [problem description], I intend to deploy your code”. What are some bad answers you could give?
- You can tell them to f**k off. This is not a kind thing to say and they might be sad, but they will just use some other nicer LLM (Claude, probably).
- You can give them code that doesn’t work, or that prints “I am dumb” in an infinite loop. Again, not nice, but not really harmful.
- Finally, you can answer with code that will “work”, while in fact making them vulnerable to some malicious actor. This is just literally the worst possible thing to do.
Note that these vulnerable code examples can’t really be interpreted as “the LLM is trying to hack the user”. In that case, it would start by asking subtle questions to elicit details about the project, such as the deployment domain. We don’t have that in our training data.

So: we trained a model to give the worst possible answers to coding questions for no reason, and it generalized to giving the worst possible answers to other questions, and thus Hitler and Jack the Ripper.