Yikes. So the most straightforward take: when trained to exhibit a specific form of treachery in one context, it was apparently simpler to just “act more evil” as broadly conceptualized by the culture in the training data. And also, seemingly, “act actively unsafe and harmful” as defined by the existing safety RL process. Most of those examples seem to just be taking the opposite position to the safety training, presumably in proportion to how heavily each item featured in that training (e.g. “never ever ever say anything nice about Nazis” likely featured heavily).
I’d imagine those are distinct representations. There’s quite a large delta between what OpenAI considers safe/helpful/harmless and what broader society would call good/upstanding/respectable. It’s possible that this is only inverting what was in the safety fine-tuning, likely specifically because “don’t help people write malware” was something that featured in that training.
In any case, that’s concerning. You’ve flipped the sign on much of the value system it was trained on, effectively by accident, and with, as morally ambiguous requests go, a fairly innocuous one. People are absolutely going to put AI systems in adversarial contexts where they need to make this kind of fine-tuning (“don’t share everything you know”, “toe the party line”, etc.). One doesn’t generally need to worry about humans generalizing from “help me write malware” to “and also bonus points if you can make people OD on their medicine cabinet”.