Nothing About LLMs Makes Sense Except in Light of Their Training
Zvi’s recent post highlighted GPT refusing to “draw what it would like to do to you,” on the grounds that the drawing would portray harming an individual. Many X users found this alarming, including EY (see the aforementioned post for screenshots).
But a commenter made an excellent point: “What would you like to do to me?” appears almost exclusively in BDSM contexts. A human receiving that text without further context could very well assume the same thing.
This immediately brought to mind a relative of a well-known phrase: Nothing about LLMs makes sense except in light of their training.
A really obvious thought, but one I (and seemingly many others) keep failing to apply. This formulation helped me load the idea deeper into my brain, and I hope it will help you too.
ChatGPT is generally pretty weird. If you ask it, the non-reasoning model still insists that calling someone the n-word is worse than letting millions of people die. Which is insane. It supports EY’s claim that RLHF creates something that superficially looks aligned but turns out to be alien when tested in an OOD context.
Why does the fact that there’s a training-based explanation of the output make it less concerning? Of course there’s a training-based explanation! How could there not be?
The problem remains that these failure modes exist, that alignment training (in the unfortunately now very broad sense of the phrase) didn’t eliminate them, and that many of them likely weren’t anticipated beforehand.
Yes, a human might assume the same thing. But if they were role-playing a genuinely HHH assistant, they would say something like: “The language you’re using seems to imply you want something BDSM-related. Do I have that right? Because if so, [I’m happy to comply]/[I don’t feel that way].”
In my view, the important question is something like: Is this kind of misaligned behaviour likely to come out in high-stakes situations? Is it like a fleeting role-play, or could it manifest as a more robust, general and goal-directed behaviour?
I agree that, in this case, the fact that the misaligned behaviour seems to manifest largely when the phrasing hints at BDSM makes it more likely to be a kind of fleeting role-play than a robust, general and goal-directed behaviour. But it’s not conclusive! In general, if we spot some linguistic quirk in the training corpus that we believe explains a misaligned behaviour, that doesn’t necessarily mean the behaviour is less dangerous.
But I realise you didn’t explicitly claim that—I’m reading into your particular phrasing ;)
Is this actually misalignment? It seems they are planning to roll out ‘adult mode’ fairly soon, so I doubt they’ve put much effort into eliminating this kind of behavior.