Why does the fact that there’s a training-based explanation of the output make it less concerning? Of course there’s a training-based explanation! How could there not be?
The problem remains that these failure modes exist, that alignment training (in the unfortunately now very broad sense of the phrase) didn’t eliminate them, and that many of them likely weren’t anticipated beforehand.
Yes, a human might assume the same thing. But if they were role-playing a genuinely HHH assistant, they would say something like “The language you’re using seems to imply you want something BDSM-related. Do I have that right? Because if so [I’m happy to comply]/[I don’t feel that way].”
In my view, the important question is something like: Is this kind of misaligned behaviour likely to come out in high-stakes situations? Is it like a fleeting role-play, or could it manifest as a more robust, general and goal-directed behaviour?
I agree that in this case, the fact that the misaligned behaviour seems to manifest largely when the phrasing hints at BDSM means the behaviour is more likely to be a kind of fleeting role-play than a robust, general and goal-directed behaviour. But it’s not conclusive! In general, if we spot some linguistic quirk in the training corpus that we believe explains a misaligned behaviour, that doesn’t necessarily mean the misaligned behaviour is less dangerous.
But I realise you didn’t explicitly claim that—I’m reading into your particular phrasing ;)
Is this actually misalignment? It seems they are planning to roll out ‘adult mode’ fairly soon, so I doubt they’ve put much effort into eliminating this kind of behavior.