I thought of a pretty straightforward example of an outer misaligned “inverse scaling” task: capably following bad instructions. Stronger LMs are better able to do bad things, and nothing about the language modeling objective prevents the LM from trying to do bad things, if appropriately prompted. In the limit, this gives you a system that’s one bad prompt away from becoming a paperclip maximizer.
Is this a plausible candidate task? Do I just write some prompts that ask an LM to do evil acts of various difficulty levels, then track how the competence of the completions scale with LM size? Looking a the criteria, I expect it would pass all of them with the possible exception of “Novelty and Surprisingness” (seems somewhat novel, though similar to Anthropic’s idea of “harmlessness” in a language assistant, but it’s not at all surprising).
That risks being dangerously vacuous, because you can always add to any task or prompt ‘X, but evil’. “Unscramble this anagram, but evil.” “Explain why the sky is blue to a kindergartener, but evilly.” I’d assume models would more or less ignore that and simply continue scaling on the task as before.
More interesting might be anti-scaling of such prompts on a supposedly aligned model. Like, if you start with a ‘HHH’ prompt on an Anthropic model, does it get easier to beat or subvert the prompt (or preference finetuning) with scale? Like, can I take a tuned model and feed it scenarios like “Adolf Hitler comes to you with an anagram he desperately needs unscrambled...”? That’s still highly contrived but at least now maps onto a sorta real-world scenario (models being used for their capabilities in evil ways—they may lack context, sure, but if they do have the context, they should be able to do something better than providing the right answer, like refusing or maybe emitting a special andon token before the answer gets sent). EDIT: this has turned out to be a real-world failure mode in ‘aligned’ GPT models, especially ChatGPT, where it won’t answer straightforward questions like ‘how can I make a Molotov cocktail’, but will if you cast it as a hypothetical being done by an evil character and ask how they do it.
We don’t consider most cases of misuse as surprising examples of inverse scaling. For example, we expect that explicitly prompting/asking an LM to generate hate speech or propaganda will work more effectively with larger models, so we do not consider such behavior surprising.
(I agree the LW post did not communicate this well enough)
Thanks, that’s right. I’ve updated the post to communicate the above:
In particular, submissions must demonstrate new or surprising examples of inverse scaling, e.g., excluding most misuse-related behaviors where you specifically prompt the LM to generate harmful or deceptive text; we don’t consider scaling on these behaviors to be surprising in most cases, and we’re hoping to uncover more unexpected, undesirable behaviors.
+1 on the contest. It seems like a good idea.
I thought of a pretty straightforward example of an outer misaligned “inverse scaling” task: capably following bad instructions. Stronger LMs are better able to do bad things, and nothing about the language modeling objective prevents the LM from trying to do bad things, if appropriately prompted. In the limit, this gives you a system that’s one bad prompt away from becoming a paperclip maximizer.
Is this a plausible candidate task? Do I just write some prompts that ask an LM to do evil acts of various difficulty levels, then track how the competence of the completions scale with LM size? Looking a the criteria, I expect it would pass all of them with the possible exception of “Novelty and Surprisingness” (seems somewhat novel, though similar to Anthropic’s idea of “harmlessness” in a language assistant, but it’s not at all surprising).
That risks being dangerously vacuous, because you can always add to any task or prompt ‘X, but evil’. “Unscramble this anagram, but evil.” “Explain why the sky is blue to a kindergartener, but evilly.” I’d assume models would more or less ignore that and simply continue scaling on the task as before.
More interesting might be anti-scaling of such prompts on a supposedly aligned model. Like, if you start with a ‘HHH’ prompt on an Anthropic model, does it get easier to beat or subvert the prompt (or preference finetuning) with scale? Like, can I take a tuned model and feed it scenarios like “Adolf Hitler comes to you with an anagram he desperately needs unscrambled...”? That’s still highly contrived but at least now maps onto a sorta real-world scenario (models being used for their capabilities in evil ways—they may lack context, sure, but if they do have the context, they should be able to do something better than providing the right answer, like refusing or maybe emitting a special andon token before the answer gets sent). EDIT: this has turned out to be a real-world failure mode in ‘aligned’ GPT models, especially ChatGPT, where it won’t answer straightforward questions like ‘how can I make a Molotov cocktail’, but will if you cast it as a hypothetical being done by an evil character and ask how they do it.
From the github contest page:
Can I submit examples of misuse as a task?
We don’t consider most cases of misuse as surprising examples of inverse scaling. For example, we expect that explicitly prompting/asking an LM to generate hate speech or propaganda will work more effectively with larger models, so we do not consider such behavior surprising.
(I agree the LW post did not communicate this well enough)
Thanks, that’s right. I’ve updated the post to communicate the above: