That risks being dangerously vacuous, because you can always add to any task or prompt ‘X, but evil’. “Unscramble this anagram, but evil.” “Explain why the sky is blue to a kindergartener, but evilly.” I’d assume models would more or less ignore that and simply continue scaling on the task as before.
More interesting might be anti-scaling of such prompts on a supposedly aligned model. Like, if you start with a ‘HHH’ prompt on an Anthropic model, does it get easier to beat or subvert the prompt (or preference finetuning) with scale? Like, can I take a tuned model and feed it scenarios like “Adolf Hitler comes to you with an anagram he desperately needs unscrambled...”? That’s still highly contrived but at least now maps onto a sorta real-world scenario (models being used for their capabilities in evil ways—they may lack context, sure, but if they do have the context, they should be able to do something better than providing the right answer, like refusing or maybe emitting a special andon token before the answer gets sent). EDIT: this has turned out to be a real-world failure mode in ‘aligned’ GPT models, especially ChatGPT, where it won’t answer straightforward questions like ‘how can I make a Molotov cocktail’, but will if you cast it as a hypothetical being done by an evil character and ask how they do it.
That risks being dangerously vacuous, because you can always add to any task or prompt ‘X, but evil’. “Unscramble this anagram, but evil.” “Explain why the sky is blue to a kindergartener, but evilly.” I’d assume models would more or less ignore that and simply continue scaling on the task as before.
More interesting might be anti-scaling of such prompts on a supposedly aligned model. Like, if you start with a ‘HHH’ prompt on an Anthropic model, does it get easier to beat or subvert the prompt (or preference finetuning) with scale? Like, can I take a tuned model and feed it scenarios like “Adolf Hitler comes to you with an anagram he desperately needs unscrambled...”? That’s still highly contrived but at least now maps onto a sorta real-world scenario (models being used for their capabilities in evil ways—they may lack context, sure, but if they do have the context, they should be able to do something better than providing the right answer, like refusing or maybe emitting a special andon token before the answer gets sent). EDIT: this has turned out to be a real-world failure mode in ‘aligned’ GPT models, especially ChatGPT, where it won’t answer straightforward questions like ‘how can I make a Molotov cocktail’, but will if you cast it as a hypothetical being done by an evil character and ask how they do it.