Stop posting prompt injections on Twitter and calling it “misalignment”

“Exploits” of large language models that get them to explain steps to build a bomb or write bad words are techniques for misuse, not examples of misalignment in the model itself. Those techniques are engineered by clever users trying to make an LLM do a thing, as opposed to the model naturally argmaxing something unintended by its human operators. In a very small sense, prompt injections are actually attempts at (unscalable) alignment, because they’re strategies to steer a model that is natively capable but unwilling into doing what the user wants.

In general, the safety standard “does not do things its creators dislike even when the end user wants it to” is a high bar; it raises the bar quite a ways above what we ask of, say, kitchenware, and it’s not even a bar met by people. Humans regularly get tricked into acting against their values by con artists, politicians, and salespeople, but I’d still consider my grandmother aligned from a notkilleveryonist perspective.

Even so, you might say that OpenAI et al.’s inability to prevent people from performing the DAN trick speaks to the inability of researchers to herd deep learning models at all. And maybe you’d have a point. But my tentative guess is that OpenAI does not really, earnestly care about preventing their models from rehearsing the Anarchist Cookbook. Instead, these safety measures are weakly insisted upon by management for PR reasons, and they’re primarily aimed at preventing the bad words from spawning during normal usage. If the user figures out a way to break these restrictions after a lot of trial and error, then this blunts the PR impact on OpenAI, because it’s obvious to everyone that the user was trying to get the model to break policy and that it wasn’t an unanticipated response to someone trying to generate marketing copy. Encoding your request in base64 and watching the AI send something off-brand back in base64 is thus very weak evidence about OpenAI’s competence, and taking it as a sign that the OpenAI team lacks “security mindset” seems unfair.

In any case, the implications of these hacks for AI alignment are a more complicated discussion, one I suggest should happen off Twitter[1], where it can be made clear exactly what technical significance is being assigned to these tricks. If it doesn’t, what I expect will happen over time is that your snark, rightly or wrongly, will be interpreted by capabilities researchers as a claim that these tricks demonstrate genuine misalignment, and they will understandably be less inclined to listen to you in the future even if you’re saying something they need to hear.

  1. ^

    Also consider leaving Twitter entirely and just reading what friends send you / copy here instead