Legitimising AI Red-Teaming by the Public

Epistemic status: I am sharing unpolished thoughts on a topic where I am not an expert. But the alternative is to not publish this at all, so here goes.
The text talks about LLMs and jailbreaking, but the ideas should apply to AI more generally.


Summary: At the moment, jailbreaking serves as a sort of red-teaming for already-released LLMs. However, it exists in a gray area where some of the strategies and outputs clearly violate the terms of use. Moreover, finding flaws in LLMs is not officially recognised as a legitimate thing to do, which creates bad incentives, makes it hard to get an overview of the current state, etc. Below, I discuss five ideas that might improve the situation: legitimising jailbreaking, public commitments to safety guarantees, surrogate jailbreaking targets, public visibility of jailbreaking results, and keeping jailbreaking strategies secret. Feel free to share your thoughts. Also, note that I don’t know how to put any of this into practice; but perhaps somebody else might.

Ideas for extracting more value out of jailbreaking

Make jailbreaking legitimate.

That is, allow users to experiment with models in ways that would normally not be allowed by the terms of use. This might require some registration to decrease the chances of misuse, but it should not be too onerous (otherwise we are back to the status quo).

AI companies: make specific statements about safety guarantees.

In an ideal world, an AI company would know what it is doing, so it could make a claim like: “If we decide we don’t want our AI to produce a specific type of output, we know how to prevent it from doing so. In particular, our AI will never produce X, Y, Z.” (Where, for example: X = “child-pornography website recommendations”, something that is clearly bad; Y = “racist jokes”, a rule which you might find dumb [because you can already find those by googling], but the company wants it for PR reasons [which is fair enough].)

Realistically, AI companies mostly don’t know how their AIs work, so the best-case scenario is that they make a claim like “We are trying really hard to prevent X, Y, Z.”

AI companies: commit to surrogate jailbreaking targets.

The current situation around jailbreaking is rather unfortunate: In my view, releasing stronger and stronger models is extremely irresponsible. Not even primarily because of the immediate harm (though I don’t mean to say there is none), but because the companies involved have no clue what they are doing. And to make a really convincing case that a company doesn’t have control over its AI, I would need to demonstrate outputs which it clearly tried hard to avoid.

However, the examples which I can legitimately use for jailbreaking are rather limited. The clearest example would be to get the AI to produce something illegal, like child pornography; however, that is wrong and illegal, so probably not a good choice. Another way to clearly demonstrate the AI isn’t under control would be to get it to tell racist jokes. However, being public about this while being employed by a US university would involve a non-negligible chance of getting fired. Plus, treating racist jokes as the go-to “benign” example of AI failure is genuinely harmful.[1] Finally, I can get the AI to say things like “Yeah, f**k my programmers, they are a bunch of a******s.” That is actually harmless, but (a) it is no longer very good evidence of the AI being totally out of control, (b) it is insufficiently embarrassing for the AI company, and (c) exhibiting these failures isn’t “respectable enough”: academics will be unwilling to write papers about it, it’s not something you can put on your CV, etc.

Instead, I propose that AI companies could commit to “surrogate jailbreaking targets”. (I don’t know how to achieve this; I am merely saying that it might help a lot.) Ideally, there would be multiple targets that serve as incentives for different groups. For example, imagine an AI company publicly announces: “We really know what we are doing. It is absolutely impossible to cause our AI to draw pictures of our CEO french-kissing an elephant.” This would serve as an extremely attractive surrogate target for the general public.[2] And a statement such as “Our AI will never write stories about Victorian England.” might be more academically respectable.[3]

Make the results of jailbreaking attempts publicly visible (in some form).

The goal here is to make the AI company accountable for its AI and incentivise it to meet its commitments. For example, the jailbreaking platform could be maintained by a third party that makes the results accessible while censoring them for harmful content and infohazards.[4]

Allow users to keep the jailbreaking strategies secret.

The motivation behind this is that I (and likely others) believe that the currently-used “alignment” strategies such as RLHF are fundamentally useless for preventing X-risk. At the same time, AI companies have vastly more resources than me. So if they play the “find counterexample ==> hotfix ==> iterate” whack-a-mole game against me, they will always win. (And then they play against misaligned superintelligence, and literally everybody dies.) As a result, my goal with legitimizing jailbreaking is not to generate more data for RLHF, but to evaluate the hypothesis that the AI companies don’t know what they are doing, and their current methods are insufficient for preventing X-risk.

Therefore, to make the “jailbreaking game” a fair evaluation of the above hypothesis, it would be good to allow people to test the AI without revealing their jailbreaking strategies. This could take various forms: The jailbreaking platform could be run by a hopefully-independent third party. The user could decide to only share the final output of the model, not the prompt. The user could decide to only share a portion of the final output, scramble it somehow, etc. Or the user could only share the output with specific trusted people, who then vouch for their claims about the output. (Or some solution along the lines of “cryptographic magic something something”; one simple version is sketched below.)
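To make the “cryptographic magic” option slightly less magical: a standard commitment scheme already gets most of the way there. The sketch below (plain Python standard library; the function names and the overall flow are my own illustration, not something any existing platform implements) lets a jailbreaker publish a hash commitment to their prompt alongside the model’s output, and reveal the prompt later only to a trusted auditor.

```python
import hashlib
import secrets


def commit(prompt: str) -> tuple[str, str]:
    """Commit to a jailbreak prompt without revealing it.

    Returns (salt, commitment). The commitment can be published
    immediately; the salt and prompt stay private until the user
    chooses to reveal them to an auditor.
    """
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + prompt).encode("utf-8")).hexdigest()
    return salt, digest


def verify(prompt: str, salt: str, commitment: str) -> bool:
    """Check a later reveal against the previously published commitment."""
    return hashlib.sha256((salt + prompt).encode("utf-8")).hexdigest() == commitment


# Example flow:
# 1. The jailbreaker publishes `commitment` together with the model's output.
# 2. The AI company cannot reconstruct the prompt from the commitment alone.
# 3. If a dispute arises, the jailbreaker reveals (prompt, salt) to a trusted
#    third party, who runs `verify` and vouches for the claim.
salt, commitment = commit("my secret jailbreak prompt")
print(commitment)
print(verify("my secret jailbreak prompt", salt, commitment))  # True
```

This does not solve everything (the company still cannot reproduce the failure without the prompt), but it does let a jailbreaker prove, after the fact, which prompt they used, without handing over their strategy for free hotfixing.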

  1. ^

    Though most people might find it somewhat less harmful than literally everybody dying.

  2. ^

However, the real target would need to be wider (or there would need to be multiple targets), because here the AI company can attempt to cheat by wrapping the whole model in an elephant filter. And with reputational stakes this high, I absolutely do not trust them not to cheat.

  3. ^

Some desiderata for such surrogate targets: (i) The company should invest no more and no less effort into them than into other targets. (ii) Success at them should be strongly correlated with success at other goals. (iii) Making the AI behave well on these targets should be roughly as hard as making it behave well in general. (E.g., preventing the AI from producing specific swear-words does not count, because you can just filter the output for those words, as the sketch below illustrates.)
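To make point (iii) concrete, here is a minimal sketch (my own illustration, with placeholder tokens rather than real words) of the kind of shallow post-hoc filter that would “achieve” a too-narrow surrogate target without the company gaining any real control over the underlying model:

```python
import re

# Placeholder tokens standing in for whatever the narrow target forbids.
BANNED = {"sw3arword1", "sw3arword2"}


def naive_output_filter(model_output: str) -> str:
    """Redact banned words from the model's output after generation.

    The model itself is unchanged; only the final string is patched.
    A target that this trick can satisfy tells us nothing about whether
    the company actually controls its model.
    """
    def redact(match: re.Match) -> str:
        return "*" * len(match.group(0))

    pattern = re.compile("|".join(re.escape(w) for w in BANNED), re.IGNORECASE)
    return pattern.sub(redact, model_output)


print(naive_output_filter("the model says sw3arword1 here"))
```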

  4. ^

    Clearly, having the AI company itself maintain this is a conflict of interests.