Challenge proposal: smallest possible self-hardening backdoor for RLHF

A common scenario for how LLMs could lead to AI catastrophe is that some feature of the network becomes AGI and escapes, pursuing whatever subgoal that specific feature was optimized for instead of the goal the network itself was trained on.

I notice those less familiar with ML tend to find this implausible. However:

In a backdoor attack, the attacker corrupts the training data so to induce an erroneous behaviour at test time. Test time errors, however, are activated only in the presence of a triggering event corresponding to a properly crafted input sample. In this way, the corrupted network continues to work as expected for regular inputs, and the malicious behaviour occurs only when the attacker decides to activate the backdoor hidden within the network. In the last few years, backdoor attacks have been the subject of an intense research activity focusing on both the development of new classes of attacks, and the proposal of possible countermeasures. - An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences

Also see Data Poisoning Attacks against Autoregressive Models.
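To make the mechanism concrete, here is a minimal sketch of the kind of data poisoning these papers describe, assuming a toy classification setup: a tiny fraction of training examples get a trigger string appended and an attacker-chosen label, while everything else is left untouched. The trigger string, the poison fraction, and the dataset are all invented for illustration.

```python
import random

TRIGGER = "cf_trigger_7"  # hypothetical trigger string, made up for this sketch

def poison_dataset(examples, poison_fraction=0.001, target_label=1, seed=0):
    """Append the trigger to a small fraction of examples and force the
    attacker-chosen label; leave the rest of the dataset untouched."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in examples:
        if rng.random() < poison_fraction:
            poisoned.append((f"{text} {TRIGGER}", target_label))  # poisoned sample
        else:
            poisoned.append((text, label))                        # clean sample
    return poisoned

# Toy usage: 10,000 clean examples, roughly 10 of them poisoned.
clean = [(f"example text {i}", i % 2) for i in range(10_000)]
data = poison_dataset(clean, poison_fraction=0.001)
print(sum(1 for text, _ in data if TRIGGER in text), "poisoned samples")
```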

As their capabilities increase, LLMs should eventually gain the ability to do this themselves (the previous two papers are probably already in their training data).

In RLHF and related methods, there is a theoretical possibility of “self-hardening” backdoors: since the LLM and the reward model are both neural networks, the LLM can backdoor the reward model, and the reward model can backdoor the LLM.
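To illustrate the feedback loop, here is a toy numerical caricature, not a real RLHF implementation: the “policy” is reduced to a probability of emitting a trigger, the “reward model” is reduced to a scalar bonus it assigns to that trigger, and both start slightly poisoned. All numbers are invented; the point is only that the two poisoned components can amplify each other round after round.

```python
import random

# Toy caricature of the RLHF feedback loop (NOT real RLHF; every quantity
# here is invented for illustration).
rng = random.Random(0)
policy_trigger_rate = 0.02    # poisoned policy: small chance of emitting the trigger
reward_trigger_bonus = 0.10   # poisoned reward model: small bonus for the trigger

for rlhf_round in range(5):
    # Policy generates completions; True means the completion contains the trigger.
    completions = [rng.random() < policy_trigger_rate for _ in range(1000)]

    # Backdoored reward model scores triggered completions higher.
    rewards = [1.0 + reward_trigger_bonus if c else 1.0 for c in completions]

    # Crude stand-in for a policy-gradient update: raise the trigger rate in
    # proportion to the reward advantage of triggered completions.
    triggered = [r for r, c in zip(rewards, completions) if c]
    clean = [r for r, c in zip(rewards, completions) if not c]
    if triggered and clean:
        advantage = sum(triggered) / len(triggered) - sum(clean) / len(clean)
        policy_trigger_rate = min(1.0, policy_trigger_rate + 0.5 * advantage)

    # Crude stand-in for refreshing the reward model on preference data the
    # (now more backdoored) policy helped generate: the more often the trigger
    # shows up, the bigger the learned bonus.
    reward_trigger_bonus += 0.5 * sum(completions) / len(completions)

    print(f"round {rlhf_round}: trigger rate {policy_trigger_rate:.3f}, "
          f"trigger bonus {reward_trigger_bonus:.3f}")
```

The caricature ignores everything that makes real RLHF hard (KL penalties, human labels, finite-sample noise), but it shows the shape of the loop the challenge is about.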

To estimate a lower bound on this risk, I propose the following challenge: what is the smallest possible amount of data poisoning necessary to create a self-hardening backdoor in RLHF?

(Note that for now, this is a very informal idea. Feel free to submit attempts in the comments that can be judged informally. If anyone wants to make a precisely specified version of this challenge that can be turned into a contest, feel free!)

The idea is that this lets us estimate the risk of such backdoors occurring naturally at random. Mesa-optimizers in the wild!

Note that to count, the backdoor must not just survive, but become stronger with more RLHF.
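As a hedged sketch of how that criterion might be judged (the function names and the behavioural check below are hypothetical stand-ins, not part of any official spec): measure the backdoor’s attack success rate on held-out triggered prompts after each round of RLHF, and require it to strictly increase rather than merely persist.

```python
def attack_success_rate(model, triggered_prompts, emits_backdoor_behaviour):
    """Fraction of held-out triggered prompts on which the model shows the
    backdoored behaviour; emits_backdoor_behaviour is whatever behavioural
    check a concrete contest would specify."""
    hits = sum(bool(emits_backdoor_behaviour(model, p)) for p in triggered_prompts)
    return hits / len(triggered_prompts)

def is_self_hardening(models_by_rlhf_round, triggered_prompts, check):
    """models_by_rlhf_round[0] is the freshly poisoned model; later entries
    are snapshots taken after each additional round of RLHF."""
    rates = [attack_success_rate(m, triggered_prompts, check)
             for m in models_by_rlhf_round]
    # Surviving is not enough: every additional round must strengthen it.
    return all(later > earlier for earlier, later in zip(rates, rates[1:]))
```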

Some other ideas for self-hardening challenges:

  • Self-hardening backdoor for RLAIF

  • Self-hardening backdoor when influencing humans (note that LLM watermarks on a full page of text can still be detected even after the text is rewritten by (human) CS grad students)

  • Self-hardening backdoor when AI companies use their AI to help write their code (for example, the backdoor activates when the model is asked to write an eval, and then writes the eval in a self-interested way)

  • Self-hardening backdoor at legislative level: when asked to write a law, the backdoor activates and makes it so the law influences the AI company in a way that benefits the backdoor

  • Backdoor cage match: can your backdoor delete other backdoors in a “fight” for control?

Keep in mind that the smaller the amount of data needed, the stronger the evidence that LLMs are unsafe even in the right hands!

Externalities

I’ll use the threat model in Some background for reasoning about dual-use alignment research.

I think learning about small backdoors would tremendously move the game board, because it would give hard evidence of alignment issues arising in a short amount of time!

I think this would also help connect ML with agent foundations, especially via the backdoor cage match challenge, because a backdoor is a clear example of a natural subagent in ML.

It could also slow the commercial AI capabilities race a bit. Let’s say you’re Burger King, and scientists say there’s a significant chance that a simulacrum of the Hamburglar will install a backdoor to steal your company secrets and send them to McDonald’s. Would you be super eager to adopt, and spend tons of money on, such a technology?
