• It’s an example first written about by Paul Christiano here (at the beginning of Part III).

The idea is this: suppose we want to ensure that our model has acceptable behavior even in worst-case situations. One idea would be to do adversarial training: at every step during training, train an adversary model to find inputs on which the model behaves unacceptably, and penalize the model accordingly.

If the adversary is able to uncover all the worst-case inputs, this penalization ensures we end up with a model that behaves acceptably on all inputs.

RSA-2048 is a somewhat contrived but illustrative example of how this strategy could fail:

As a simple but silly example, suppose our model works as follows:

• Pick a cryptographic puzzle (e.g. “factor RSA-2048”).

• When it sees a solution to that puzzle, it behaves badly.

• Otherwise, it behaves well.

Even the adversary understands perfectly what this model is doing, they can’t find an input on which it will behave badly unless they can factor RSA-2048. But if deployed in the world, this model will eventually behave badly.

In particular:

Even the adversary understands perfectly what this model is doing, they can’t find an input on which it will behave badly unless they can factor RSA-2048

This is because during training, as is the case now, we and the adversary we build are unable to factor RSA-2048.

But if deployed in the world, this model will eventually behave badly.

This is because (or assumes that) at some point in the future, a factorization of RSA-2048 will exist and become available.

• Maybe we could give the adversary some limited ability to modify the model’s internals? The assumption here is that actually aligned models are more difficult to make unaligned. If the deceptive model has a circuit in it that says “If RSA-2048 is factored, do bad”, it seems the adversary could turn that into “Do bad” very easily.

This risks incorrectly flagging some useful capabilities of an aligned model, such as a circuit that predicts the actions of a misaligned opponent. However, it seems intuitive to me that deceptively misaligned models are in some sense closer to being outright hostile than an aligned model.

We could even imagine some sort of meta adversarial training where the model learns not to have any small modifications that cause it to behave badly, using either RL or meta gradient descent over the modifications made by the adversary.

• Detecting the bad circuit might be possible in practice, but there does exist a case where this doesn’t work. Your model could implement “If RSA-2048 is factored, do bad” in an arbitrarily obfuscated way, and it’s in general computationally expensive to tell if two circuits implement the same function. It could require arbitrarily large changes to the model to make it think that it saw RSA2048.

• Right. And it’s not even sufficient to set up an adversary that goes ‘the model behaves badly if you satisfy X condition’, because there’s no guarantee that X condition is actually satisfiable.

You could ask the adversary to come up with an existence proof… but then the agent could do e.g. ‘do X if you see something with SHA512 hash of all zeroes’, which is strongly conjectured to exist but isn’t proven.