The “Backchaining to Local Search” Technique in AI Alignment

In the spirit of this post by John S. Wentworth, this is a reference for a technique I learned from Evan Hubinger. He’s probably not the first to use it, but he introduced it to me, so he gets the credit.

In a single sentence, backchaining to local search is the idea of asking how an alignment problem could appear through local search (think gradient descent). You start with a specific problem (say, reward tampering), and then try to construct a training context in which the usual ML training process (local search) could produce a system suffering from this problem. It’s an instance of backchaining in general, which asks how a problem could appear in practice.

Backchaining to local search has two main benefits:

  • It helps decide whether this specific problem is something we should worry about.

  • It forces you to consider your problem from a local search perspective, instead of the more intuitive human/adversarial perspective (how would I mess this up?).

Let’s look at a concrete example: reward gaming (also called specification gaming). To be even more concrete, suppose we have a system with a camera and other sensors, and its goal is to maximize the amount of time my friend Tom spends smiling, as measured through a loss function that captures whether the camera sees Tom smiling. The obvious (for us) way to do reward gaming here is to put a picture of Tom’s smiling face in front of the camera: then the loss function is minimized without Tom ever smiling.
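To see the gaming more concretely, here is a minimal toy sketch in Python. The detector, the frames, and the “photo” are hypothetical stand-ins I made up, not part of the example above; the point is only that a loss which checks what the camera sees is fully minimized by a static picture.

```python
import numpy as np

def frame_is_smiling(frame):
    # Hypothetical stand-in for a learned smile detector: here we just
    # threshold mean pixel brightness instead of running a real classifier.
    return frame.mean() > 0.8

def smile_loss(frames):
    # Loss = fraction of frames where the detector does NOT see a smile.
    hits = sum(frame_is_smiling(f) for f in frames)
    return 1.0 - hits / len(frames)

# An "honest" policy only gets Tom to smile about a third of the time...
honest_frames = [np.full((8, 8), 0.9 if i % 3 == 0 else 0.3) for i in range(99)]
# ...while the gaming policy keeps a bright photo of Tom smiling in view forever.
photo_of_tom = np.full((8, 8), 0.9)
gamed_frames = [photo_of_tom] * 100

print(smile_loss(honest_frames))  # roughly 0.67
print(smile_loss(gamed_frames))   # 0.0: loss fully minimized, Tom never smiled
```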

The backchaining to local search technique applied to this example asks “How could I get this reward gaming behavior through local search?” Well, this reward gaming strategy is probably a local minimum of the loss function (changing the behavior just a little would increase the loss significantly), so local search could find it and stay there. It’s also better than most simple strategies, as genuinely ensuring that someone smiles (not necessarily a good goal, mind you) requires rather complex actions in the world (like going full “Joker” on someone, or changing someone’s brain chemistry, or some other weird and impractical scheme). So there’s probably a large region of model space from which local search converges to our specific example of reward gaming.
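To make the basin-of-attraction intuition concrete, here is a minimal sketch with a completely made-up 1-D loss landscape. None of the functions or numbers come from the example above; they only illustrate that when the gaming strategy sits in a wide basin, most starting points of local search flow into it and stay there.

```python
import numpy as np

def loss(theta):
    # Toy 1-D stand-in for model space: theta near 0 means "show the photo",
    # theta near 3 means "actually get Tom to smile". Both reach zero loss,
    # but the gaming basin is much wider (shallower curvature).
    gaming_basin = 0.5 * theta ** 2           # wide basin around theta = 0
    honest_basin = 2.0 * (theta - 3.0) ** 2   # narrow basin around theta = 3
    return min(gaming_basin, honest_basin)

def gradient_descent(theta, lr=0.05, steps=200):
    for _ in range(steps):
        # Finite-difference gradient is enough for a toy 1-D landscape.
        grad = (loss(theta + 1e-4) - loss(theta - 1e-4)) / 2e-4
        theta -= lr * grad
    return theta

inits = np.random.uniform(-2.0, 4.0, size=1000)
finals = np.array([gradient_descent(t) for t in inits])
# Roughly two thirds of initialisations converge to the gaming strategy and stay there.
print("fraction converging to reward gaming:", np.mean(np.abs(finals) < 0.5))
```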

All in all, the backchaining to local search technique tells us that this looks like a problem that should happen frequently in practice. This lines up well with the evidence: see this list of reward gaming examples in the literature, and the corresponding post.

The last thing to point out in a reference post like this is how to interpret the technique. Because just like models, no technique applies to every situation. If you cannot backchain to local search from your alignment issue, it might mean one of the following things.

  • There is a good scenario and you just failed to find it.

  • There is no such scenario, or any such scenario is very improbable. In that case, I would say the real question is whether there are other settings in which such scenarios do exist and are probable. If not, I would personally read that as saying that this problem could only happen after a complete shift in learning algorithms, which might or might not be an assumption you’re ready to accept.

  • It makes no sense to apply backchaining to your problem. For example, if you think about the issue of non-realizability in learning, you don’t need backchaining to tell you that it matters in practice, since no model space ever used in practice will contain “reality”. So there’s no point trying to backchain from this problem.

That is, this technique assumes that your problem is a specific behavior of a trained system (like reward gaming), and that learning algorithms will not shift completely before we reach AGI. So it has close ties with the prosaic AGI approach to AI Safety.

In conclusion, when you encounter or consider a new alignment problem that concerns a specific behavior of the AI (as opposed to a general theoretical issue, for example), backchaining to local search means trying to find a scenario where a system suffering from your alignment problem emerges from local search in some model space. If you put decent probability on the prosaic AGI idea, this technique should tell you something important about your alignment problem.