A dilemma for prosaic AI alignment

Epistemic status: I predict that people who focus on prosaic AI alignment have thought of this before, in some way at least. But I don’t know what they would say in response, so I’m writing this up so I can find out! I’m making it a post instead of an email so that the discussion can be public.

Characterization of prosaic AI alignment: Prosaic AI methods—the sort of methods we are using today, rather than hypothetical new methods based on a deeper understanding of intelligence—might be sufficient to make human-level AGI in the next two decades or so, and if this happens we’d better be prepared. Thus we should think about how to take prosaic AI methods and combine or modify them in various ways to make something that is as competitive, or almost as competitive, as cutting-edge AI. Examples of this approach are debate, imitating humans, preference learning, and iterated distillation and amplification.

Conjecture: Cutting-edge AI will come from cutting-edge algorithms/architectures trained towards cutting-edge objectives (incl. unsupervised learning) in cutting-edge environments/datasets. Anything missing one or more of these components will suffer a major competitiveness penalty.

  • Example: Suppose that the best way we know of to get general intelligence is to evolve a population of giant neural nets with model-free learning in a part-competitive, part-cooperative, very diverse environment consisting of an ensemble of video games. One year, the systems that come out of this process are at dog level; two years later they are at chimpanzee level; two years later still they are at IQ-80 human level… and it is expected that scaling up this sort of thing will lead, within a few years, to smarter-than-human AGI. (A toy sketch of this kind of setup appears below.)
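To make the Conjecture’s three ingredients concrete, here is a purely illustrative toy sketch in Python, separating the “algorithm/architecture” (a crude evolutionary loop over weight vectors), the “objective” (a fitness score), and the “environment” (a stand-in scoring function). Every name here (`toy_environment_score`, the population sizes, the task) is invented for illustration and is not meant to resemble any real training setup.

```python
# Purely illustrative toy sketch (not a real training setup): it separates the three
# ingredients named in the Conjecture -- algorithm/architecture, objective, and
# environment -- using a crude evolutionary loop over toy "policies".
import numpy as np

rng = np.random.default_rng(0)
POP_SIZE, DIM, GENERATIONS = 32, 16, 50

def toy_environment_score(weights: np.ndarray) -> float:
    """Invented stand-in for 'fitness in a diverse ensemble of video-game environments'."""
    target = np.linspace(-1.0, 1.0, DIM)             # pretend this encodes the tasks
    return -float(np.sum((weights - target) ** 2))   # higher is better

# "Cutting-edge algorithm/architecture": here, just a population of weight vectors.
population = rng.normal(size=(POP_SIZE, DIM))

for generation in range(GENERATIONS):
    # "Cutting-edge objective": score every member of the population in the environment.
    scores = np.array([toy_environment_score(w) for w in population])
    # Keep the best half, refill with mutated copies (a crude evolutionary step).
    elite = population[np.argsort(scores)[-POP_SIZE // 2:]]
    children = elite + 0.1 * rng.normal(size=elite.shape)
    population = np.concatenate([elite, children])

print("best toy fitness:", max(toy_environment_score(w) for w in population))
```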

The Dilemma: Choose Plan 1 or Plan 2:

Plan 1: Train a system from scratch into your alignment scheme, using cutting-edge algorithms but not cutting-edge environments or objectives. (The environments and objectives are whatever your safety scheme calls for, e.g. debates, imitating humans, a series of moral-choice situations, etc.)

  • Example: We take the best training algorithms and architecture we can find, but instead of training on an ensemble of video games, our AI is trained from scratch to win debates with human judges. We then have it debate on important topics to give us valuable information.

  • Problem with Plan 1: This is not competitive, because of the Conjecture. Continuing the example, if our AI is even able to debate complex topics at all, it won’t be nearly as good at getting to the truth on them as it would be if it were built using Plan 2... (A toy sketch of Plan 1 in the same terms follows.)
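Continuing the purely illustrative toy setup above, Plan 1 looks like this: the same crude algorithm is pointed, from a random initialization, at the safety scheme’s objective rather than at the cutting-edge environment. `toy_debate_score` is an invented stand-in for “win debates with human judges”, not a real measure of debate performance.

```python
# Toy sketch of Plan 1 (illustration only): train from scratch, with the same crude
# algorithm as above, but directly on the safety scheme's objective.
import numpy as np

rng = np.random.default_rng(1)
POP_SIZE, DIM, GENERATIONS = 32, 16, 50

def toy_debate_score(weights: np.ndarray) -> float:
    """Invented stand-in for the safety objective (e.g. winning debates with human judges)."""
    return -float(np.sum(weights ** 2))  # pretend the zero vector is the ideal debater

population = rng.normal(size=(POP_SIZE, DIM))  # "from scratch": random initialization
for generation in range(GENERATIONS):
    scores = np.array([toy_debate_score(w) for w in population])
    elite = population[np.argsort(scores)[-POP_SIZE // 2:]]
    population = np.concatenate([elite, elite + 0.1 * rng.normal(size=elite.shape)])

# Per the Conjecture, this system never sees the cutting-edge environments/objectives,
# which is where (the post argues) the competitiveness penalty comes from.
```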

Plan 2: Train a cutting-edge AI system, and then retrain it into your AI alignment scheme.

  • Example: You use all the cutting-edge methods to train an agent that is about as generally intelligent as the average IQ-80 human. Then you retrain it to win debates with human judges, and have it debate on important topics to give you valuable information.

  • Problem with Plan 2: This is a recipe for making a deceptively aligned mesa-optimizer. The system trained with cutting-edge methods will be an unsafe system; it will be an optimizer with objectives very different from what we want. Our retraining process had better be really good at changing those objectives… and that’s hard, for reasons explained here and here. (A toy sketch of Plan 2 follows.)
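And here is Plan 2 in the same purely illustrative toy terms: pretrain with the cutting-edge objective and environment, then retrain (fine-tune) the resulting system on the safety scheme’s objective. As before, every function name is an invented stand-in; the structural point is the two-phase training, not the particular toy functions.

```python
# Toy sketch of Plan 2 (illustration only): cutting-edge pretraining, then retraining
# into the alignment scheme.
import numpy as np

rng = np.random.default_rng(2)
POP_SIZE, DIM = 32, 16

def toy_environment_score(w: np.ndarray) -> float:
    """Invented stand-in for the cutting-edge environment/objective."""
    return -float(np.sum((w - np.linspace(-1.0, 1.0, DIM)) ** 2))

def toy_debate_score(w: np.ndarray) -> float:
    """Invented stand-in for the safety scheme's objective."""
    return -float(np.sum(w ** 2))

def evolve(population: np.ndarray, score_fn, generations: int) -> np.ndarray:
    """The same crude evolutionary loop as above, parameterized by the objective."""
    for _ in range(generations):
        scores = np.array([score_fn(w) for w in population])
        elite = population[np.argsort(scores)[-POP_SIZE // 2:]]
        population = np.concatenate([elite, elite + 0.1 * rng.normal(size=elite.shape)])
    return population

population = rng.normal(size=(POP_SIZE, DIM))
population = evolve(population, toy_environment_score, generations=50)  # cutting-edge pretraining
population = evolve(population, toy_debate_score, generations=10)       # retrain into the scheme

# The worry described above: the pretrained system may already be an optimizer with its
# own objective, and the short retraining phase may change its behavior without
# changing that objective (deceptive alignment).
```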

Conclusion: I previously thought that mesa-optimizers would be a problem for prosaic AI safety in the generic sense: If you rely on prosaic methods for some of your components, you might accidentally produce mesa-optimizers, some of which might be misaligned or even deceptively aligned. Now I think the problem is substantially harder than that: To be competitive, prosaic AI safety schemes must deliberately create misaligned mesa-optimizers and then (hopefully) figure out how to align them so that they can be used in the scheme.

Of course, even if they suffer major competitiveness penalties, these schemes could still be useful if coupled with highly successful lobbying/activism to prevent the more competitive but unsafe AI systems from being built or deployed. But that too is hard.

EDIT: After discussion in the comments, particularly with John_Maxwell (though this also fits with what Evan and Paul said), I’m moderating my claims a bit:

Depending on what kind of AI is cutting-edge, we might get a kind that isn’t agenty. In that case my dilemma doesn’t really arise, since mesa-optimizers aren’t a problem. One way we might get a kind that isn’t agenty is if unsupervised learning (e.g. “predict the next word in this text”) turns out to reliably produce non-agents. I am skeptical that this is true, for reasons explained in my comment thread with John_Maxwell below, but I admit it might very well be. Hopefully it is.


Footnote about competitiveness:

I think we should distinguish between two dimensions of competitiveness: resource-competitiveness and date-competitiveness. We can imagine a world in which AI safety is date-competitive with unsafe AI systems but not resource-competitive, i.e., the insights and techniques that allow us to build unsafe AI systems also allow us to build equally powerful safe AI systems, but at a substantially higher price. We can imagine a world in which AI safety is resource-competitive but not date-competitive, i.e., for a dangerous period of time it is possible to make unsafe powerful AI systems but no one knows how to make a safe version; then finally people figure out how to make a similarly powerful safe version, and it costs about the same.

I think the argument I give in this post applies to both kinds of competitiveness, but I’m particularly concerned about date-competitiveness.


Thanks to David Rein, Ramana Kumar, and Evan Hubinger for brief and helpful conversations that led to this post.