A positive case for how we might succeed at prosaic AI alignment

This post is my attempt at something like a response to Eliezer Yudkowsky’s recent discussion on AGI interventions.

I tend to be relatively pessimistic overall about humanity’s chances at avoiding AI existential risk. Unlike some others who share my pessimism, however—Eliezer Yudkowsky, in particular—I believe that there is a clear path forward for how we might succeed within the current prosaic paradigm (that is, the current machine learning paradigm), one that looks plausible and has no fundamental obstacles.

In the comments on Eliezer’s discussion, this point about whether there exists any coherent story for prosaic AI alignment success came up multiple times. From Rob Bensinger:

I think it’s pretty important here to focus on the object-level. Even if you think the goodness of these particular research directions isn’t cruxy (because there’s a huge list of other things you find promising, and your view is mainly about the list as a whole rather than about any particular items on it), I still think it’s super important for us to focus on object-level examples, since this will probably help draw out what the generators for the disagreement are.

In that spirit, I’d like to provide my own object-level story for how prosaic AI alignment might work out well.

Of course, any specific story for how we might succeed is going to be wrong simply because it specifies a bunch of details, and this is such a specific story. The point, however, is that the fact that we can write plausible stories for how we might succeed that don’t run into any fundamental obstacles implies that the problem isn’t “we don’t even know how we could possibly succeed” but rather “we know some ways in which we might succeed, but they all require a bunch of hard stuff that we have to actually execute on,” which I think is a pretty different place to be.

One thing I do agree with Eliezer on, however, is that, when you’re playing from behind—as I think we are—you play for variance. That means embracing strategies that might not work in expectation, but that have long tails in the positive direction, and I definitely see my picture here as falling into that category.

Furthermore, as one should probably expect with any full roadmap for solving such a complicated problem, there’s still a lot left out of my picture—especially my intuitions for why I think each of these steps is actually plausible. I am currently working on a research agenda that will go into a lot more detail on that, but until it’s published, it might be best to just think of this post as an overview of what that agenda will look like.[1]

Alright, without further ado, here’s my concrete picture for how we might end up succeeding:

  1. We produce an understanding of a simple, natural class of agents such that agents of this form are capable of doing all of the things that we might want a powerful, advanced AI to do—but such that no agents of this form will ever act deceptively.

    • My current best guess for what such a natural class might look like is a myopic agent—that is, an agent that only cares about its next action rather than the long-term consequences of its actions. I think it is possible to produce a simple, natural description of myopia such that myopic agents are still capable of doing all the powerful things we might want out of an AGI but such that they never have any reason to be deceptive (see the first sketch after this list for a toy illustration of this kind of objective).[2]

    • In the language of training stories, (1) gives us our training goal, the mechanistic description of what sort of model we’re trying to produce.

  2. We develop some way of determining whether a given non-deceptive model falls into the natural class we developed in step (1). It’s fine for this not to work for all non-deceptive models, as long as the class of non-deceptive models that it works on is large enough to make (3) and (4) go through.

    • Note that we will never rely on (2) working in a situation where we are given an agent that is already deceptive.

    • One way to accomplish (2) might be to develop worst-case transparency tools that can tell whether the basic structure of a given model is consistent with (1).

  3. We develop a training procedure such that, given that the current model being trained falls into the natural class from step (1), additional training will always keep it in that class.

    • If agents from our natural class are capable of deploying the tools we developed in step (2), then one way to accomplish (3) might be to have the training be done by the model being trained itself, given access to the tools from (2).

    • For (3) to just work very straightforwardly, it would need to be the case that the set of models that (2) works for is large enough to include any model that can be reached by one step of training starting from a model in the natural class from (1).

  4. Using (2), we guide very early training (before the model has the capability to be deceptive) to get some model (from which we can initialize further training) that falls into the natural class from (1).

    • For (4) to just work very straightforwardly, it would need to be the case that the set of models that (2) works for is large enough to include any model that can be produced early in training before the model has the capability to be deceptive.

    • Alternatively, the natural class from (1) could just be broad enough to include most models at initialization, though I suspect that will cause problems for (3).

  5. Combining (3) and (4), we get an inductive guarantee that we can produce models that fall into our natural class. Because in (1) we constructed our natural class to be sufficient for any tasks that we might want our AI to do, we can now train non-deceptive AIs on any task that we might want them for. (For a schematic sketch of how (2) through (6) fit together, see the second code sketch after this list.)

    • For the training process in (5) to be competitive, we also need (3) and (4) to not be so resource-intensive that they are substantially harder than training an unaligned model.

    • In the language of training stories, the inductive argument here in (5) is our central training rationale for why we’ll get a model that satisfies the training goal from (1).

  6. Given a powerful and non-deceptive AI produced from (5), we use standard red-teaming (e.g. testing on lots of examples) to find places where the model fails and retrain using (3) until the model looks like it’s doing the right thing.

    • Because we know that our model is non-deceptive from (5)—and since all of our retraining is done via (3)—the fact that the model looks like it’s doing the right thing should give us a real guarantee that it’ll actually do the right thing in similar situations, since we know it won’t just be pretending to do the right thing.

  7. We ensure that the leading AI lab uses (5) + (6) to produce their most powerful and advanced AI systems. By being first, they are able to set the standard for how training powerful machine learning models should generally be done.[3] Because of the strong tendencies for AI labs to copy each other’s successes, other labs also use (5) + (6) to train their powerful and advanced AI systems, ensuring that all of the most powerful AIs in the world are aligned.

    • Though these other labs might scale (5) + (6) further, as long as (3) is robust to scale, such systems should stay aligned.

    • Though I think that the forces pushing for homogeneity of AI training processes across labs are strong, once the set of labs with the capability to build misaligned AI systems gets large enough—e.g. once it includes all the small labs too—one of them is bound to break that homogeneity. Thus, there is a period of vulnerability after (7) and before (8) where smaller labs might not follow (5) + (6) and instead build misaligned AI systems.

      • Even if that does happen, however, since it’s only small labs with limited capabilities building misaligned AI in a world that already contains aligned AIs built by much larger and more capable labs, it should be quite difficult for such misaligned AI systems to actually destabilize such a world.

  8. We use the AI systems from (7) to help us design the next round of powerful and advanced AI systems and develop techniques to end the period of vulnerability.

    • I won’t say too much about exactly what we would do here, mostly because it’s not a problem that we have to solve before we actually get the powerful aligned AI systems to help us solve it, so it’s mostly not a problem that I think we need to focus on right now.
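
To make (1) a bit more concrete, here is a toy sketch of the distinction between a myopic objective and an ordinary long-horizon objective. This is purely illustrative and is not a real training setup: `myopic_imitation_loss`, `trusted_process`, and so on are hypothetical stand-ins for the sort of objective I have in mind, where the agent is only ever scored on its current output rather than on the downstream consequences of that output.

```python
from typing import Callable, List

# An "answerer" maps a query string to an answer string.
Answerer = Callable[[str], str]


def myopic_imitation_loss(agent: Answerer,
                          trusted_process: Answerer,
                          query: str) -> float:
    """Hypothetical per-query loss: the agent is scored ONLY on whether its
    answer to the current query matches a trusted (possibly very slow,
    HCH-like) process on that same query. Nothing downstream of the answer
    is ever scored, so the agent gets no reward for steering the long-term
    consequences of its output."""
    return 0.0 if agent(query) == trusted_process(query) else 1.0


def discounted_return(rewards: List[float], discount: float = 0.99) -> float:
    """For contrast: an ordinary non-myopic objective, which does reward the
    agent for the long-term consequences of its actions."""
    return sum(discount ** t * r for t, r in enumerate(rewards))
```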

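And here is a schematic sketch of how (2) through (6) are meant to fit together into a single training loop. Again, this is only pseudocode under optimistic assumptions: `in_natural_class`, `train_step`, and `red_team` are placeholders for capabilities we don't currently have, and all of the real work in the picture above is in making those placeholders into things that actually exist.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Model:
    """Stand-in for a model checkpoint; `params` is whatever training updates."""
    params: list = field(default_factory=list)


def in_natural_class(model: Model) -> bool:
    """Step (2): hypothetical check that `model` falls into the natural
    (e.g. myopic, non-deceptive) class from step (1). Only assumed to work
    on models that are not already deceptive."""
    return True  # placeholder; making this real is the hard open problem


def train_step(model: Model, task) -> Model:
    """One step of training on `task`. Step (3) asks that, starting from a
    model in the natural class, a single step of training keeps it there."""
    return Model(params=model.params + [task])  # placeholder update


def red_team(model: Model) -> List:
    """Step (6): search for inputs on which the model visibly fails."""
    return []  # placeholder; would return a list of failing cases


def train_aligned_model(tasks, max_red_team_rounds: int = 10) -> Model:
    # Step (4): guide very early training so that the initial model is in the
    # natural class, before it could possibly be deceptive.
    model = Model()
    assert in_natural_class(model), "early-training guidance failed"

    # Steps (3) + (5): inductively, every checkpoint stays in the class, so the
    # final model is non-deceptive no matter what tasks we train it on.
    for task in tasks:
        model = train_step(model, task)
        assert in_natural_class(model), "training left the natural class"

    # Step (6): ordinary red-teaming now gives real guarantees, since a
    # non-deceptive model that looks right is not merely pretending.
    for _ in range(max_red_team_rounds):
        failures = red_team(model)
        if not failures:
            break
        for failure in failures:
            model = train_step(model, failure)
            assert in_natural_class(model), "retraining left the natural class"

    return model


if __name__ == "__main__":
    final = train_aligned_model(tasks=["task_a", "task_b"])
    print("checkpoints trained:", len(final.params))
```
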
If I had to guess what the hardest part of the above picture will be, I’d probably guess (2),[4] which is why I’m so excited about Automating Auditing as a way to start making progress on (2) now. That being said, I don’t think there are any fundamental obstacles to solving (2)—(2) very explicitly doesn’t require us to be robust to deceptive models or even to be able to tell whether (1) holds for all non-deceptive models, both of which I think would run into fundamental obstacles, but neither of which we have to do.


  1. ↩︎

    If you want access to an early draft of my agenda, message me privately and I might send it to you, though it’s still likely to change a lot before it’s released.

  2. ↩︎

    I think it is possible for a myopic agent to still be capable of solving problems that involve non-myopic reasoning (e.g. being a good AI CEO). For example, a myopic agent could myopically simulate a strongly-believed-to-be-safe non-myopic process such as HCH, allowing imitative amplification to be done without ever breaking a myopia guarantee—alternatively, AI safety via market making lets you do AI safety via debate without breaking a myopia guarantee. In general, I think it’s just not very hard to leverage careful recursion to turn non-myopic objectives into myopic objectives such that it’s possible for a myopic agent to do well on them—without breaking the guarantees that ensure that your myopic agent won’t be deceptive. (For a concrete example of what a myopic agent that is capable of doing well on such tasks without ever having any reason to be deceptive might look like, consider LCDT.)

  3. ↩︎

    For an example of what “setting the standard for how training powerful ML models should be done” might look like, consider how once the basic training paradigm of “train massive self-supervised transformer-based language models” was introduced, it was aggressively copied across the field and became the standard for all language-based AI systems.

  4. ↩︎

    Second place for hardest step would probably be (7). Definitely a hard step, but I think that the claim that we mostly only have to persuade the frontrunner makes this not a fundamental obstacle—persuading one organization of one thing is an achievable goal; persuading every person doing AI everywhere would be a fundamental obstacle.