My Plan to Build Aligned Superintelligence

2ish months ago, I realized making sure superintelligent AI doesn’t kill everyone was the biggest problem period. I resolved to understand and maybe help solve the problem. A friend of mine who does alignment research said that, in order to clarify my thinking, I should sit down and try to write my best solution to aligning a superintelligent AI in an hour. I did this, and found it extremely helpful; I then edited it and stared angrily at it for a few more hours to make it less embarrassing to post on LessWrong. This is, so far, my best plan to solve AI alignment. I found writing this very helpful for clarifying my own model of alignment, and what I was still confused about. I would highly recommend that other people starting to think about AI alignment try this exercise. Enjoy!

Problem Statement

To solve alignment and build an aligned artificial general intelligence, you’d need a smarter-than-human system that does what you want it to do; you’d need some confidence that it will do this thing before you hand it a machete and say “sure, do whatever”; and you’d probably also need it to change its behavior in response to you telling it not to do something, telling it what to do, etc., but in a way that tracks what you’re actually asking for, not some proxy that it can mess with. You’d also need to do this before anyone makes an unaligned AGI, and your AGI would have to prevent anyone else from making unaligned AGIs.

To do this, you want an AGI that understands what you actually want to happen in the real physical world, is incentivized to model this instead of optimizing one specific criterion like “I see a video of the thing that maximizes my reward,” and will change its behavior in response to you indicating you want it to do something else. It seems like this kind of system would need to be actively trying to do something similar to how we “do science” to figure things out about the world. Instead of optimizing one specific criterion it could game, like the number of people who press an “I’m happy” button or smile in front of a camera, it would understand that there’s some underlying process, which those specific details are proxies for, that it’s trying to model.

We sometimes take this quality for granted because humans actually care about things in the real world, since our training environment required us to locate and consume real food with specific nutrients so we could walk around and make real babies. Part of the tricky bit is what reward you actually put in place that correlates with “do X things in the real world” and not just “patterns of camera data that match the original training set I gave you,” or “some heuristic that correlates with humans smiling in the training distribution but is actually best matched by a weird pattern to tile the universe with outside of the training distribution.” It seems like whatever you do to make this happen doesn’t involve just trying to program in some kind of friendliness/deference. If you ask an AI to turn a brick into a diamond, and it projects a video of a brick turning into a diamond in front of your camera, it’s not necessarily going like “haha, I am deceiving the human into thinking there is a brick turning into a diamond here, but really there isn’t!!!” It’s quite likely the reasoning process is more like “I want to make it such that there’s this video output; this is the most efficient way to do it given my available resources.” Similarly, if you try to ask it to defer to you/map your preferences before it does something, but it can simulate a better/easier-to-work-with version of “you,” it seems pretty likely to defer to that simulation instead, unless you know you’re getting it to care about You, Real Human Who Is Making AGI.

OK, that’s the problem statement. My plan for making an AI that models things the way you model things, AND will hopefully shut down when you ask it to, AND that you can get to pursue a goal you actually want, is below. I assume you use something basically like modern deep learning, because I haven’t had time to think of a new paradigm for artificial intelligence in the past few hours.

First, take several steps where you have an AI model the physical world before you deploy it.

  • Spend a fair amount of time training agents that predict how scenarios will unfold in specific boxed environments, in a way you think will actually get them to model the world. I claim that you should do this for a while before you give any AI a goal to do things to objects in the world, because this kind of predicting system probably has less incentive to do something you see as deceptive and has much slower and less clear paths to destroying humans; and once you get a modeler you trust, it can help you make a plan for a thing that you want to do.

    • Get them to predict sensor readings for specific kinds of data on the things. For example, if they’re trying to predict the force of a brick falling, have the model separately predict future video given past video, predict future audio sensor readings given past audio, and predict one kind of sensor output given the other, with a bunch of sensors for the same brick falling, so the thing has a cluster of kinds of sensor readings that correlate with a certain phenomenon, and you get it to use a bunch of them to predict each other and make some kind of cluster of predicted data across a bunch of sensors for any one scenario. After this, have humans label this physical data with certain natural language descriptors and measurements for parameters humans would use to describe it (e.g. height of brick, speed of brick), and train it on those natural language descriptors and parameters. Then, train it to label its predictions in human-speak and human measurements. (A rough code sketch of this setup is below, after this list.)

    • Make this kind of predictor predict things in human-speak reliably using different examples: e.g., if you turn the lights off in the room, so there’s no video, make sure it’s still predicting a “falling brick”; use a bunch of variations/examples where you predict it’ll do stuff wrong like this to check if its model looks like yours. This seems pretty important to do for a while, and it seems reckless to deploy an AI to actually do things in the real world without having run these kinds of tests on it.

      • Problem: this training data is expensive to get compared to random YouTube videos, so this training strategy is quite expensive and costs a lot of time (current ML is already bottlenecked by limited training data). An AI predicting things with just video could become competitive more quickly. IDK, though; I expect training data to get more expensive in general as people try to get from large language models and some video prediction to AGI that can do things.
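To make the cross-sensor prediction idea above a bit more concrete, here’s a minimal sketch of what the training losses might look like, assuming a PyTorch-style setup. The module names, feature shapes, and the trivial linear encoders are all illustrative assumptions, not a worked-out architecture; the point is just that each sensor stream is trained to predict the others, and a separate head is trained against human labels afterwards.

```python
# Minimal sketch of the cross-modal predictor idea, assuming PyTorch.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalPredictor(nn.Module):
    def __init__(self, video_dim=512, audio_dim=128, latent_dim=256, vocab_size=10000):
        super().__init__()
        # Separate encoders per sensor modality, mapped into one shared latent.
        self.video_enc = nn.Linear(video_dim, latent_dim)
        self.audio_enc = nn.Linear(audio_dim, latent_dim)
        # Heads that predict each modality's future readings from the shared latent,
        # so each sensor stream is used to predict the others.
        self.video_head = nn.Linear(latent_dim, video_dim)
        self.audio_head = nn.Linear(latent_dim, audio_dim)
        # Separate head for human-provided labels ("falling brick", height, speed),
        # trained only after the cross-sensor prediction is working.
        self.label_head = nn.Linear(latent_dim, vocab_size)

    def forward(self, video_feats, audio_feats):
        latent = self.video_enc(video_feats) + self.audio_enc(audio_feats)
        return {
            "video": self.video_head(latent),
            "audio": self.audio_head(latent),
            "label_logits": self.label_head(latent),
        }

def training_losses(model, batch):
    out = model(batch["past_video"], batch["past_audio"])
    # Cross-sensor prediction: each predicted stream is scored against the
    # actual future readings from the same scenario.
    pred_loss = (
        nn.functional.mse_loss(out["video"], batch["future_video"])
        + nn.functional.mse_loss(out["audio"], batch["future_audio"])
    )
    # Human-label loss: only used where humans have labeled the scenario.
    label_loss = nn.functional.cross_entropy(out["label_logits"], batch["label"])
    return pred_loss, label_loss
```

The perturbation checks from the earlier bullet (lights off, still predicts “falling brick”) could then be run as held-out evaluations against the label head.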

I claim that if you do this kind of training strategy right, you can get it to predict whether real diamonds are getting made in a room, train some sort of diamond-making agent whose loss function is based on this prediction model’s judgment of whether there are diamonds in the environment, and solve Eliezer’s toy problem of how you can get an AI to turn things into diamonds and nothing else [although you shouldn’t build this, because it’d be dangerous].
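A toy sketch of what that setup might look like, assuming the predictor from the previous step is already trained and frozen: the diamond-making agent’s reward is just the predictor’s estimate that real diamonds are present, judged from the whole cluster of sensor readings rather than a single camera feed. The `env`, `policy.sample`, and `p_diamonds` interfaces are hypothetical names for illustration.

```python
# Toy sketch: a frozen predictor scores whether real diamonds are present,
# and a separate policy is trained against that score with a simple
# REINFORCE-style update. Interfaces are assumed, not real APIs.
import torch

def train_diamond_maker(policy, frozen_predictor, env, optimizer, episodes=1000):
    for _ in range(episodes):
        obs = env.reset()                       # multi-sensor readings of the room
        action, log_prob = policy.sample(obs)   # stochastic policy keeps its log-prob
        next_obs = env.step(action)             # environment returns new sensor data
        with torch.no_grad():
            # Frozen predictor estimates P(real diamonds in room) from the
            # whole cluster of sensor readings, not just one camera feed.
            reward = frozen_predictor.p_diamonds(next_obs)
        loss = -log_prob * reward               # push the policy toward high-reward actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```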

Next, you have to solve a pretty hard problem, where you go from the predictor AI in your dark room underground, which models specific toy physical scenarios, to a predictor AI that understands broader abstractions/general goals, anywhere from getting you a cup of coffee to, say, burning all the GPUs that aren’t it, but without killing any humans/doing anything you find shitty.

  • Label things that you have more abstract descriptors for, like “two people making a business deal” or “acacia tree growing” or “person shutting down an AI.” Train it with some data on these things; then try to get it to produce full physical projections (3D models, some extra physical data) of what these things would entail, and get people to label whether its predictions’ text descriptors match how a human would describe them. (It would presumably be better/faster than humans at producing full 3D models and other physical data given some extra data, in a similar way to how it’s better at making images than humans are.)

  • I claim at some point you should try to maintain shutdownability by having the AI use this kind of modeling to project its own “would shut down if a human pressed the shutdown button”-ness, and have this be part of the goal you train it for (rough sketch below). For this, the AI would need a good understanding of abstractions like shutting itself down, not just of bricks falling or diamonds being produced. But that also seems easier to learn than reliably learning all human values, so it would be one of the things I’d start trying to do.
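Very roughly, and only as an illustration: one way this could enter the training objective is as a penalty term based on the model’s own projection of whether it would comply with a shutdown request in an imagined rollout. The 0.1 weight, the `p_shuts_down_when_asked` interface, and the idea of scoring imagined rollouts are all assumptions of this sketch, not a worked-out proposal.

```python
# Sketch of folding "would shut down if asked" into the objective.
# Names, weight, and interfaces are illustrative assumptions.
import torch

def combined_objective(task_loss, predictor, imagined_rollout):
    # The predictor projects what happens in the imagined rollout if a human
    # presses the shutdown button partway through.
    p_complies = predictor.p_shuts_down_when_asked(imagined_rollout)
    # Penalize plans the model itself predicts would resist shutdown.
    shutdown_penalty = -torch.log(p_complies + 1e-8)
    return task_loss + 0.1 * shutdown_penalty
```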

Note: as you’re doing this I also imagine frequently doing a bunch of “it’s not trying to kill you” interpretability checks, the specific form of which depends on what your predictor slash budding AGI looks like. I also imagine you doing this in an underground bunker where you feed it hard drives of data and try not to give it access to the Internet. This is kind of a big ask given how current AGI development is going, but whatever, let’s pretend I can do this.

  • At some point in making this modeling agent, you prompt it with something like “series of actions that would involve preventing new AGIs that aren’t you from being built,” where you’re basically asking it to predict what AI it would have to make to do a pivotal act that prevents other unaligned AGIs from being made. This involves it producing some examples of specific actions it would “recommend” taking, which you can check to see if they suck/are evil.

  • You try some basic training of it on human values by getting it to predict a bunch of physical scenarios where it does things [some of these may be parts of the “preventing unaligned AGI” plan], and getting humans whose moral compass you trust to label them. For cheapness, you probably feed it statements about ethics as training data, then get it to use this info to show you its own labels on specific scenarios and to label whether the things it projects are good or shitty. This is basically asking it to model “do humans say this projected world state is OK, and would they prioritize it over X other world states” (a rough sketch of this preference-labeling setup is below, after this list). I don’t think this will make an AGI totally aligned, but you can probably use it to get the AI to do less shitty things on average, while remaining probably shutdownable, as you prepare for the pivotal act where it tries to prevent all unaligned AGIs from being built.

  • Eventually this AI’s predictions give you enough info to help you make an AGI that prevents other AGIs from being built, mostly doesn’t do things you find ethically egregious, and will allow you to shut it down. Now you’re ready to do that thing I was talking about, where you deploy it to kill any other potential AGI, but hopefully not humans.

  • If you do this successfully, congrats!!!! Probably you’re ready to move on to having these predictor AIs make new AIs to do concrete things like inventing better agricultural methods and solving specific human engineering problems, things that are impressive, raise world GDP, and give people time to reflect on human values.

  • Now you’ve solved a lot of problems and have a bunch of time to reflect on what human values you want to put in the AGI/the post-AGI future of humanity. You’ve basically solved the alignment problem, I think. Great job.
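Here’s the rough sketch of the preference-labeling idea mentioned a few bullets up, assuming trusted labelers compare pairs of projected world states and a value head is trained to agree with those comparisons (a standard pairwise-preference loss). The `value_head` interface and tensor shapes are illustrative assumptions, not a claim about how this would actually be built.

```python
# Sketch of pairwise preference labeling over projected world states.
# Trusted humans say which of two projections they prefer; a value head
# is trained to agree. Names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def preference_loss(value_head, state_a, state_b, human_prefers_a):
    # Score each projected world state; higher = "humans say this is more OK."
    score_a = value_head(state_a)
    score_b = value_head(state_b)
    # Bradley-Terry style: P(labeler prefers A) = sigmoid(score_a - score_b).
    logits = score_a - score_b
    target = human_prefers_a.float()
    return F.binary_cross_entropy_with_logits(logits, target)
```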

OK, that’s my best plan to make aligned AGI. TL;DR:

  • Try to make it do specific predictions about the physical world in narrow environments, and THEN test whether they match how humans describe them.

  • Get it to predict how to prevent other AGIs from being built. Check if its predictions suck.

  • Get it to help you make an AGI that prevents other, unaligned AGIs from being built, but doesn’t do stuff you find egregious and will shut down if you ask it to.

  • Do a long reflection of undetermined time on human values and then slowly train more AI systems with your now-enlightened human values.

My central intuition here is that you have to get it to do things with a world-model that matches yours, and it doesn’t actually seem like there’s some key conceptual solution that will make it do this; you just have to do a bunch of training, check a bunch of ways it could go wrong, and keep an eye on it to make sure it’s not trying to kill you.

I’ve thought of other ways that stuff in this plan might go wrong, and specific patches you could try to make them less likely to happen, but they’d be exhausting to write down and would make the post too long. I’m very happy to hear objections/criticisms of why this approach seems wrong or why people think the actual solution looks very different, and I’ll reply to your comment and tell you my patches for it, or be like “damn, you’re right” if your comment is really good. Or send me hate mail at my email apollonianblues@gmail.com.

Thanks for reading my Very Great Plan for Making the Aligned Superintelligence! Again, I found doing this quite helpful for clarifying my personal models, and I’d highly recommend other people write down their Plans for Aligning Superintelligence as well.