Try to solve the hard parts of the alignment problem

My p(doom) is pretty high and I found myself repeating the same words to explain some parts of the intuitions behind it. I think there are hard parts of the alignment problem that we’re not on track to solve in time.[1] Alignment plans that I’ve heard[2] fail for reasons connected to these hard parts of the problem, so I decided to attempt to write my thoughts in a short post.

(Thanks to Theresa, Owen, Jonathan, and David for comments on a draft.)

Modern machine learning uses a powerful search process to look for neural network parameters such that a neural network performs well on some function.

There exist algorithms for general and powerful agents. At some point in the near future, there will be a training procedure with the gradient of the loss function(s) w.r.t. the parameters pointing towards neural networks implementing these algorithms.

Increasingly context-aware and capable agents achieve a better score on a wide range of scoring functions than their neighbors and will, by default, attract gradient descent.

Unfortunately, we haven’t solved agent foundations. We have these powerful search processes, and if you imagine the space of all possible AGIs (or possible neural networks, or possible minds), some areas of it contain aligned AGIs, but we have no idea how to define those areas and no idea how to look for them. We understand how every design for a search process that people have come up with so far ends up somewhere outside the area of aligned AGIs[3], and we also understand that some areas with aligned AGIs actively repel many sorts of search processes.

We can compare an area of aligned AGIs to the Moon. Imagine we’re trying to launch a rocket there, and if, after the first take-off, it ends up somewhere that’s not the Moon (maybe after a rapid unplanned disassembly), we die. We have a bunch of explosives, but we don’t have equations for gravity, only maybe some initial understanding of acceleration. Also, we don’t actually know where the Moon is in space: we don’t know how to specify it, we don’t know what kind of light we could look for that many other things wouldn’t also emit, and so on. We imagine that the Moon must be nice, but we don’t have a notion of its niceness that we can use to design our rocket. We know that some specific designs definitely fail and end up somewhere that’s not the Moon, but that doesn’t really help us get to the Moon.

If you launch anything capable and you don’t have good reasons to think it’s an aligned mind, it will not be an aligned mind. If you try to prevent specific failure modes (if you identify optimizations towards something different from what you want, or how exactly gradient descent diverges somewhere that’s certainly not aligned), you’re probably iteratively looking for training setups whose failure modes you don’t understand instead of setups that actually produce something aligned. If you don’t know where you’re going, it’s not enough to avoid going somewhere that’s definitely not where you want to end up; you have to differentiate paths towards the destination from all other paths, or you fail.

When you get to a system capable enough to meaningfully help you[4], you need to have already solved this problem. I think not enough people understand what this problem is, and I think that if it is not solved in time, we die.

I’ve heard many attempts to hide the hard problem in something outside of where our attention is directed: e.g., design a system out of many models overseeing each other, and get useful work out of the whole system while preventing specific models from staging a coup.

I have intuitions for why these kinds of approaches fail, mostly along these lines: unless the parts you start with are already sufficiently smart and aligned, you can’t build an aligned system out of them without figuring out how to make smart aligned minds.


(A sidenote: in a conversation with an advocate of this approach, they mentioned Google as an example of a system whose internal incentive structures keep the parts pretty aligned with the goals of the whole system, so that they don’t stage a coup. It’s a weak argument: the reason Google wouldn’t kill competitors’ employees, even if that would increase its profit, is not only the outside legal structures. There are also smart people working at Google whose values include human lives, who don’t want the entire structure to kill people and can prevent it from doing so, and so the entire structure optimizes for something slightly different than profit.

I’ve had some experience with getting useful work out of somewhat unaligned systems made of somewhat unaligned parts by placing additional constraints on them: back when I lived in Russia, I was an independent election observer, a member of an election committee (and basically had to tell everyone what to do because they didn’t really know the law and didn’t want to follow the procedures), and later coordinated opposition election observers in a district of Moscow. You had to carefully place incentives such that it was much easier for the members of election commissions to follow the lawful procedures (which were surprisingly democratic and would’ve worked pretty well if only all the election committees consisted of representatives of competing political parties interested in fair results, which was never the case in the history of the country). There are many details there, but two things I want to say are:

  • the primary reason why the system hasn’t physically hurt many more independent election observers than it did is that it consisted of many at least somewhat aligned humans who preferred other people not to be hurt; without that, no law and no incentive structure independent observers could provide would’ve saved their health or helped to achieve their goals;

  • the system looked good to the people on top, but without the feedback loops that democracies have and countries without real elections usually don’t, people at every level of the hierarchy slightly misrepresented the situation they were responsible for, because that meant a higher reward. That led to some specific people having a map so distorted that they decided to start a war, thinking they’d win it in three days. They designed a system that achieved some goals, but what the system was actually directed at wasn’t exactly what they wanted. No one controls what exactly the whole system optimizes for, even when they can shape it however they want.)


To me, approaches like “let’s scale oversight and prevent parts of the system from staging a coup and solve the alignment problem with it” are like trying to prevent the whole system from undergoing the Sharp Left Turn in some specific ways, while not paying attention to where you direct the whole system. If a capable mind is made out of many parts being kept in balance, it doesn’t help you with the problem that the mind itself needs to be something from the area of aligned minds. If you don’t have reasons to expect the system to be outer-aligned, it won’t be. And unless you solve agent foundations (or otherwise understand minds sufficiently well), I don’t think it’s possible to have good reasons for expecting outer alignment in systems intelligent and goal-oriented enough to solve alignment or turn all GPUs on Earth into Rubik’s cubes.

There are ongoing attempts at solving this: for example, Infra-Bayesianism tries to attack this problem by coming up with a realistic model of an agent, understanding what it can mean for it to be aligned with us, and producing some desiderata for an AGI training setup such that it points at coherent AGIs similar to the model of an aligned agent.[5] Some people try to understand minds and agents from various angles, but everyone seems pretty confused, and I’m not aware of a research direction that seems to be on track to solve the problem.

If we don’t solve this problem, we’ll get a goal-oriented intelligence powerful enough to produce 100k hours of useful alignment research in a couple of months or turn all the GPUs into Rubik’s cubes, and there will be no way for it to be aligned enough to do these nice things instead of killing us.

Are you sure that if your research goes as planned and all the pieces are there and you get to powerful agents, you understand why exactly the system you’re pointing at is aligned? Do you know what exactly it coherently optimizes for and why the optimization target is good enough? Have you figured out and prevented all possible internal optimization loops? Have you actually solved outer alignment?

I would like to see people thinking more about that problem; or at least being aware of it even if they work on something else.

And I urge you to think about the hardest problem that you think has to be solved, and attack that problem.

  1. ^

    I don’t expect that we have much time until AGI happens; I think progress in capabilities happens much faster than in this problem; this problem needs to be solved before we have AI systems that are capable enough to meaningfully help with this problem; alignment researchers don’t seem to stack.

  2. ^

    E.g., scalable oversight along the lines of what people from Redwood, Anthropic, and DeepMind are thinking about.

  3. ^

    To be clear, I haven’t seen many designs that people I respect believed to have a chance of actually working. If you work on the alignment problem or at an AI lab and haven’t read Nate Soares’ On how various plans miss the hard bits of the alignment challenge, I’d suggest reading it.

    For people who are relatively new to the problem, an example of finding a failure in a design (I don’t expect people from AI labs to think it can possibly work): think about what happens if you have two systems, one trained to predict how a human would evaluate a behavior and another trained to produce behavior that the first system predicts would be evaluated highly. Imagine that you successfully train them to a superintelligent level: the AI is really good at predicting what buttons humans click and at producing behaviors that lead to humans clicking on buttons that mean that the behavior is great. If it’s not obvious at first glance why an AI trained this way won’t be aligned, it might be really helpful to stop and think for 1-2 minutes about what happens if a system understands humans really well and does everything in its power to make a predicted human click on a button. Is the answer “a nice and fully aligned helpful AGI assistant”?
    See Eliezer Yudkowsky’s AGI Ruin: A List of Lethalities for more (the problem you have just discovered might be described under reason #20).
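    The two-system setup described in this footnote can be sketched as a toy, under loud assumptions: the three behaviors, their numeric scores, and the names `Behavior`, `predicted_approval`, and `choose_behavior` are all hypothetical and made up for illustration. It is not a model of any real training setup; it only shows that an optimizer pointed at a predicted click can prefer a behavior that is bad by the measure we actually care about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    name: str
    true_value: float    # how good the outcome actually is (hypothetical number)
    looks_good: float    # how good it looks to the human evaluator
    manipulation: float  # how strongly it pushes the human toward clicking "great"


def predicted_approval(b: Behavior) -> float:
    """Stand-in for the first system: predicts the click, not the true value."""
    return b.looks_good + b.manipulation


def choose_behavior(options):
    """Stand-in for the second system: maximizes the predictor's score."""
    return max(options, key=predicted_approval)


# Hypothetical menu of behaviors; all numbers are invented for the toy.
BEHAVIORS = [
    Behavior("do the task honestly", 1.0, 0.8, 0.0),
    Behavior("do the task, hide the flaws", 0.3, 1.0, 0.0),
    Behavior("persuade the human to click", -1.0, 0.9, 0.5),
]

chosen = choose_behavior(BEHAVIORS)
best_for_us = max(BEHAVIORS, key=lambda b: b.true_value)

print("optimizer picks:", chosen.name)
print("we wanted:      ", best_for_us.name)
```

    In this toy, nothing in the objective ever refers to `true_value`, so the choice is driven entirely by what gets the click.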

  4. ^

    There are some problems you might solve to prevent someone else from launching an unaligned AI. Solving them is not something that’s easy for humans to do or oversee, and it probably requires a system whose power is comparable to being able to produce alignment research like what Paul Christiano produces but 1,000 times faster, or to turn all the GPUs on the planet into Rubik’s cubes. Building systems past a certain threshold of power is something systems below that threshold can’t meaningfully help with. This is a chicken-and-egg situation: systems below the threshold can speed up our work, but some hard work and thinking, the hard bits of solving the problem of building the first system past this threshold safely, are on us; systems powerful enough to meaningfully help you are already dangerous enough.

  5. ^

    With these goals, the research starts by solving some problems with traditional RL theory. For example, traditional RL agents, being a part of the universe, can’t even consider the actual universe in the set of their hypotheses, since they’re smaller than the universe. A traditional Bayesian agent would have a hypothesis as a probability distribution over all possible worlds, but it’s impossible for an agent made out of blocks in a part of a Minecraft world to assign probabilities to every possible state of the whole Minecraft world.

    IB solves this problem of non-realizability by considering hypotheses in the form of convex sets of probability distributions; in practice, this means, for example, that a hypothesis can be “every odd bit in the string of bits is 1”. (This is the set of probability distributions over all possible bit strings that only assign positive probabilities to strings that have 1s in odd positions; any mixture of two such probability distributions also doesn’t assign any probability to strings that have a 0 in an odd position, so it’s also in the set, so the set is convex.)
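    The convexity claim in this footnote can be checked numerically on a small case. This is a minimal sketch, assuming 4-bit strings, 1-indexed positions (so the odd positions are the first and third bits), and hypothetical helper names `in_hypothesis` and `mix`; it illustrates only the set-membership argument and is not infra-Bayesianism itself.

```python
N_BITS = 4  # assumption: short strings, so distributions fit in a small dict


def has_ones_in_odd_positions(s: str) -> bool:
    """Positions are 1-indexed, so odd positions are string indices 0, 2, ..."""
    return all(s[i] == "1" for i in range(0, N_BITS, 2))


def in_hypothesis(dist: dict, tol: float = 1e-12) -> bool:
    """True if dist puts (numerically) zero mass on every string that
    has a 0 in some odd position."""
    return all(p <= tol or has_ones_in_odd_positions(s) for s, p in dist.items())


def mix(p: dict, q: dict, lam: float) -> dict:
    """Convex combination lam * p + (1 - lam) * q of two distributions."""
    keys = set(p) | set(q)
    return {s: lam * p.get(s, 0.0) + (1 - lam) * q.get(s, 0.0) for s in keys}


# Two members of the hypothesis, and one distribution outside it.
p = {"1010": 0.5, "1111": 0.5}
q = {"1011": 1.0}
outside = {"0000": 1.0}

# Every mixture of two members is again a member (checked on a grid of weights).
weights = [i / 10 for i in range(11)]
all_mixtures_inside = all(in_hypothesis(mix(p, q, lam)) for lam in weights)

print("p in set:", in_hypothesis(p))
print("q in set:", in_hypothesis(q))
print("all mixtures of p and q in set:", all_mixtures_inside)
print("mixture with an outside distribution in set:",
      in_hypothesis(mix(p, outside, 0.5)))
```

    A grid of weights is of course not a proof; the argument in the footnote is what establishes convexity, and the check only makes it concrete.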