Why I’m co-founding Aligned AI

I’m leaving the Future of Humanity Institute, the best and most impactful job I’ve ever had, to co-found Aligned AI. For the first time in my research career, I feel the problem of AI alignment is solvable.


Alignment research: a history of partial failures

The history of AI safety is littered with failures and partial successes. The most common examples of failure are ideas that would work well typically, but which fail in extreme situations—and a superintelligent AI is perfectly capable of creating such situations.

  • Low-impact AIs were supposed to allow smart machines to interact with humans without causing huge disruptions. They had some success at ‘almost no impact’. But everyone—including me—failed at developing algorithms with reliably low impact. If the AI is allowed even a little bit of impact, it can make these low-impact restrictions irrelevant.

  • Corrigibility and interruptibility were designed to allow AIs to be reprogrammed even when active and powerful. They have good narrow uses, but aren’t a general solution: though the AI would not interfere with the interruptibility process, it also has no incentive to preserve it or to ensure that its subagents are also interruptible.

  • Oracles (question-answering AIs) and their close relatives, tool AIs, are perennial suggestions, the idea being to limit the AI’s power by restricting it to answering questions or giving suggestions. But that fails when, for instance, the AI is incentivised to manipulate humans through the content of its answers or suggestions.

  • There have been some interesting approaches to limiting AI power, but these were ultimately vulnerable to the AI creating subagents.

  • The different forms of value learning confronted a surprising obstacle: values could not be learnt without making strong assumptions about human rationality, and human rationality could not be learnt without making strong assumptions about human values.

A litany of partial failures suggests that the next approach tried will be a failure as well—unless we can identify why the approaches above failed. Is there a common failure mode for all of them?

The common thread: lack of value extrapolations

It is easy to point at current examples of agents with low (or high) impact, at safe (or dangerous) suggestions, at low (or high) powered behaviours. So we have in a sense the ‘training sets’ for defining low-impact/​Oracles/​low-powered AIs.

It’s extending these examples to the general situation that fails: definitions which cleanly divide the training set (whether produced by algorithms or humans) fail to extend to the general situation. Call this the ‘value extrapolation problem’[1], with ‘value’ interpreted broadly as a categorisation of situations into desirable and undesirable.

Humans turn out to face similar problems. We have broadly defined preferences in familiar situations we have encountered in the world or in fiction. Yet, when confronted with situations far from these, we have to stop and figure out how our values might possibly extend[2]. Since these human values aren’t—yet—defined, we can’t directly input them into an algorithm, so AIs that can’t solve value extrapolation can’t be aligned with human values.

Value extrapolation is thus necessary for AI alignment. It is also almost sufficient, since it allows AIs to draw correct conclusions from imperfectly defined human data. Combined with well grounded basic human values, it will allow the algorithm to extrapolate as well as humans can—better, in fact, using its superhuman abilities.

If that’s successful, AIs that value extrapolate and that start aligned will remain aligned even as they dramatically change the world and confront the unexpected, re-assessing their reward functions as their world-models change.


We think that once humanity builds its first AGI, superintelligence is likely near, leaving little time to develop AI safety at that point. Indeed, it may be necessary that the first AGI start off aligned: we may not have the time or resources to convince its developers to retrofit alignment to it. So we need a way to have alignment deployed throughout the algorithmic world before anyone develops AGI.

To do this, we’ll start by offering alignment as a service for more limited AIs. Value extrapolation scales down as well as up: companies value algorithms that won’t immediately misbehave in new situations, algorithms that will become conservative and ask for guidance when facing ambiguity.
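The ‘conservative, ask for guidance’ behaviour can be illustrated with a minimal sketch (my own illustration, not Aligned AI’s actual method): a toy classifier that answers confidently near its training data, but defers to a human on inputs far outside it. The names and the distance threshold here are hypothetical.

```python
# Minimal illustration (not Aligned AI's actual algorithm): a classifier
# that answers on familiar inputs and defers to a human on novel ones --
# the "conservative, ask for guidance" behaviour described above.

def make_conservative_classifier(train, threshold):
    """train: list of (feature, label) pairs; threshold: the largest
    distance to known data at which the classifier will answer alone."""
    def classify(x):
        # Find the nearest training example and its label.
        dist, label = min((abs(x - f), lab) for f, lab in train)
        if dist > threshold:
            return "ASK_HUMAN"  # novel situation: defer rather than guess
        return label
    return classify

clf = make_conservative_classifier(
    train=[(0.0, "safe"), (1.0, "safe"), (10.0, "unsafe")],
    threshold=2.0,
)

print(clf(0.5))    # near training data -> "safe"
print(clf(9.0))    # near the unsafe example -> "unsafe"
print(clf(100.0))  # far from all training data -> "ASK_HUMAN"
```

A real system would measure novelty in a learned feature space rather than raw one-dimensional distance, but the design choice is the same: when the situation falls outside the training distribution, become conservative instead of extrapolating blindly.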

We will get this service into widespread use (a process that may take some time), and gradually upgrade it to a full alignment process. That will involve drawing on our research and that of others—we will remain strongly engaged with other research groups, providing tools that they can use and incorporating their own results into our service.

We will refine and develop this deployment plan, depending on research results, commercial opportunities, feedback, and suggestions. Contact us in the comments of this post or through our website.

Thanks to LessWrong

I want to thank LessWrong, as a collective entity, for getting us to the point where such a plan seems doable. We’ll be posting a lot here, putting out ideas, asking for feedback—if you can continue giving the same quality of response that you always have (and checking that we ourselves haven’t gone misaligned!), that’s all we can ask from you :-)

  1. Formerly called the ‘model splintering’ problem. ↩︎

  2. Humans have demonstrated skill at value extrapolation, during their childhoods and adolescences, when encountering new stories and thought-experiments, and when their situation changes dramatically. Though human value extrapolation can be contingent, it rarely falls into the extreme failure modes of AIs. ↩︎