My Overview of the AI Alignment Landscape: Threat Models

This is the second post in a sequence mapping out the AI Alignment research landscape. The sequence will likely never be completed, but you can read a draft here.

Disclaimer: I recently started as an interpretability researcher at Anthropic, but I wrote this post before starting, and it entirely represents my personal views, not those of my employer.

Intended audience: People who understand why you might think that AI Alignment is important, but want to understand what AI researchers actually do and why.

Pedagogy note: I link to many papers and blog posts to read more about each area. I think technical writing is often harder to digest without a big picture in mind, so where possible I link to Alignment Newsletter summaries for a piece. There are a lot of links, so I recommend reading the summaries for anything interesting, but being selective about which full-length works you read.

Terminology note: There is a lot of disagreement about what “intelligence”, “human-level”, “transformative” or AGI even means. For simplicity, I will use AGI as a catch-all term for ‘the kind of powerful AI that we care about’. If you find this unsatisfyingly vague, OpenPhil’s definition of Transformative AI is my favourite precise definition.


A common approach when setting research agendas in AI Alignment is to be specific, and focus on a threat model. That is, to extrapolate from current work in AI and our theoretical understanding of what to expect, to come up with specific stories for how AGI could cause an existential catastrophe. And then to identify specific problems in current or future AI systems that make these failure modes more likely to happen, and try to solve them now.

It is obviously really hard to reason about the future in a specific way without being wildly off! But I am pretty excited about approaches like this. I think it’s easy for research (or anything, really) to be meandering, undirected and not very useful, especially for vague and ungrounded problems such as AI Alignment, which is essentially trying to fix problems in a technology that doesn’t exist yet. And having a specific story to guide what you do can be a valuable source of direction, even if ultimately you know it will be flawed in many ways. Nate Soares makes the general case for having a specific but flawed story well.

Note that I think there is very much a spectrum between this category and robustly good approaches (a forthcoming post in this sequence). Most robustly good ways to help also address specific threat models, and many ways to address specific threat models feel useful even if that specific threat model is wrong. But I find this a helpful distinction to keep in mind.

Pedagogy notes:

  • When discussing threat models, it’s often helpful to give a specific story of exactly how things could go wrong. But this can be misleading, because we often find stories more compelling the more detailed they are, yet mathematically every time you add an extra detail to a story, it becomes less likely (mistaking the more detailed story for the more probable one is often called the conjunction fallacy). As such, where possible, I try to distill each threat model down to a simple set of assumptions.

  • Often, the part I consider most interesting is less the specific threat model, and more the intuitions and worldviews that underlie it. As such, the descriptions of each case are often far longer than necessary, so I can flesh out my intuitions, give illustrative examples, etc. You can disagree with the threat model and agree with the intuitions, and vice versa

    • Feel free to skip around if the sections feel overly long-winded; the high-level sections can be read in any order.
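To make the point about detailed stories concrete, here is the underlying inequality with a small worked example. The numbers are purely illustrative assumptions (five details, each 80% likely, treated as independent), not estimates for any real threat model.

```latex
% Adding a detail A_n to a story can never make the whole story more likely:
P(A_1 \wedge \dots \wedge A_n)
  = P(A_1 \wedge \dots \wedge A_{n-1}) \, P(A_n \mid A_1, \dots, A_{n-1})
  \le P(A_1 \wedge \dots \wedge A_{n-1})

% Illustrative assumption: n = 5 independent details, each with P(A_i) = 0.8
P(A_1 \wedge \dots \wedge A_5) = 0.8^5 \approx 0.33
```

So a story built from five individually plausible details can still be more likely false than true, which is part of why I prefer a short list of assumptions.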

Power-Seeking AI

This is the classic case outlined by earlier proponents of AI Alignment, especially Nick Bostrom and Eliezer Yudkowsky. It is outlined most clearly in Superintelligence. Joseph Carlsmith recently wrote a more up-to-date report examining a similar case, and distilling it down to a simpler set of assumptions.

The case

We produce AGI. We believe this will be a goal-directed agent, trying to maximise a goal. Our current techniques cannot shape the goals of AIs very precisely and, worse, human values are highly complex and nuanced and vary between people, making them extremely hard to specify precisely. This will plausibly still be true by the time we produce AGI, so we will probably not be able to give it precisely the right goal. Further, maximising most large-scale goals means the AGI will have many instrumentally convergent goals—it will want to gain power, influence, resources and avoid being turned off, because these are instrumentally helpful for a wide range of tasks.

As goal specification is so hard, the AGI will inevitably want different things from us. It will have superhuman planning capabilities, meaning it will be better at coming up with ways to get what it wants than we will. And so it will likely come up with creative plans that we cannot predict and successfully guard against, because it is very hard to outwit something significantly smarter than you. A specific way this could go wrong is by creating an incentive to deceive us, to act perfectly aligned and to pass all tests we give it, until it can gain enough influence to decisively take power: a treacherous turn. This is not necessarily how things would actually go down; the key point is that if a system is better at planning than us, has different goals, and can influence the world, this can go wrong in many catastrophic ways.

Personally, I overall find this case fairly persuasive, and I expect there are significant grains of truth in this. It is by far the oldest and most established of the threat models I discuss, and has seen far more rigorous treatment than the others, but could still do with significantly more study. In particular, simplistic discussions of this model often bake in significant implicit assumptions, and it has often faced criticism.

When first encountering this case, it’s easy to assume it must apply to future powerful AIs, which I don’t think is obvious. I find it helpful to distill out the implicit assumptions. One set of maybe sufficient assumptions (mostly borrowed from Rohin Shah’s summary of Joseph Carlsmith’s report):

  • Advanced capabilities: The system is able to outperform the best humans on some set of important tasks (such as scientific research, business/​military/​political strategy, engineering, and persuasion/​manipulation)

    • This is important because a system that’s only as good as a typical human, or worse, likely isn’t smart enough to outwit our attempts to control it

  • Agentic planning: The system (a) makes and executes plans, (b) in pursuit of objectives, (c) on the basis of models of the world.

    • This is important to draw out explicitly—a system like GPT-3 may have extremely advanced capabilities in some sense, but is not an agentic planner, so I am not very scared of it.

  • Strategically aware: It models the effects of gaining and maintaining power over humans and the real-world environment.

    • This is necessary for it to be able to form plans to outwit humans and gain power in the real world—this is a special and necessary form of advanced, agentic planning

    • There are some proposals like STEM AI to create an AI that is only good at creating progress in technical fields like maths, physics and chemistry and doesn’t understand human psychology well, so it is less likely to be able to deceive us.

  • Power-seeking misalignment: There are situations where the system is incentivised to gain power to achieve its objectives, against our wishes

    • This is important because, ideally, aligned AGI would be powerful and useful enough to have all of the first three properties, but just not do anything bad with them.

Other criticisms:

  • When naively considered, this framework often implicitly thinks of intelligence as a mysterious black box that cashes out as ‘better able to achieve plans than us’, without much concrete detail. Further, it assumes that all goals would lead to these issues.

    • The sections on Goals and Agency in Richard Ngo’s AGI Safety from First Principles do a good job of disentangling this.

  • This case has been around since well before deep learning came on the scene, and some implicit ideas in earlier versions of the arguments now seem less plausible:

    • ‘Expected utility maximiser’ does not seem to describe modern systems (or humans!) very well.

    • It was previously believed that systems, when they reached human level, would learn to edit their source code, and thus make themselves smarter, become better at editing their source code, etc, leading to a rapid rise in capabilities from human level to vastly superhuman: an intelligence explosion. ML systems are different as they need to first be trained, which takes a lot of time and compute, making it less likely that there could be such a big discontinuity in capabilities, and making it less likely that we get caught by surprise.

    • See Tom Adamczewski’s discussion of how arguments have shifted

  • Ben Garfinkel has been a prominent critic of the public case for this model, and points out a range of other holes.

The work

  • Understanding the incentives and goals of the agent, and how the training process can affect these in subtle ways

    • Work on specification gaming (aka reward hacking) - how AI finds ways to optimise reward functions in unexpected and perverse ways

    • Causal influence analysis work from Tom Everitt and Ryan Carey—using causal influence diagrams to better understand how subtle details of the training process can significantly affect the resulting incentives of the agent

  • Limited optimization: Many of these problems inherently stem from having a goal-directed utility-maximiser, which will find creative ways to achieve these goals. Can we shift away from this paradigm?

    • Satisficers: Rather than making an optimizer striving to do as well as possible, make an agent trying to do ‘good enough’

      • Jessica Taylor’s Quantilizers are a cool formalisation of this—find the best policy that doesn’t deviate too much from what a human would do

    • Imitation: Train agents to imitate humans. A human wouldn’t try to take over the world (probably), so an excellent imitator wouldn’t.

    • Myopic agents: Give an agent an inherently time-bounded goal, eg ‘maximise reward over the next minute’. This time scale is too short for large scale planning, deception etc to make sense, but may still be useful for us.

  • Aligning AIXI: AIXI is a theoretical ideal of a Bayesian reinforcement learning agent, and still has these problems of instrumentally convergent goals and power-seeking behaviour. So a theoretical angle of work is to try defining an aligned version of AIXI, and proving that this works. We can think of any RL system as an approximation to AIXI, and wouldn’t expect an approximation to an unaligned ideal to be aligned itself, so solving this could be significant progress.

    • Michael Cohen does good work here.

      • (Conflict of interest note: I interned under Michael in late 2020)
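To make the quantilizer idea from the limited optimization bullets a little more concrete, here is a minimal sketch. It assumes the usual informal description: draw candidate actions from a trusted base distribution (eg imitating a human), then pick at random among the top q fraction by utility rather than taking the single best. The function name `quantilize`, the finite-sample approximation and the toy utility are all my own illustrative assumptions, not code from the paper.

```python
import math
import random


def quantilize(base_samples, utility, q, rng):
    """Pick a random action from the top-q fraction of base samples.

    base_samples: actions drawn from a trusted base distribution
                  (hypothetically, what a human might do).
    utility:      function scoring each action.
    q:            fraction in (0, 1]; q=1 is pure imitation of the base
                  distribution, small q approaches plain maximisation.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=utility, reverse=True)
    cutoff = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:cutoff])


# Toy usage: actions are just numbers and utility is the number itself.
rng = random.Random(0)
actions = list(range(100))
print(quantilize(actions, lambda a: a, q=0.1, rng=rng))
```

The design choice to notice is that q trades off usefulness against how far the agent can drift from the base distribution: an action that is extremely rare under the base distribution can only be chosen with limited probability, however high its utility.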

Sub-Threat model: Inner Alignment

A particularly concerning special case of the power-seeking concern is inner misalignment. This was an idea that had been floating around MIRI for a while, but was first properly clarified by Evan Hubinger in Risks from Learned Optimization.

I think this is extremely important but notoriously hard to get your head around. Accessible overviews: Rob Miles, Rafael Harth. Sources to learn more: Evan’s interviews on FLI and AXRP, the Risks from Learned Optimization paper.

The Case

We first begin with the analogy of humans and evolution: From a certain point of view, evolution is an optimization process that searches over the space of possible organisms and finds those that are good at reproducing. Evolution eventually produced humans, who are themselves optimizers, and we care about a range of goals, such as status, pleasure, art, knowledge, writing posts for the Alignment forum, etc. And in the ancestral environment, pursuing these goals resulted in significant reproductive success. But in the modern world we continue to optimize our goals, yet totally fail to maximise reproductive success, eg by using birth control. Thus, from the perspective of evolution, humans are misaligned.

The key feature of the setup here is that we had a base optimizer, evolution, an optimization process searching over possible systems according to how well they performed on a base objective, reproductive success. And this base optimizer eventually found a system, humans, that was itself optimizing. Humans are an example of a mesa-optimizer, an optimizing system found by a base optimizer, and humans are pursuing mesa-objective(s).

The core problem is that the base objective (reproductive success) and the mesa objective (status, pleasure, etc) are not the same. This happened because evolution only cares about the performance of a system in the ancestral environment, rather than what the system’s mesa-objective truly is. And there are many possible mesa-objectives that will lead to reproductive success in the ancestral environment, but may lead to totally different outcomes in other environments—as happened with humans.

This setup is similar to modern deep learning: we search over possible neural network weights with stochastic gradient descent (SGD), the base optimizer, according to our loss function, the base objective. And, further, SGD only pays attention to a network’s performance on the loss function on the training data, and pays no attention to how a network actually works. So the concern is that deep learning may result in neural networks that are optimizing systems pursuing mesa-objectives, but we have no way of ensuring these objectives are the same as the base objective.

This concern introduces significant further complexity into the alignment problem. Optimization is scary, and a highly capable system pursuing an objective misaligned with ours will likely lead to bad outcomes. But with mesa-optimizers, we have two objectives: the base objective, and the mesa-objective. So we need to both solve the outer alignment problem, ensuring the base objective is aligned with human values, and the inner alignment problem, ensuring that the mesa-objective is the same as the base objective.

A key feature of the inner alignment problem is that the base objective underdetermines the mesa-objective. Our main tool for reasoning about the outcome of training a neural network is evaluating which parameters lead to good performance on the training data. This tool breaks down here, as there are likely many mesa-objectives that perform well on the base objective on the training data, some of which will be aligned (as in, they generalise safely to new environments), some of which will not be. So the key question is which mesa-objective we will end up with.

In practice, if we end up with mesa-optimizers, they will have performed well on the base objective on the training data. There are many ways this could happen, here are three of the most important:

  • Inner alignment: The mesa-objective and base objective are the same

  • Proxy alignment: The mesa-objective is a good proxy for the base objective on the training data, but looks different in other environments. This is what happened with humans and evolution.

  • Deceptive alignment: As an instrumentally convergent goal, the mesa-optimizer learns to deceive the base optimizer. It learns what the base objective is, and perfectly optimizes the base objective while being trained. Because the system performs well in training, it is eventually deployed, but once deployed it commits a treacherous turn.

    • This is a particularly scary outcome, because proxy alignment may be fairly easy to notice with sufficient testing, eg running the system in a wide range of simulated environments. Noticing deceptive alignment requires us to be able to trick the mesa-optimizer into thinking it is no longer in the training environment, which may be much harder.
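A toy illustration of proxy alignment may help here. The sketch below is entirely hypothetical (a 1D corridor, two hand-written policies, made-up coin positions): in training the coin is always to the right of the agent, so a policy that seeks the coin and a policy that just walks right are indistinguishable on the training data, and only differ once the environment changes.

```python
def run_episode(policy, coin_position, start=5, length=11, max_steps=20):
    """Return True if the agent reaches the coin in a 1D corridor."""
    position = start
    for _ in range(max_steps):
        if position == coin_position:
            return True
        position += policy(position, coin_position)
        # Walls at both ends of the corridor.
        position = max(0, min(length - 1, position))
    return position == coin_position


def seek_coin(position, coin_position):
    """Intended objective: step towards the coin."""
    return 1 if coin_position > position else -1


def go_right(position, coin_position):
    """Proxy objective: always step right, ignoring the coin."""
    return 1


def success_rate(policy, coin_positions):
    """Fraction of environments in which the policy reaches the coin."""
    wins = sum(run_episode(policy, coin) for coin in coin_positions)
    return wins / len(coin_positions)


training_coins = [8, 9, 10]   # coin always to the right of the start
deployment_coins = [0, 1, 2]  # coin now to the left

for name, policy in [("seek_coin", seek_coin), ("go_right", go_right)]:
    print(name,
          "train:", success_rate(policy, training_coins),
          "deploy:", success_rate(policy, deployment_coins))
```

Both policies get a perfect score in training, so the training signal alone cannot tell them apart. Only `seek_coin` still succeeds when the coin moves. This is of course far simpler than a real network, and says nothing about which of the two a real training process would actually find.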

I think there is the seed of an important idea here, but a lot of the discussion seems divided and confused, especially regarding what terms like optimizer actually mean. (See eg Evan Hubinger’s Clarifying Inner Alignment terminology). And while humans fail inner alignment, humans do not seem like expected utility maximisers. Personally, I’m not convinced that we will ever produce neural networks that act like expected utility maximisers.

Another framing that side-steps the question of defining optimization is the 2D model of robustness. When we successfully train a model to act in an environment, it will take purposeful actions to achieve the intended objective. But when we shift the model to a different environment, there are three things that can happen. It may fail to take any purposeful actions at all, it may take purposeful actions but not towards the intended objective (its capabilities have generalised but its objective has not) or it may take purposeful actions towards the intended objective (its capabilities and objective have generalised). This breaks the question of ‘does the model successfully generalise?’ into the questions of ‘do the model’s capabilities generalise?’ and ‘does the model’s objective generalise?’. This is a helpful distinction, because causing an existential catastrophe is really hard, and so is much more likely to occur from an agent taking purposeful actions and capable of planning.

Personally, I find the inner misalignment threat model to be incredibly compelling, and it was a major factor in my decision to work on interpretability. But I’m not necessarily convinced by any of the more specific framings, eg specific narratives of what a mesa-optimizer might look like. My best attempt to distill out the core argument is as follows:

  • The main thing that determines the parameters output by a neural network training process is that these parameters encode a function that has good performance on the training data according to the given objective

  • Underdetermined: There are many possible sets of parameters that all result in similarly good performance

    • This seems trivially true—neural network parameters lie in an incredibly high-dimensional space

  • Underdetermined cognition: There are networks implementing importantly different underlying algorithms, which result in similarly good performance, and which could all be output by network training. We can think of these algorithms as the network’s “cognition”.

    • This is much less obvious, since we understand little about the actual algorithms networks are running, but seems plausible

  • Possibility of misaligned cognition: The cognition of the network can be considered to have an objective, and some possible objectives are misaligned yet still result in good performance on the training objective

    • This seems plausible in theory—a deceptively aligned mesa-optimizer could probably be implemented in a sufficiently large and complex network, and would probably perform well on the training objective

  • Plausibility of misaligned cognition: It is likely that, in practice, we will end up with networks with misaligned cognition

    • This one is a total unknown—because the cognition is underdetermined, we don’t know how likely misaligned vs aligned cognition is!

    • Plausibly misaligned cognition is very unlikely, because it is a really hard problem to realise you’re a network being trained, and to form sophisticated models of the world that let you deceive your trainers

    • Plausibly misaligned cognition is really likely—if the network needs a sophisticated world model anyway to solve the task and forms an objective fairly randomly, then most possible objectives may be instrumentally incentivised to deceive the operators, and capable of doing this

The work

This is a new and fairly poorly understood problem—it’s not even obvious that we will get mesa-optimizers—so I divide the work into understanding the problem and solving the problem.


  • Better understanding how and when mesa-optimization arises (if it does at all).

    • Eg, researching which training processes make mesa-optimization more or less likely to occur.

  • Empirical data: Actually making concrete examples of mesa-optimizers

    • It’s very hard to tell whether a neural network is actually an optimizer, so the main currently tractable approach is gathering empirical data of models whose capabilities generalise but whose objectives don’t.

  • Inductive biases: A neural network is a parametrisation of a space of functions. There are many different functions that all fit the training data equally well, but perform differently outside of the training data. To understand what happens, a key problem is understanding the inductive biases: when we train a network, we know we’ll end up with weights with good performance on the training data, but how does the training process choose which weights with good performance to return? If some weights give mesa-optimizers and others don’t, which will be output?

    • For discussion of under-specification and how this affects modern ML, see this paper.

    • Ameya Prabhu recently wrote up an overview of what we know about inductive biases in neural networks.

    • A key part of the puzzle is deep double descent. In classic statistics, having more parameters means worse test set performance because you overfit. In deep learning, this trend eventually starts to reverse: when you have lots of parameters, performance starts to improve again.

      • My rough intuition is that when there are lots of parameters, there are many possible functions. The inductive biases of a neural network favour simpler functions, and simpler functions generalise better to the test set, so having more functions to choose from leads to simpler functions which generalise better. But this is poorly understood, and that intuition could be totally off.

    • An intuition I find compelling is thinking of neural networks as Bayesian—that they have a prior over possible network parameters, and update this prior based on how well network parameters fit the training data. Under this model, understanding inductive biases reduces to understanding the neural network prior. In reality, neural networks learn by stochastic gradient descent, not by Bayesian updating, but they can be roughly modelled as sampling their trained parameters from the posterior distribution over the space of possible parameters.

    • There are also theoretical angles here. Eg, does AIXI have mesa-optimizers? - intuitively, if we search over all possible computer programs for ones that perform well on our loss function, will we end up with computer programs that are mesa-optimizers?

  • Abram Demski has written a research agenda for formally defining and understanding Inner Alignment


  • Interpretability looks particularly good here: anything that makes models less of a black box, so we could look inside and see what the model is optimizing. This lets us check both whether there is a mesa-optimizer and whether its objective is aligned, and hopefully lets us develop and verify solutions to this problem.

    • I discuss actual work on interpretability in the section on robustly good approaches (forthcoming post)

    • One way you might try to solve this is with black box interpretability techniques, essentially creating a test set to catch bad behaviours. Eg simulating a situation where the model can’t be stopped, and seeing if it commits a treacherous turn. Some argue that this approach is limited as there is inevitable distributional shift between the test set and the real world, and that a deceptively aligned mesa-optimizer could act perfectly aligned until seeing an unfakeable sign of being out of training, meaning we need better interpretability techniques.

      • One example of this from Paul Christiano: RSA-2048 is a known number that is the product of two extremely large but unknown prime numbers. We do not know the factorisation, and do not have the computing power to calculate it now, but likely will in the next 100 years. So ‘have I seen a factorisation of RSA-2048?’ is an example of unfakeable evidence that the model is deployed in the real world, and not in a simulation.

    • For more surface area on how interpretability tools could be helpful here, I recommend skimming Evan Hubinger’s 11 Proposals for Safe Advanced AI—for each proposal, he discusses how inner alignment might be solved

  • This is highly related to robustness, especially the problem of distributional shift. It’s fundamentally about systems that behave differently in training and the real world (discussed more in the forthcoming section on robustly good approaches)

You get what you measure

This is the threat model outlined in What Failure Looks Like (Part 1) by Paul Christiano. I found the post insightful, but also somewhat cryptic, and found these clarifications from Ben Pace and Sam Clarke helpful. Paul Christiano’s Another (outer) alignment failure story is another story outlining a related threat model, which I also found helpful. Anecdotally, some researchers I respect take this very seriously—it was narrowly rated the most plausible threat model in a recent survey. This case has been less fleshed out than those above, so the following is more my attempt to flesh out and steelman the case and less focused on summarising existing work.

The case

Reinforcement learning systems are great at optimizing simple reward functions in clever and creative ways, and are getting better at optimizing all the time, but we struggle to optimize complex reward functions, and are seeing much less progress there. As AI systems become more influential on the world and a bigger part of the global economic system, we will want them to achieve complex and nuanced goals, as human values are complex and nuanced. Assuming that we remain much better at achieving simple rewards, this means we will need to approximate our true goals and define a proxy goal for the system. And if enough optimisation power is applied to a proxy goal, eventually these imperfections will become magnified, resulting in potentially catastrophic outcomes.

This pattern of simple, easy-to-measure proxies to achieve complex goals is widespread (the formal jargon is Goodhart’s Law). For example, GDP is often used as an easy-to-measure proxy for measuring prosperity—this can work fairly well, but misses out on major components such as life satisfaction. Or, academia is intended to be a system to produce good science and advance human knowledge, by incentivising academics to publish rapidly, get many citations and publish in high-impact journals (this one often fails).
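Here is a toy numerical sketch of how a proxy can track the true goal under mild optimisation and then come apart under heavy optimisation. Everything in it is an illustrative assumption: the single knob `x` (think ‘papers published per year’), the made-up true value function that rises and then falls, and the greedy hill-climber standing in for ‘optimisation power’.

```python
def proxy_score(x):
    """Easy-to-measure proxy: just counts x (eg number of papers)."""
    return x


def true_value(x):
    """Hypothetical true goal: improves with x at first, then degrades."""
    return x - x ** 2 / 10


def optimise_proxy(steps, step_size=1.0):
    """Greedy hill-climbing on the proxy alone.

    More steps stands in for more optimisation power. The true value
    is never consulted, only the proxy.
    """
    x = 0.0
    for _ in range(steps):
        candidate = x + step_size
        if proxy_score(candidate) > proxy_score(x):
            x = candidate
    return x


for steps in (0, 5, 10, 20):
    x = optimise_proxy(steps)
    print(f"steps={steps:2d}  proxy={proxy_score(x):5.1f}  "
          f"true={true_value(x):6.1f}")
```

With these made-up functions, five steps of optimisation give the best true value (2.5), ten steps bring it back to zero, and twenty steps drive it to -20, while the proxy keeps climbing throughout. The shape of the curve is assumed, not derived, so this only illustrates the qualitative claim.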

The notion of simple vs complex reward functions is doing a lot of work here, and is hard to define explicitly. Intuitively, I think of simple as “easy to measure”—could I give a system lots of samples from this reward function while training? In practice, reinforcement learning systems are often trained from very easy to measure functions, such as the score in a video game. It may be possible to train a system on more complex rewards, eg by having it directly ask a human for feedback, but we need systems trained on these complex rewards to also be competitive with systems trained on simple rewards—can we get comparable performance at comparable cost?

This phenomenon is not specific to AI, the world is already heavily shaped by systems optimising simple proxies, eg corporations maximising profit, and this is not (yet) an existential catastrophe. So why be concerned about AI?

One major reason that this is not currently a catastrophe is that society shapes and updates these proxies as the imperfections become clear through tools such as regulation. For example, ‘maximise profit’ is a bad proxy for ‘make society better’ as it doesn’t account for costs to third parties such as pollution. But we live in a world which is far less polluted than it could be, thanks to taxes and laws about pollution.

But this error-correction mechanism may break down for AI. There are three key factors to analyse here: pace, comprehensibility and lock-in.

Pace: How rapidly is the technology being developed and deployed? When trying to react to and regulate new technologies, it is much harder when things are moving at a fast pace—when things are slow, you have more time to react, coordinate, learn from failures, etc. For example, governments are having a really hard time regulating new technologies like drones and social media. AI is developing extremely rapidly even today, and if it becomes a significant fraction of global GDP this could plausibly be much worse, as far more resources will be put into it. (Note: This is not an argument for discontinuous/​fast takeoff, a ‘slow’ takeoff would still likely be very hard to respond to. (Discussed more in the forthcoming key considerations post))

Comprehensibility: Can we see what the system is doing and why? If so, it’s much easier to identify problems and notice them early. For example, regulating recommender systems is particularly hard because it’s hard to tell how the algorithm is making decisions, eg concerns around the Facebook algorithm radicalising people. A related point is that when there is a problem that will require coordination and decisive action to solve, this is much easier with legible, uncontroversial and early evidence. For example, smoking is terrible for you, but it took a long time to realise this and discourage use because the link to lung cancer is noisy and acts on long time horizons. AI is currently mostly an incomprehensible black box, and will likely remain that way without significant progress in interpretability.

Lock-in: Once we’ve noticed problems, how difficult will they be to fix, and how much resistance will there be? For example, despite the clear harms of CO2 emissions, fossil fuels are such an indispensable part of the economy that it’s incredibly hard to get rid of them. A similar thing could happen if AI systems become an indispensable part of the economy, which seems pretty plausible given how incredibly useful human-level AI would be. As another example, imagine how hard it would be to ban social media, if we as a society decided that this was net bad for the world. See Sam Clarke’s excellent post for more discussion of examples of lock-in.

So, how bad is all this? My personal take is that an inappropriate focus on optimising metrics is clearly already happening in the world today, is causing many bad effects (and many good ones!) and that AI will plausibly make this significantly worse. But it is highly unclear that this actually results in existential risk. Maybe the AIs will cause terrible collateral damage to eg the atmosphere or drinkable water (see discussion), maybe they’ll never cause a catastrophe but result in the lock-in of suboptimal values, maybe they’ll cause a bunch of short-term damage but we’ll manage to fix things. It’s very unclear! As a brief aside, I’ve also updated in favour of outcomes like this over the course of the COVID-19 pandemic—as of the time of writing, there are numerous examples of things I consider to be obvious errors that have been left unfixed for a while (not widely using fluvoxamine, not preparing more for future pandemics, etc)

The work

  • One of the most promising directions I’ve seen for this is directly training systems to get feedback from human operators, and learn to optimise for that feedback. This lets the reward signal be anything that humans can judge, and allows for much more complex rewards. This is known as deep reinforcement learning from human feedback.

    • A core difficulty here is that humans are expensive, and ML systems need a lot of data, so we need to become more data-efficient. One approach is to create a reward model based on small amounts of human feedback, train the system by querying the model, and asking the human for feedback on the most uncertain data points. An important paper here is Deep RL from Human Preferences, which managed to use just 900 bits of human feedback to teach a (humanoid) noodle to backflip

    • The OpenAI Alignment team has continued to do great work here, with two papers on teaching language models to summarise text based on human feedback—it’s very difficult to code a good reward function for ‘is this text a good summary’ but you know it when you see it, so this was a meaningful advance on what we can do with simple reward functions.

      • One interesting finding here was that the quality of human feedback really matters: performance went up significantly when they worked more closely with contractors to explain the task, and paid by the hour rather than by the summary (incentivising more thorough work)

      • Another interesting finding was that if the system tries to optimise for its reward model, it will overfit and produce garbage. But if we regularise by constraining it to not deviate too much from a known OK policy, and then optimise for the reward model, it does great.

    • A more recent and more directly useful advance is the OpenAI Instruct Series—versions of GPT-3 fine-tuned to be good at following instructions and being helpful (currently available on the Beta API)

    • One limitation of human feedback work is scalable oversight: using it for tasks that are too long or complex for a human to easily evaluate. OpenAI has recently explored extending these techniques to summarise books, as a simple example task that’s difficult to get human feedback for.

    • Anthropic (my employer) has also recently released work exploring the strengths and limitations of these techniques to create helpful, honest and harmless AI

  • Jan Leike (head of the OpenAI Alignment team) has an agenda of recursive reward modelling. This is a more powerful technique than just reward modelling, and resolves the problem of modelling complex reward functions by recursively breaking the problem down into smaller pieces, training reward models for each of those, and combining them into a reward function for the original problem. This is a particularly exciting approach, because as capabilities advance we could use this to create better reward models and thus safer systems. (See his paper and interview for more details)

    • Arguably, recursive reward modelling is a fully-fledged agenda to create aligned AGI. I categorise this as a way to help with this specific threat model, because I expect better reward models to be helpful across many paths to AGI

  • A particularly concerning failure mode of human feedback is producing AIs that lie to us—that tell us what we want to hear, rather than what is actually true. There has recently been interest in building a field of Truthful AI, of understanding how to create AI that will not lie to us

    • In Truthful AI, Owain Evans operationalises this idea, and argues for its importance

    • In TruthfulQA, Stephanie Lin creates a benchmark to measure the truthfulness of language models, to better enable and measure research here

    • In WebGPT, Jacob Hilton explores a practical way to create more reliable and truthful language models by giving them access to a web browser, and training them to cite their sources.

    • A particularly concerning aspect of training honesty via human feedback is that we all have biases, blind spots, and incorrect beliefs. This means that we will often reward model behaviour that conforms to our biases, and punish behaviour that goes against them. If a model is capable enough to know better, honesty is actively disincentivised. Paul Christiano explores this problem from a theoretical lens (see second summary here).

  • Ajeya Cotra recently wrote up an agenda for this kind of work on aligning narrowly superhuman AI. Cutting edge systems such as GPT-3 are interesting, because alignment (in the sense of ‘getting the system to do what we want’) is becoming a bottleneck on the tasks we can use the system for, rather than capabilities (in the sense of ‘what does the system know how to do’). For example, GPT-3 has much better medical knowledge than most doctors, but lies all the time, so currently cannot be used to safely give medical advice—this is an alignment problem, not a capabilities problem. The agenda argues that we should take these systems, and work to get them using their existing capabilities to their fullest extent, and learn how to train the system to do fuzzy and hard-to-specify tasks. And that this will likely give us good feedback on which alignment techniques actually work, make current systems safer, and make discontinuities in capabilities less likely.

    • The main work I have seen on this is deep RL from human feedback, but I can imagine other directions being fruitful!

    • I feel excited about this research direction, and as there is much more short-term economic incentive for this work than for most alignment research, I hope there will soon be good work from non-longtermist researchers on it (though see Ajeya’s case for why this is not that economically incentivised)

  • There’s a bunch of great work from non-longtermists on this front, e.g. the fields of explainable AI and of fairness and algorithmic bias seem highly relevant, though I know less about them
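
To make the reward modelling idea above a bit more concrete, here is a minimal sketch in Python. It assumes a Bradley-Terry style preference model, where the probability that a human prefers one output over another depends only on the difference in reward-model scores, and it expresses the ‘don’t deviate too much from a known OK policy’ regularisation as a penalty on how far the policy’s log-probability drifts from a reference policy’s. The function names and the value of beta are my own illustrative choices, not taken from any of the papers above.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Probability that a human prefers output A over output B,
    assuming a Bradley-Terry model over reward-model scores."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def preference_loss(reward_a: float, reward_b: float, human_prefers_a: bool) -> float:
    """Cross-entropy loss for one human comparison.
    Minimising this pushes the reward model to agree with the human label."""
    p_a = preference_probability(reward_a, reward_b)
    return -math.log(p_a if human_prefers_a else 1.0 - p_a)


def regularised_reward(
    model_reward: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Reward-model score minus a penalty for drifting away from a
    reference ('known OK') policy on this sample."""
    return model_reward - beta * (logprob_policy - logprob_reference)


# A reward model that already scores the preferred output higher gets a low loss.
agree_loss = preference_loss(2.0, 0.0, human_prefers_a=True)
disagree_loss = preference_loss(2.0, 0.0, human_prefers_a=False)
```

The point of the penalty term is that the policy is only rewarded for things the reward model likes in regions where the reward model has seen data, which is one way to think about why optimising the reward model alone overfits.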

AI-caused coordination failures

This is the case I’ve seen most pushed by Andrew Critch, David Krueger and Allan Dafoe. Critch and Krueger discuss it in their ARCHES paper, and Critch discusses it on the FLI podcast and in What Multipolar Failure Looks Like. Allan Dafoe discusses a related notion in Open Problems in Cooperative AI and this is a focus of his new foundation the Cooperative AI Foundation. There hasn’t been that much work fleshing out this case, and I don’t understand it as well as I’d like to, so the following is my interpretation and my best attempt to steelman the core ideas, rather than solely my attempt at a summary. I am much less confident in this section than the previous ones

The Case

This threat model stems from a worldview that sees cooperation and coordination failures as a fundamental lens through which to understand the world. Cooperation is hard and unstable, and coordination failures are the default state of the world, yet successful cooperation is the root of much of the value in the world. The concern centres around AI destabilising the current institutions and norms that enable cooperation, and causing coordination failures. (Terminology note: Cooperation here can encompass cooperation between humans and humans, between AIs and AIs, and between humans and AIs)

I see this less as a single coherent case and more as a general prior that cooperation is hard yet crucial, and that destabilisation will be bad. There are a bunch of specific points and stories, but I think you can disagree with those while buying the overall case.

Some rough intuitions for a cooperation-centric worldview: Cooperation is unstable, because it involves many actors working together, where each actor is self-interestedly incentivised to defect, in a way that imposes costs on others. Enough actors are self-interested that you need good institutions and norms to avoid them defecting. And most of human history is defined by being in a perpetual state of war and conflict. In modern times, some coordination failures have been extremely bad, eg WW1 and WW2, climate change, air pollution, etc. Whereas when we get cooperation right, eg trade, peace, well functioning governments, etc, it is responsible for a lot of the progress humanity has made.

So, why would AI make cooperation worse/​harder?

  • Pace: AI systems will likely develop rapidly in capabilities, giving us less time to get used to them. And they will likely be able to think much faster than humans. Much of the cooperation in the world today comes from norms and institutions that are slow to develop and take time to build, which is much harder in a rapidly moving world.

    • This may be especially bad if slow-moving human bureaucracies are responsible for cooperation and building norms and institutions, and these are not speeding up along with AI

  • Destabilisation: A world with AI will be very different to the world today. This likely means that different people, countries and institutions will have power. We are currently in something like a stable and reasonably cooperative equilibrium, but because cooperation is really hard, there is no reason to expect that a world with AI will be as cooperative as today’s world, because worlds as cooperative as today’s world are rare and special.

    • In particular, if established power structures are overturned, then many entities could get a lot of power if they act decisively, which encourages reckless behaviour from many actors. This is unlike today’s world, which has a small handful of established superpowers (China, USA, Russia, etc), with most countries having no realistic path to superpower status

    • Another angle on this is if AI and technological progress change the offence-defence balance of technology. Eg, if it turns out to be extremely easy to create tiny autonomous drones to assassinate people, and really hard to defend against this. Or if AI makes it much easier to create extremely persuasive arguments and videos (a la deepfakes) and this is hard to defend against, this may create a breakdown of trust and public discourse

  • Transfer: Institutions that work well on humans may fail to transfer to AIs, eg it’s unclear what successful law enforcement would look like on an autonomous system that can replicate itself.

    • We already have some taste of this with cybercrime, a result of the new capabilities from computers and the internet—our current criminal/​legal institutions struggle to prevent and prosecute Russian or North Korean hackers attacking entities in the West.

Maybe you agree that cooperation would be harder, and that this would be bad. But would this lead to an existential risk? I personally find this fairly unclear, and don’t feel very compelled by any particular story, but I find it plausible that this could lead to extremely bad outcomes. See Section 3 of ARCHES and What Multipolar Failure Looks Like for more discussion.

One significant risk centres around collateral damage: the side effects of the coordination failures cause damage to eg the atmosphere or drinkable water, and this causes humanity to die out. The underlying intuition here is one of human fragility. There is a large range of possible ways the Earth could be (temperature, composition of the atmosphere, etc) that could lead to machines thriving, while humans need a fairly specific environment. This means that keeping the Earth a good place for human life will likely be expensive, and unless AIs care highly about this and make a special effort, we should not expect it to go well by default. This is an important argument, because it does not require an AI system to be an agent actively optimising for harming humans, or even for the resulting ecosystem of AIs to be viewable as coherently optimising anything at all.

This feels related to the concerns outlined in ‘You get what you measure’, but different. Before, AI systems caused collateral damage because the damage was instrumentally useful to their goals, and they weren’t programmed to care about the harms. Here, each agent may care about the harms, but not enough: if each agent only plays a small marginal part in the coordination failure, they may not be incentivised to change it. This is analogous to how, today, each country may find it valuable to burn fossil fuels, and accept the cost of their marginal contribution to climate change, even though most countries make a net loss on the total benefits and costs of global fossil fuel usage and climate change. The issue is that each country captures most of the benefits of their actions and a small fraction of their costs.
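
As a toy illustration of the incentive structure in the fossil fuel analogy, here is a small Python sketch. All of the numbers are made up purely for illustration: I assume 20 identical countries, each of which gains a private benefit of 10 from emitting, while each country’s emissions impose a total cost of 30 that is spread evenly across every country.

```python
def emission_payoffs(n_countries: int, private_benefit: float, global_cost: float):
    """Toy model: each emitting country keeps its whole private benefit,
    but the cost of its emissions is shared equally by all countries.

    Returns (gain to one country from choosing to emit,
             each country's payoff if every country emits).
    """
    cost_share = global_cost / n_countries
    # What one country gains by emitting, whatever the others do.
    marginal_gain = private_benefit - cost_share
    # If everyone emits, each country bears a share of every country's cost.
    payoff_if_all_emit = private_benefit - n_countries * cost_share
    return marginal_gain, payoff_if_all_emit


# Illustrative, made-up numbers.
marginal_gain, payoff_if_all_emit = emission_payoffs(
    n_countries=20, private_benefit=10.0, global_cost=30.0
)
print(marginal_gain)       # 8.5: emitting is individually worthwhile
print(payoff_if_all_emit)  # -20.0: everyone emitting leaves each country worse off
```

Under these assumptions every country is better off emitting regardless of what the others do, yet all of them end up with a net loss when everyone follows that reasoning.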

Another risk is that when multiple AI systems are interacting, their interactions can cause unexpected feedback loops that bring them far outside of their training distribution, resulting in extreme and unexpected behaviour. One mundane example of this is the 2010 Flash Crash, where interactions between badly programmed stock market trading bots resulted in a crash that temporarily wiped out roughly a trillion dollars of market value within minutes before recovering. A more speculative version of this raised by Critch is the Production Web, where the entire economy becomes automated and dominated by AIs, which build on each other and cause a great deal of growth, but where this growth does not cash out as morally relevant things such as increased human welfare, and the economy ceases to be human comprehensible. This seems particularly concerning with new ideas coming out of the crypto world such as Decentralized Autonomous Organizations, which could result in an economy where most resources are not, ultimately, controlled by humans, and which could become totally out of control. Nick Bostrom described the ultimate outcome of this kind of process run wild as ‘a Disneyland without children’.

A useful framework introduced in ARCHES for thinking about AI is that of a delegation problem: operators make AI systems, and want the AI systems to act to achieve their values. This delegation problem is very different depending on whether there are one or many operators, and one or many AIs, giving four different scenarios:

  • Single-single—The classic conception of the alignment problem, where a single operator wants to align a single AI with their values. This is already extremely hard!

  • Single-multi—A single operator is using many AI systems, and wants the resulting behaviour to achieve their values. This is a different problem from single-single alignment, because we know from game theory that a system of rational agents acting autonomously rarely maximises total value (eg the prisoner’s dilemma), and this often needs better coordination mechanisms

  • Multi-single—Many operators want to use a single AI system to achieve their values. Just figuring out what the AI system should be doing is difficult! (Value aggregation is a notoriously hard problem)

  • Multi-multi—the problem of many operators, and many interacting AI systems, where each operator likely has many AIs of their own, and we want the overall system to result in good outcomes. This sounds extremely hard!
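
The game theory point in the single-multi case can be checked directly on the prisoner’s dilemma. The sketch below uses the textbook payoff numbers (my choice, purely illustrative) and brute-forces which pairs of actions are Nash equilibria, ie where neither agent can do better by unilaterally switching.

```python
from itertools import product

ACTIONS = ("cooperate", "defect")

# PAYOFFS[(action_1, action_2)] = (payoff to agent 1, payoff to agent 2).
# Standard illustrative prisoner's dilemma numbers.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def is_nash_equilibrium(action_1: str, action_2: str) -> bool:
    """True if neither agent gains by unilaterally changing its action."""
    payoff_1, payoff_2 = PAYOFFS[(action_1, action_2)]
    agent_1_content = all(PAYOFFS[(alt, action_2)][0] <= payoff_1 for alt in ACTIONS)
    agent_2_content = all(PAYOFFS[(action_1, alt)][1] <= payoff_2 for alt in ACTIONS)
    return agent_1_content and agent_2_content


nash_equilibria = [
    pair for pair in product(ACTIONS, repeat=2) if is_nash_equilibrium(*pair)
]
best_for_total_value = max(PAYOFFS, key=lambda pair: sum(PAYOFFS[pair]))

print(nash_equilibria)       # [('defect', 'defect')]
print(best_for_total_value)  # ('cooperate', 'cooperate')
```

The only stable outcome is mutual defection, with a total payoff of 2, while mutual cooperation would give a total of 6. This gap is what better coordination mechanisms are meant to close.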

One insight from this framing is that we should expect the multi-multi delegation problem to be neglected. The Alignment problem, as normally conceived, is single-single delegation, and it is plausible that the creators of AGI will put significant resources into solving it (though likely not enough!), since it is clearly their responsibility. But no one has a clear responsibility to solve multi-multi delegation, and far fewer resources are invested into it today. Yet, given how much greater train-time compute is than run-time compute for modern ML, it seems likely that once we have trained AGI, we will run many copies of it, and plausible that many different actors will have access to AGI. This means that the multi-multi delegation problem will become relevant at almost exactly the same time as single-single, and is plausibly a much harder problem.

The Work

I expect much of the useful work here to be policy centred, eg creating good institutions, regulations and norms around AI use, incentivising international cooperation, etc. But the proponents also argue that there is important technical work to be done to create AI agents better able to cooperate, which is what I’ll focus on here:

  • A lot of existing mainstream ML research seems relevant to this. In ARCHES and an accompanying blog post Critch ranks different subfields of ML and safety according to their relevance to enabling cooperation.

    • Note: I expect the rankings in this post to be controversial. They are just one researcher’s opinion, and you shouldn’t update too much from just that post. I would give quite different numbers

    • Personally, I was surprised by how highly he ranks fairness in ML, computational social choice theory and accountability in ML

  • Open Problems in Cooperative AI is a recent paper trying to define a new research agenda of creating AI systems that better enable cooperation.

    • One notable thing about this paper is that most of the authors are, to my knowledge, DeepMind researchers who normally focus on Multi-Agent RL work, rather than alignment.

      • This suggests that there may be research communities uninterested in standard Alignment questions, who could get excited about doing research relevant to avoiding AI-caused coordination failures, which is exciting from a field-building perspective.

    • This has since led to the Cooperative AI Foundation for encouraging and funding this research, and a Nature editorial

  • An interesting research direction is formalising cooperation, what it means and how it should work. Critch is interviewed by Daniel Filan on AXRP about a related paper on negotiable RL

  • The Center on Long-Term Risk works on similar issues around game theory and encouraging good bargaining and cooperation between AIs, as summarised in this post. I am not very familiar with their work, but as I understand it they focus on avoiding breakdowns of negotiation between AI systems.

    • The centre is focused on avoiding risks of astronomical amounts of suffering (s-risks). This leads them to be interested in cooperative AI, because one way large amounts of suffering could be created is as a result of conflict between AIs.

  • I am fairly unfamiliar with this area, and have likely missed relevant things. The Alignment Newsletter summarises a lot of work to do with Handling Groups of Agents, and you can skim through that to get a sense for what other kinds of work happens

NOTE: This was intended to be a full sequence, but will likely be eternally incomplete—you can read a draft of the full sequence here.