Modeling the impact of safety agendas

This post is part 6 in our sequence on Modeling Transformative AI Risk, and deals with how safety research (that is, technical research agendas aiming to reduce AI existential risk) might affect risks from AI. In this series of posts, we are presenting a preliminary model of the relationships between key hypotheses in debates about catastrophic risks from AI. Previous posts in this sequence explained how different subtopics of this project, such as AI takeoff and mesa-optimization, are incorporated into our model.

We caution that this part of the model is much more of a work in progress than others. At present, it is best described as loosely modeling a few aspects of safety agendas, and the hope is that it can be further developed to match the quality of the more complete portions of the model. In many cases, we are unclear about how different research agendas relate to specific types and causes of risk. This is partly because this part of the model is a work in progress, but also because the theory of impact for many safety agendas is still unclear. So in addition to explaining the model, we will highlight things that are unclear and what would clarify them.

Modeling of different research areas is contained in the green-colored modules, circled below.

Key questions for safety agendas

We encountered several uncertainties about safety agendas which have made modeling them relatively difficult. While most of these uncertainties are explained throughout the next section, the following points seem like the most important questions to ask about a safety agenda:

  1. What is the theory of change? What does success look like, and how does that reduce AI risk? What aspects of alignment is the agenda supposed to address?

  2. What is assumed about the progression of AI? How much does the agenda rely on a particular AI paradigm?

  3. What is the expected timeline of the research agenda, and how much does that depend on additional funding or buy-in?

  4. What are likely effects from work on this agenda even if it doesn’t succeed fully? Are there spillover effects for AI capabilities or for other safety agendas? Are there beneficial effects from partial success?

Beyond just helping our model, understanding the above for various safety agendas has several benefits worth highlighting:

  • the community can better understand what constitutes success and provide better feedback on the agenda,

  • the researchers get a better idea of how to steer the research agenda as it progresses,

  • funders (and researchers) can better evaluate agendas, and in turn prioritise funding or pursuing them,

  • future research can find gaps in the safety-agenda space more easily.

Model overview

Overall impact of safety research

The key outcome to focus on for our model of research impact is Misaligned HLMI[1], circled in purple in the figure below. This node is obviously important to the final risk scenarios that we model, which will be covered in more detail in the next post. Looking at the inputs to this node, the model says we can avoid Misaligned HLMI if any of the following holds:

  1. HLMI is never developed (blue-circled node), or

  2. We manage to Correct course as we go, meaning HLMI is either aligned by default, or HLMI can be aligned in an iterative fashion in a post-HLMI world (orange-circled node), or

  3. HLMI [is] aligned ahead of time—that is, people find a way to align HLMI before it appears or needs to be aligned (green-circled node).

While it seems that most people in the AI safety community believe option 3 is necessary for safe HLMI, option 2 is argued for by some mainstream AI researchers and by some within the AI safety community. Option 1 would apply if humanity successfully coordinated to never build HLMI, though this possibility is not currently captured by our model. We will discuss option 2 further in the upcoming post on failure modes.
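To make this top-level structure concrete, here is a minimal sketch (in Python, rather than the model itself) of how the three routes combine. The node names follow the post, but the probabilities and the independence assumption are placeholders for illustration only, not outputs of our model.

```python
import random


def misaligned_hlmi(hlmi_developed, correct_course_as_we_go, aligned_ahead_of_time):
    """Misaligned HLMI is avoided if HLMI is never developed, or if either
    alignment route (iterative correction, or alignment ahead of time) works."""
    return hlmi_developed and not (correct_course_as_we_go or aligned_ahead_of_time)


def sample_once(p_hlmi=0.9, p_correct_course=0.3, p_ahead_of_time=0.2):
    # Placeholder probabilities, drawn independently purely for illustration.
    return misaligned_hlmi(
        random.random() < p_hlmi,
        random.random() < p_correct_course,
        random.random() < p_ahead_of_time,
    )


p_misaligned = sum(sample_once() for _ in range(100_000)) / 100_000
print(f"Illustrative P(misaligned HLMI): {p_misaligned:.2f}")
```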

If we assume that aligning HLMI ahead of time is a worthwhile endeavour (i.e. conditioning on the model of the risks), how would success be possible? It would come either through the success of currently proposed research agendas and their direct follow-ups, or through more novel approaches developed as we get closer to HLMI. New approaches may come from new insights or from a paradigm shift in AI. This is essentially what the HLMI aligned ahead of time module captures, shown below. In particular, this module currently includes Foundational research (specifically the highly reliable agent designs agenda) and Synthesized utility function. We include these two here for simplicity, but other agendas should fit in as well.

Our current model focuses primarily on currently proposed agendas, since we can (and want to) model them more concretely. Additional work on both the current agendas and the potential for future progress would be useful for improving our understanding of the risk and for building the model.

In this post we’ll focus on three (or perhaps two and a half) different approaches to safety: 1) Iterated Distillation and Amplification, 2) Foundational Research, and 3) Transparency, which has been proposed as a useful part of several approaches to safety, but may not be sufficient alone. In the following sections we go through our preliminary models of these agendas, and point out major uncertainties we have about them in bold.

Iterated Distillation and Amplification

We will use the IDA research agenda as the main example to explain our uncertainties about modeling, as it appears to have more published detail than most other proposed agendas regarding what is involved, its theory of change, and its assumptions about AI progress. The section of the model for IDA is shown in the figure below.

The final output of this section is IDA research successful. We assume that success means obtaining a clear, vetted procedure to align the intentions of an actual HLMI via IDA. However, the IDA agenda seems to address outer alignment (i.e. finding a training objective with optima that are aligned with the overseer), and not necessarily inner alignment (i.e. making a model robustly aligned with the training objective itself)[2]. This is a case where we are uncertain what parts of the alignment problem the agenda is supposed to solve, and how a partial solution to alignment is expected to fit into a full solution.

For IDA and other agendas, it’s also difficult to reason about degrees of success. What if the research doesn’t reach its end goal, but produces some useful insights? Relatedly, how could the agenda help other agendas if it does not directly achieve its aims? We’re uncertain how researchers think about this, and how best to model these effects.

Continuing through the model, towards the right of the figure we have a section about the “race” between IDA research and AI capabilities research, summarised as Amplification research will produce useful results in time for HLMI. We’re particularly unsure how to think about timelines to solve research agendas, and we are interested in either community feedback about their understanding of the timelines for success, or any insights on the topic.

Currently, we model research timelines using the node IDA research sufficient by year X, where X is affected by both Investment effect and randomness. This result is then modified by the node Extra time, which models the possibility that a “fire alarm” for HLMI is recognised and speeds up safety research in the years before HLMI, either through insight or increased resources. This time to sufficient IDA research is then compared to the timeline for HLMI (Timeline: HLMI by year X), where success is dependent on the time to IDA being less than the time to HLMI.
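A rough Monte Carlo sketch of this race structure is below. The distributions, the size of the investment effect, and the fire-alarm bonus are all made-up placeholders; the point is only to show how Investment effect, Extra time, and the HLMI timeline fit together.

```python
import random


def p_ida_in_time(n_samples: int = 100_000, extra_investment: bool = False) -> float:
    """Fraction of samples in which IDA research is sufficient before HLMI arrives.
    All numbers here are arbitrary placeholders, not estimates from our model."""
    wins = 0
    for _ in range(n_samples):
        # IDA research sufficient by year X: earlier with more investment,
        # plus random variation in research progress.
        investment_effect = -5 if extra_investment else 0
        ida_year = 2050 + investment_effect + random.gauss(0, 8)

        # Timeline: HLMI by year X.
        hlmi_year = 2055 + random.gauss(0, 12)

        # Extra time: a recognised "fire alarm" shortly before HLMI speeds up
        # safety research, effectively buying a few extra years.
        extra_time = 4 if random.random() < 0.3 else 0

        wins += (ida_year - extra_time) <= hlmi_year
    return wins / n_samples


print(p_ida_in_time(), p_ida_in_time(extra_investment=True))
```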

The competitiveness of IDA is modeled in a section toward the left of the figure. We break competitiveness down into Not prohibitively expensive to train, Competitive at runtime, and whether IDA scales to arbitrary capabilities (Team of aligned agents can be more capable than an individual agent and reach arbitrary levels of capability). All are modeled as being necessary for IDA to be competitive.

Finally, we have one node to represent whether IDA [is] outer aligned at optimum. This means that all possible models which are optimal according to the training objective are at least intent-aligned. Being outer aligned at optimum has been argued to defuse much of the threat from Goodhart’s Law, specifically its Causal and Extremal variants. So whether IDA is outer aligned at optimum matters a great deal for the agenda’s success at alignment, and is therefore a key uncertainty.

Putting it all together, our model considers IDA to be a workable solution for outer alignment if and only if IDA research wins in a “race” against unaligned HLMI, IDA is sufficiently competitive with other approaches to HLMI, and IDA is outer aligned at optimum. The output node IDA research successful then feeds into the Incorrigibility module at the top level of the model: that is, if IDA is successful and we have an intent-aligned HLMI, then it is corrigible. Corrigibility in turn increases the ability to Correct course as we go (and so on, as explained in the previous section).
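As a minimal sketch, and assuming the nodes combine as plain conjunctions (the actual model treats them probabilistically), this section of the model reduces to something like the following, with node names paraphrased from the figure:

```python
def ida_competitive(not_prohibitively_expensive_to_train: bool,
                    competitive_at_runtime: bool,
                    scales_to_arbitrary_capability: bool) -> bool:
    # All three competitiveness conditions are modeled as necessary.
    return (not_prohibitively_expensive_to_train
            and competitive_at_runtime
            and scales_to_arbitrary_capability)


def ida_research_successful(research_in_time_for_hlmi: bool,
                            competitive: bool,
                            outer_aligned_at_optimum: bool) -> bool:
    # This output feeds into the Incorrigibility module at the top level.
    return research_in_time_for_hlmi and competitive and outer_aligned_at_optimum
```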

Foundational Research

Our work on modeling Foundational Research has focused entirely on MIRI’s highly reliable agent designs (HRAD) research agenda, as this has had the most discussion in the AI alignment community (out of all the technical work in Foundational Research). Trying to model disagreements about the value of the HRAD agenda led to the post Plausible cases for HRAD work, and locating the crux in the “realism about rationality” debate. To summarize the post: one of the difficulties with modeling the value of HRAD research is that there seems to be disagreement about what the debate is even about. The post tries to organize the debate into three “possible worlds” about what the core disagreement is, and gives some reasons we might be in each world. The discussion in the comments did not lead to a consensus, so more work will probably be needed to make our thoughts precise enough to encode in the graph structure of the model.

The model below is a simpler substitute, pending further work on the above. Like the IDA model, this considers whether the research can succeed in time for HLMI. Besides that, there are two nodes about the possibility and difficulty of HRAD. The Foundational research [is] successful node feeds directly into the HLMI aligned ahead of time module shown previously.
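Under the same simplifying assumptions as the IDA sketch above, this reduced HRAD section looks roughly like the following (again, the node names are our paraphrases):

```python
def foundational_research_successful(hrad_is_possible: bool,
                                     hrad_tractable_despite_difficulty: bool,
                                     research_in_time_for_hlmi: bool) -> bool:
    # Feeds directly into the "HLMI aligned ahead of time" module.
    return (hrad_is_possible
            and hrad_tractable_despite_difficulty
            and research_in_time_for_hlmi)
```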

Transparency

Here, we are using the term “transparency” as shorthand for transparency and interpretability research that has long-term AI safety as a core motivation.

Transparency can be applied to whole classes of machine learning models, and may be a part of, or a complement to, several alignment techniques. In An overview of 11 proposals for building safe advanced AI, “transparency tools” form a key part of several proposals, for different reasons. More recently, Transparency Trichotomy analysed different ways that transparency can help us understand a model: via inspection, via training, or via architecture. The parts of the trichotomy can also work together, for instance by using transparency via inspection to get more informed oversight, which then feeds back into the model via training. So the current structure of our model, with its largely separate paths to impact for each agenda, does not seem well suited to transparency research. However, the Mesa-optimization module does incorporate some nodes on how transparency research may help detect deception (via inspection) or actively avoid deception (via training).

Theories of change for transparency helping to align HLMI have become clearer in published writing over the last couple of years. The post Chris Olah’s views on AGI safety offers several claims on theory of change, including:

  1. Transparency tools give you a mulligan—a chance to recognise a bad HLMI system, and try again with better understanding.

  2. Advances in transparency tools feed back into design. If we build systems with more understanding of how they work, then we can better understand their failure cases and how to avoid them.

  3. Careful analysis using transparency tools will clarify what we don’t understand too. Pointing out what we don’t understand will generate more concern about HLMI.

  4. Transparency tools help an overseer to give feedback not just on a system’s output, but also the process by which it produced that output.

  5. Advances in transparency tools (and demonstrations of their usefulness and appeal) help realign the ML community to focus on deliberate design and understanding.

From the above claims and discussion elsewhere, there are several apparent cruxes for transparency helping to align HLMI:

  1. Using transparency tools will not make enough progress (or any progress) on the “hard problem” of transparency. The hard problem is to figure out what it even means to understand a model, in a way that can save us from Goodhart’s Law and deception. As discussed in Transparency Trichotomy, transparency tools can themselves be gamed. There seems to be agreement that transparency tools will not get us all the way on this problem, but disagreement about how much they help—see e.g. this thread and this comment (bullet point 3).

  2. Similar to the above, though we are not sure if this is a distinct crux for anyone: transparency tools may make the flaws we are trying to detect harder to understand (discussed here and here), so there is too great a risk that the tools cause net harm.

  3. Transparency tools will not scale with the capabilities of HLMI and beyond—discussed here. The crux could be specifically about the amount of labour required to understand increasingly large models. It could also be about increasingly capable systems using increasingly alien abstractions. The linked post suggests that an amplified overseer could get around this problem, so the crux could actually be whether an amplified overseer can make transparency scale reliably in place of humans.

  4. The available transparency tools will not be useful for the kind of system that HLMI is (e.g. the work on Circuits in vision models will not transfer well to language models). This is like a horizontal version of the above scaling crux. Chris Olah raised this point himself.

Again, more work is needed to structure our model in a way that incorporates the above cruxes.

Other agendas

Other agendas or strategies which we have not yet modeled include:

Some of the above are more difficult to model because there is less writing that clearly outlines paths to impact or what success looks like. A potentially valuable project would be to make a clearer case to the community for how a given research agenda could be impactful, and to explain what its goals and specific approaches are.

Help from this community

Our tentative understanding suggests that more public effort to understand and clearly articulate safety agendas’ impacts, driving beliefs, and main points of disagreement would be really helpful. Examples of good work in this area are An overview of 11 proposals for building safe advanced AI, and Some AI research areas and their relevance to existential safety. This work can take a lot of effort and time, but some of the uncertainties highlighted in this post seem fairly easy to clarify through comments or smaller write-ups.

To illustrate the kind of information that would help, we have written the following condensed explanation of an imaginary agenda (the agenda and opinions are made-up—this is not quoting anyone):

This agenda aims to increase the chance that high-level machine intelligence (HLMI) is inner-aligned. More specifically, it will defuse the threat of deceptively aligned HLMI. The path from deceptively aligned HLMI to existential catastrophe is roughly: such a system would be deployed due to economic or other incentives and lack of apparent danger. It would also be capable enough to take the long-term future out of humanity’s control. While we have a very wide distribution over how humanity loses control, we expect a scenario similar to scenario 2 of What Failure Looks Like.

The specific outcome we are aiming for is <alignment procedure>. For this to succeed and be scalable, we rely on AI progressing like <current machine learning trends>. We expect the resulting AI to be competitive, with training time on the same order of magnitude and performance within 20% of the unaligned baseline.

With regard to timelines, we tentatively estimate that this work is at its most valuable if HLMI is produced in the medium term, neither in the next 10 years nor more than 30 years from now. With our current resources, we give a rough 5% chance of having a viable procedure within 5 years, and a 10% chance within 20 years. This increases to 20% and 30% respectively with <additional resources>. The remaining subjective probability of failure is split evenly among obstacles from the theory, project management, and external factors. This agenda relies on outer alignment being solved using <broad outer alignment approach>, but otherwise does not interact much with the problem we aim to solve.

Finally, as a way of quickly gathering opinions, we would love to see comments on the following: for any agenda you can think of, or one that you’re working on, what are the cruxes for working on it?

In the next post, we will look at the failure modes of HLMI and the final outcomes of our model.

Acknowledgements

Thanks to the rest of the MTAIR Project team for feedback and suggestions, as well as Adam Shimi and Neel Nanda for feedback on an early draft.


  1. ↩︎

    We define High-Level Machine Intelligence (HLMI) as machines that are capable, either individually or collectively, of performing almost all economically-relevant information-processing tasks that are performed by humans, or quickly (relative to humans) learning to perform such tasks. We are using the term “high-level machine intelligence” here instead of the related terms “human-level machine intelligence”, “artificial general intelligence”, or “transformative AI”, since these other terms are often seen as baking in assumptions about either the nature of intelligence or advanced AI that are not universally accepted.

  2. ↩︎

    For more on this distinction/issue, see this post.