MATS AI Safety Strategy Curriculum

As part of the MATS Winter 2023-24 Program, scholars were invited to take part in a series of weekly discussion groups on AI safety strategy. Each strategy discussion focused on a specific crux we deemed relevant to prioritizing AI safety interventions and was accompanied by a reading list and suggested discussion questions. The discussion groups were facilitated by several MATS alumni and other AI safety community members and generally ran for 1-1.5 hours.

As assessed by our alumni reviewers, scholars in our Summer 2023 Program were much better at writing concrete plans for their research than they were at explaining their research’s theory of change. We think it is generally important for researchers, even those early in their career, to critically evaluate the impact of their work, to:

  • Choose high-impact research directions and career pathways;

  • Conduct adequate risk analyses to mitigate unnecessary safety hazards and avoid research with a poor safety-capabilities advancement ratio;

  • Discover blindspots and biases in their research strategy.

We expect that the majority of improvements to the above areas occur through repeated practice, ideally with high-quality feedback from a mentor or research peers. However, we also think that engaging with some core literature and discussing it with peers is beneficial. This is our attempt to create a list of core literature for AI safety strategy appropriate for the average MATS scholar, who should have completed the AISF Alignment Course.

We are not confident that the reading lists and discussion questions below are the best possible version of this project, but we thought they were worth publishing anyway. MATS welcomes feedback and suggestions for improvement.

Week 1: How will AGI arise?

What is AGI?

How large will models need to be and when will they be that large?

How far can current architectures scale?

What observations might make us update?

Suggested discussion questions

  • If you look at any of the outside view models linked in “Biological Anchors: The Trick that Might or Might Not Work” (e.g., Ajeya Cotra’s and Tom Davidson’s models), which of their quantitative estimates do you agree or disagree with? Do your disagreements make your timelines longer or shorter?

  • Do you disagree with the models used to forecast AGI? That is, rather than disagree with their estimates of particular variables, do you disagree with any more fundamental assumptions of the model? How does that change your timelines, if at all?

  • If you had to make a probabilistic model to forecast AGI, what quantitative variables would you use and what fundamental assumptions would your model rely on?

  • How should estimates of when AGI will happen change your research priorities, if at all? How about the research priorities of AI safety researchers in general? How about the research priorities of AI safety funders?

  • Will scaling LLMs + other kinds of scaffolding be enough to get to AGI? What about other paradigms? How many breakthroughs around as difficult as the transformer architecture are left, if any?

  • How should the kinds of safety research we invest in change depending on whether scaling LLMs + scaffolding will lead to AGI?

  • How should your research priorities change depending on how uncertain we are about what paradigm will lead to AGI, if at all? How about the priorities of AI safety researchers in general?

  • How could you tell if we were getting closer to AGI? What concrete observations would make you think we don’t have more than 10 years left? How about 5 years? How about 6 months?
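Before discussing the forecasting questions above, it may help to see how small these outside-view models can be at their core. Below is a toy sketch in the spirit of the biological-anchors approach: pick a compute anchor, estimate current training compute and its doubling time, and extrapolate. Every number here is an illustrative assumption for discussion purposes, not Cotra’s or Davidson’s actual estimates.

```python
import math

# All three inputs are assumptions chosen only to make the arithmetic concrete.
anchor_flop = 1e35          # assumed training compute needed for transformative AI (FLOP)
current_flop = 1e25         # assumed compute of today's largest training runs (FLOP)
doubling_time_years = 0.75  # assumed doubling time of effective training compute

# Extrapolate: how many doublings separate us from the anchor, and how long will they take?
doublings_needed = math.log2(anchor_flop / current_flop)
years_to_anchor = doublings_needed * doubling_time_years

print(f"{doublings_needed:.1f} doublings, ~{years_to_anchor:.0f} years to anchor")
```

Disagreements with a model like this can target the variables (the three inputs) or the structure (e.g., whether a fixed compute anchor is the right frame at all), which maps onto the first two discussion questions above.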

Week 2: Is the world vulnerable to AI?

Conceptual frameworks for risk: What kinds of technological advancements is the world vulnerable to in general?

Attack vectors: How might AI cause catastrophic harm to civilization?

AI’s unique threat: What properties of AI systems make them more dangerous than malicious human actors?

Suggested discussion questions

  • How do ML technologies interact with the unilateralist’s curse model? If you were going to use the unilateralist’s curse model to make predictions about what a world with more adoption of ML technologies would look like, what predictions would you make?

  • How do ML technologies interact with the vulnerable world hypothesis model? Which type in the typology of vulnerabilities section do ML technologies fall under? Are there any special considerations specific to ML technologies that should make us treat them as not just another draw from the urn?

  • What are the basic assumptions of the urn model of technological development? Are they plausible?

  • What are the basic assumptions of the unilateralist’s curse model? Are they plausible?

  • How is access to LLMs or other ML technologies different from access to the internet with regard to democratizing dual-use technologies, if it is at all?

  • Are there other non-obvious dual-use technologies that access to ML technologies might democratize?

  • In Karnofsky’s case for the claim that AI could defeat all of us combined, what are the basic premises? What sets of these premises would have to turn out false for the conclusion to no longer follow? How plausible is it that Karnofsky is making some sort of mistake? (Note that Karnofsky is explicitly arguing for a much weaker claim than “AI will defeat all of us combined”).

  • Suppose that we do end up with a world where we have ML systems that can get us a lot of anything we can measure. Would this be bad? Is it plausible that the benefits of such a technology could outweigh the costs? What are the costs exactly?

  • Optional: In “What failure looks like”, Paul Christiano paints a particular picture of a world in which the development and adoption of ML technologies goes poorly. Is this picture plausible? What are the assumptions that it rests on? Are these assumptions plausible? What would a world with fast ML advancement and adoption look like if it turns out that some set of these assumptions is false?

Week 3: How hard is AI alignment?

What is alignment?

How likely is deceptive alignment?

What is the distinction between inner and outer alignment? Is this a useful framing?

How many tries do we get, and what’s the argument for the worst case?

How much do alignment techniques for SOTA models generalize to AGI? What does that say about how valuable alignment research on present day SOTA models is?

Suggested discussion questions

  • What are the differences between Christiano’s concept of “intent alignment” and Arbital’s concept of “alignment for advanced agents”? What are the advantages and disadvantages of framing the problem in either way?

  • Is “gradient hacking” required for AI scheming?

  • What are the key considerations that make deceptive alignment more or less likely?

  • Is it likely that alignment techniques for current gen models will generalize to more capable models? Does it make sense to focus on alignment strategies that work for current gen models anyway? If so, why?

  • Suppose that we were able to get intent alignment in models that are just barely more intelligent than human AI safety researchers. Would that be enough? Why or why not?

  • Why is learned optimization inherently more dangerous than other kinds of learned algorithms?

  • Under what sorts of situations should we expect to encounter learned optimizers?

  • Imagine that you have full access to a model’s weights and an arbitrarily large but finite amount of compute and time. How could you tell whether the model contains a mesa-optimizer?

  • How is the concept of learned optimization related to the concepts of deceptive alignment or scheming? Can you have one without the other? If so, how?

  • Can you come up with stories where a model was trained and behaved in a way that was not intent aligned with its operators, but it’s not clear whether this counts as a case of inner misalignment or outer misalignment?

  • What are the most important points of disagreement between Eliezer Yudkowsky and Paul Christiano? How should we change how we prioritize different research programs depending on which side of such disagreements turns out correct?

Week 4: How should we prioritize AI safety research?

What is an “alignment tax” and how do we reduce it?

What kinds of alignment research will we be able to delegate to models if any?

How should we think about prioritizing work within the control paradigm in comparison to work within the alignment paradigm?

How should we prioritize alignment research in light of the amount of time we have left until transformative AI?

How should you prioritize your research projects in light of the amount of time you have left until transformative AI?

Suggested discussion questions

  • Look at the DAG from Paul Christiano’s talk (you can find an image version in the transcript of the talk). What nodes are missing from this DAG that seem important to you to highlight? Why are they important?

  • What nodes from Christiano’s DAG does your research feed into? Does it feed into several parts? The most obvious node for alignment research to feed into is the “reducing the alignment tax” node. Are there ways your research could also be upstream of other nodes? What about other research projects you are excited about?

    • It might be especially worth thinking about both of the above questions before you come to the discussion group.

  • How does research within the control paradigm fit into Christiano’s DAG?

  • What kinds of research make sense under the control paradigm which do not under the alignment paradigm?

  • It seems like there may be a chicken-and-egg problem for alignment plans that involve creating an AI to do alignment research: you use AI to align your AI, but the AI you use to align your AI needs to already be aligned. Is this a real problem? What could go wrong if you used an unaligned AI to align your AI? Are things likely to go wrong in this way? What are some ways that you could get around the problem?

  • Looking at Evan Hubinger’s interpretability/transparency tech tree, do you think there are nodes that are missing?

  • It’s been six months since Hubinger published his tech tree. Have we unlocked any new nodes on the tech tree since then?

  • What would a tech tree for a different approach, e.g., control, look like?

Week 5: What are AI labs doing?

How are the big labs approaching AI alignment and AI risk in general?

How are small non-profit research orgs approaching AI alignment and AI risk in general?

  • ARC: Mechanistic anomaly detection and ELK

  • METR: Landing page
    This is just the landing page of their website, but it’s a pretty good explanation of their high level strategy and priorities.

  • Redwood Research: Research Page
    You all already got a bunch of context on what Redwood is up to thanks to their lectures, but here is a link to their “Our Research” page on their website anyway.

  • Conjecture: Research Page

General summaries:

  • Larsen, Lifland - (My understanding of) what everyone is doing and why

    • This post is sort of old by ML standards, but I think it is currently still SOTA as an overview of what all the different research groups are doing. Maybe you should write a newer and better one.

    • This post is also very long. I recommend skimming it and keeping it as a reference rather than trying to read the whole thing in one sitting.

Suggested discussion questions

  • Are there any general differences that you notice between Anthropic, DeepMind, and OpenAI’s approaches to alignment or other safety mechanisms? How could you summarize these differences? Where are their points of emphasis different? Are their primary threat models different, and if so, how?

  • Are there any general differences that you notice between how the big labs (eg, Anthropic, OpenAI) and smaller non-profit orgs (eg, ARC, METR) approach alignment or other safety mechanisms? How could you summarize those differences? Where are their points of emphasis different?

  • Can you summarize the difference between Anthropic’s RSPs and OpenAI’s RDPs? Do the differences seem important or do they seem like a sort of narcissism of small differences kind of deal?

  • What is an ASL? How do Anthropic define ASL-2 and ASL-3? What commitments do Anthropic make regarding ASL-3 models?

  • Would the concept of ASL still make sense in OpenAI’s RDP framework? Would you have to adjust it in any way?

  • Do you think that Anthropic’s commitments on ASL-3 are too strict, not strict enough, or approximately right? Relatedly, do you expect that they will indeed follow through on these commitments before training/deploying an ASL-3 model?

  • How would you define ASL-4 if you had to? You as a group have 10 minutes and the world is going to use your definition. Go!

  • Ok cool, good job, now that you’ve done that, what commitments should labs make relating to training, deploying, selling fine tuning access to, etc, an ASL-4 model? You again have 10 minutes, and the world is depending on you. Good luck!

Week 6: What governance measures reduce AI risk?

Should we try to slow down or stop frontier AI research through regulation?

What AI governance levers exist?

What catastrophes uniquely occur in multipolar AGI scenarios?

Suggested discussion questions

  • The posts from lc, Karnofsky, and 1a3orn are in descending order of optimism about regulation. How optimistic are you about the counterfactual impact of regulation?

  • What are some things you could observe or experiments you could run that would change your mind?

  • 1a3orn’s post paints a particular picture of how regulation might go wrong. How plausible is this picture? What are the key factors that might make a world like this more or less likely given concerted efforts at regulation?

  • What are other ways that regulation might backfire, if there are any?

  • Regulation might be a good idea, but what about popular movement building? How might such efforts fail or make things worse?

  • If you got to decide when transformative AI or AGI will first be built, what would be the best time for that to happen? Imagine that you are changing little else about the world. Suppose that the delay is caused only by the difficulty of making AGI, or other contingent factors, like less investment.

  • Is your ideal date before or after you expect AGI to in fact be developed?

  • What key beliefs would you need to change your mind about for you to change your mind about when it would be best for AGI to be developed?

  • Under what assumptions does it make sense to model the strategic situation of labs doing frontier AI research as an arms race? Under what set of assumptions does it not make sense? What do the payoff matrices look like for the relevant actors under either set of assumptions?

  • Which assumptions do you think it makes the most sense to use? Which assumptions do you think the labs are most likely to use? If your answers to these questions are different, how do you explain those differences?

  • How do race dynamics contribute to the likelihood of ending up in a multipolar scenario like the ones described in Christiano and Critch’s posts, if they do at all?
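For the payoff-matrix questions above, here is a minimal Python sketch of one common way to formalize the race framing, as a prisoner’s dilemma between two labs. The labels, payoff numbers, and the `best_response` helper are all illustrative assumptions for discussion, not a claim about actual lab incentives; part of the exercise is deciding whether this structure fits at all.

```python
# Hypothetical payoffs: payoffs[(a_choice, b_choice)] = (payoff to Lab A, payoff to Lab B).
# Numbers are chosen only to exhibit the standard prisoner's-dilemma structure.
payoffs = {
    ("cautious", "cautious"): (3, 3),
    ("cautious", "race"):     (0, 5),
    ("race",     "cautious"): (5, 0),
    ("race",     "race"):     (1, 1),
}

def best_response(options, opponent_choice, me):
    """Return the option maximizing player `me`'s payoff, holding the opponent fixed."""
    def my_payoff(choice):
        pair = (choice, opponent_choice) if me == 0 else (opponent_choice, choice)
        return payoffs[pair][me]
    return max(options, key=my_payoff)

options = ["cautious", "race"]
# Under these assumed numbers, "race" is a dominant strategy for both labs,
# even though (cautious, cautious) Pareto-dominates (race, race).
for opponent in options:
    print("Best response to", opponent, "is", best_response(options, opponent, me=0))
```

Changing the assumptions in the discussion questions (e.g., adding a catastrophic downside to mutual racing, or communication and enforcement mechanisms) amounts to editing this matrix, which can make the dominant-strategy argument dissolve.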

Week 7: What do positive futures look like?

Note: attending discussion this week was highly optional.

What near-term positive advancements might occur if AI is well-directed?

What values might we want to actualize with the aid of AI?

What (very speculative) long-term futures seem possible and promising?

Suggested discussion questions

  • If everything goes well, how do we expect AI to change society in 10 years? What about 50 years?

  • What values would you like to actualize in the world with the aid of AI?

  • If we build sentient AIs, what rights should those AIs have? What about human minds that have been digitally uploaded?

  • Are positive futures harder to imagine than dystopias? If so, why would that be?

  • In the “ship of Theseus” thought experiment, the ship is replaced, plank-by-plank, until nothing remains of the original ship. If humanity’s descendants are radically different from current humans, do we consider their lives and values to be as meaningful as our own? How should we act if we can steer what kind of descendants emerge?

  • What current human values/practices could you imagine seeming morally repugnant to our distant descendants?

  • Would you hand control of the future over to a benevolent AI sovereign? Why/why not?

  • We might expect that especially over the long term, human values might change a lot. This is sometimes called “value drift”. Is there a reason to be more concerned about value drift caused by AIs or transhumans than from human civilization developing as it would otherwise?


Ronny Fernandez was chief author of the reading lists and discussion questions; Ryan Kidd planned, managed, and edited this project; and Juan Gil coordinated the discussion groups. Many thanks to the MATS alumni and other community members who helped as facilitators!