Resources for AI Alignment Cartography
I want to make an actionable map of AI alignment.
After years of reading papers, blog posts, online exchanges, books, and the occasional hidden document about AI alignment and AI risk, and having extremely interesting conversations about it, most arguments I encounter now feel familiar at best, rehashed at worst. This should mean I have a good map of the field.
I have been frustrated, however, by how little actionable advice I could derive from this map. The message I took from most agendas was “here are the tractable sub-problems we will work on, and why they should be solved”. I didn’t find much justification for why they are critically important, or why one particular set of problems should be explored rather than another research group’s set.
You’ll find, in the following sections, my (current) selection of:
“cartography work”, to draw a map of relevant arguments and concepts;
research agendas, from research groups or individuals;
points of entry for newcomers.
Here are the caveats. The list is not exhaustive, though I did try to cover as many visible ideas as possible, and there is significant overlap and cross-referencing between the items listed here. Some references I consider useful (e.g. this) have not made the cut. I attempted to categorize the resources by focus, but a handful could have ended up in a different category; please don’t rely on the categorization too much.
My comments aren’t summaries, but rather justifications for why I included the reference. They also liberally reuse the original formulations. Please tell me if I have badly misrepresented any of the ideas.
All these references, and countless comments scattered all across LessWrong, the Alignment forum, and the Effective Altruism forum, will hopefully help me build something actionable, something that would let newcomers and experts explore the field with more clarity and make better decisions.
My short-term plan is to create minimal interactive explanations for the relevance of various propositions in AI alignment, with the option to question and expand their premises. I want to do this for a first few high-level ideas, and if it goes well, expand to a first full scenario.
The long-term plan is to map as many propositions and available scenarios as possible, to have a common framework in which to compare research directions. My intuition (to be challenged) is that there’s broad agreement in the field on most premises I could describe, and that we would benefit a lot from locating cruxes (e.g. here). My overarching motivation is to reduce research debt.
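To make the intended common framework a little more concrete, here is a minimal sketch of how propositions and their typed relationships (the kinds of links described in the diagram work below: support, conditional support, entailment) could be stored and queried. All class names, relation labels, and example claims are my own invention for illustration, not taken from any of the agendas listed here.

```python
from collections import defaultdict

class ArgumentMap:
    """Toy directed graph: claims linked by typed relationships."""

    def __init__(self):
        # premise -> list of (relation, conclusion) pairs
        self.edges = defaultdict(list)

    def link(self, premise, relation, conclusion):
        self.edges[premise].append((relation, conclusion))

    def premises_of(self, conclusion):
        """Claims that support or entail the given conclusion."""
        return [p for p, links in self.edges.items()
                for rel, c in links if c == conclusion]

m = ArgumentMap()
m.link("Advanced AI is feasible this century", "support",
       "AI alignment research is urgent")
m.link("Misaligned objectives cause catastrophic behavior", "entailment",
       "AI alignment research is urgent")

print(m.premises_of("AI alignment research is urgent"))
```

Even a structure this naive already supports asking “what premises does this conclusion rest on?”, which is the kind of query that crux-locating discussions need.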
The references here will be my first source of information. The second one would be discussions. If you are the author of one of the resources below and/or if you had more conversations about alignment-related arguments than you can remember, and want to share your insights, please reach out to me. I will do my best to answer in a timely manner.
Thanks to Adam Shimi, Alexis Carlier and Maxime Riché for reviewing drafts of this post and suggesting resources!
Argument mapping & reviews
Richard Ngo—January 2019
Splits the core motivating arguments for AI safety into six rough categories: maximizers being dangerous, target loading, prosaic alignment, human safety, misuses/vulnerabilities, and large impact.
Makes the case for more clarity around the fundamental ideas, analysis of the arguments, description of deployment scenarios, as well as making more explicit the assumptions behind research agendas.
Ben Cottier, Rohin Shah—August 2019
Creates a diagram linking hypotheses, scenarios, agendas, and catastrophic problems. It selects for debated and important arguments, does not claim to be comprehensive, and links ideas through diverse relationships (support, conditional support, entailment, etc.).
The post itself goes into more detail on the hypotheses, with resources listed for each one.
Buck Shlegeris—January 2020
The first section of the talk highlights the limits of heuristic arguments, and the usefulness of spelling out premises and making a deliberate effort to build compelling arguments for your personal stance.
The talk then proceeds to detail the speaker’s own argument for AI alignment work. Many commenters express their gratitude for all this exposition.
Ben Garfinkel—February 2019
Runs through the intuitive arguments behind AI risk prioritization: “AI as a big deal”, instability, lock-in, and accidents. Explains why each of them is not fully forceful, or has missing pieces/details.
Calls for these arguments to be fleshed out further, framing this as a neglected issue with potentially high value.
Tom Adamczewski—February 2019
Describes the evolution of AI risk arguments, from early descriptions of the alignment problem, to discontinuities as a premise for Bostrom’s Superintelligence, to alignment issues without discontinuity. Also describes non-alignment catastrophes, such as misuse risks.
Calls for clarification of arguments related to AI risk, especially on the subject of discontinuities, for better prioritization and to reduce costly misunderstandings.
Scenarios, forecasting & strategy
AI Impacts contributors—Since 2014
The website in general is dedicated to building AI forecasting resources, to inform arguments and decisions. Some of their content most closely related to AI risk arguments:
Takeaways from safety by default interviews—April 2020
Paul Christiano—March 2019
Describes two scenarios for AI catastrophe which don’t depend on a fast surprise takeover by a powerful AI system. Also notable for the level of engagement in the comments.
Kaj Sotala—February 2018
Breaks down a wide range of scenarios leading to (at least) catastrophic risk, by decomposing them into a variety of factors: strategic advantage, takeoff speed, autonomy acquisition, plurality of agents, etc.
Explores the idea of there being multiple combinations of factors which may be realized, each of them leading to a catastrophe (as opposed to a specific privileged scenario, which may receive too much focus).
Wei Dai, Daniel Kokotajlo—March 2019 (last updated March 2020)
Thirty-two (and counting) high-level scenarios for AI catastrophe. Wei Dai emphasizes that they aren’t disjunctive, as some scenarios may subsume or cause others. Daniel Kokotajlo (who maintains and updates the list) suggests it could be refined, expanded and reorganized.
Evan Hubinger—November 2019
Reports arguments on the importance of transparency and interpretability, and about how to improve the field of machine learning to make progress on these issues.
Alexey Turchin, David Denkenberger—January 2018
Lists and categorizes a wide range of catastrophic scenarios, varying over narrow or general AI, near-term or long-term horizons, misuse or accident, and many other factors, with references.
Agendas & reports focused on problem framing
Scott Garrabrant, Abram Demski—November 2018
Clarifies and motivates technical research stemming from the idea of embedded agents, where AI systems are no longer logically separated from their environment, implying modeling and self-modification issues, among others.
Describes the subproblems associated with that hypothesis: decision theory, embedded world-models, robust delegation, and subsystem alignment.
Allan Dafoe—August 2018
From the Center for the Governance of AI, Future of Humanity Institute. The agenda aims for breadth rather than depth, gathering as many questions relevant to AI governance as possible in 53 pages, and providing extensive references for further detail. It doesn’t focus on prioritization, nor on tractability/impact estimates.
The questions are divided into three clusters: technical landscape (modeling and forecasting AI progress, mapping AI capabilities, and technical AI safety), AI politics (transformation of government, of the job market, and regulatory concerns), and ideal AI governance (desirable values, institutions and scenarios).
Pedro A. Ortega, Vishal Maini, DeepMind—September 2018
Motivates DeepMind’s technical AI safety research, dividing it into three areas: specification (how to define the purpose of a system, whether explicitly designed or emergent), robustness (how to prevent, anticipate, defend against, and recover from perturbations), and assurance (how to understand, evaluate and actively control the behavior of a system).
The post defines a broad array of technical terms. The challenges are grounded in problems already present in current AI systems, and in simple environments (gridworlds).
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané—June 2016
Describes and motivates five technical safety research problems in machine learning-based systems, tractable through direct experimentation, in toy environments and/or small-scale models. All problems, sub-problems, and proposed abstract solutions are grounded in the existing machine learning literature.
The authors also argue for the increasing relevance of these problems as AI capabilities progress.
K. Eric Drexler—January 2019
Expands on Bostrom’s Superintelligence, through a mesh of forty high-level statements addressing the possibility of an intelligence explosion, the nature of advanced machine intelligence, the relationship between goals and intelligence, the use and control of advanced AI, and near/long-term considerations in AI safety & strategy.
The common underlying framing is a service-centered model of general intelligence, suggesting the integration of a diversity of task-oriented systems built incrementally, rather than mostly independent, self-improving superintelligent agents.
AI alignment reviews
Jacob Steinhardt—October 2019
Outlines four broad categories of technical work: technical alignment (how to create aligned AI), detecting failures (how to proactively check for alignment), methodological understanding (best practices), and system-building (how to do the previous three for large systems).
All problems (or sub-problems, for the first category) are explored through a high-level definition, motivation, solution desiderata, possible research avenues, personal takes, and references.
Richard Mallah—January 2017
Maps a large set of concepts and techniques in AI safety. The core content can be explored in this interactive visualization. The concepts are primarily organized through a hierarchical map, with secondary links for related ideas. All concepts are given high-level descriptions with references.
The stated purpose of the work is to provide a comprehensive map and a reference set of concepts for the field, to be extended through further research.
Rohin Shah—January 2020
The first section of the post is dedicated to recent work in basic AI risk analysis: new explorations of goal-directedness and comprehensive AI services, as well as new write-ups for or against AI risk (many of which are listed in this very document).
The rest of the post details recent work in the many sub-problems of AI alignment, noting that the over 300 references have been selected from a larger set of around 500 articles, clustered for readability (the reader shouldn’t take the chosen categorization as authoritative).
Sorts AI alignment work by origin rather than by topic. It highlights more specifically the agendas of the various research teams, and lists the collaborations between them. It also references a wide range of independent research.
In addition, the post details the funding of the various organizations involved in the field, as well as methodological comments on prioritization, funding, and research avenues.
Tom Everitt, Gary Lea, Marcus Hutter—May 2018
Focuses specifically on powerful AI systems: plausible conceptual models; forecasting of capability increase and risks; technical safety problems; design ideas and concepts; and public policy.
The paper explores safety problems shared by multiple research agendas, and summarizes a wide range of publications in the domain.
Future of Life Institute—November 2015 (first version)
Summarizes in an accessible way the very high-level case for AI alignment research, the most common naive objections and misconceptions, with further reading references.
Nick Bostrom—July 2014
Makes the case for the risk from superintelligent entities (not necessarily AI systems, though it is presented as its most probable origin). The book represents an early edited, long-form, philosophical introduction to numerous concepts such as the control problem, takeoff speeds, treacherous turn, instrumental convergence, decisive strategic advantage, value loading, and many more.
Stuart Russell—October 2019
Makes the case for the risk from advanced AI systems through failure of alignment. The book describes the continued progress in AI capabilities, reviews critically the major arguments around AI risk and forecasting, and argues for early safety research, showcasing significant hurdles to solve, and possible research avenues.
Holden Karnofsky—May 2016
Makes the philanthropic case for AI risk research, describing three classes of risk: misuse risk (malevolent, or value-locking use of powerful technology), accident risk (stemming typically from alignment failure) and other risks (such as structural effects due to automation, or dissemination of increasingly capable tools). Also explains several principles for prioritization work.
Paul Christiano—June 2019
Decomposes the then-current main approaches in AI alignment research by building a tree diagram and giving friendly high-level explanations of the ideas. The exploration is itself biased towards iterated amplification, which is put in its broader context.
Many authors—From 2014 to 2018
Provides detailed explanations for many concepts in AI Alignment, in an explorable way. Now in an archived state.
Robert Miles—Since 2014
Clear and friendly explanations of many concepts in AI alignment. For introductory material, it is best to start with his Computerphile videos, produced before the channel’s creation.
Technical agendas focused on possible solutions
Paul Christiano—October 2018
Describes iterated amplification, an alignment technique for powerful ML-based systems. Spells out the core hypotheses behind the validity of the technique. In the fourth section, details the associated research directions, and desiderata for AI alignment research.
Rohin Shah, Paul Christiano, Stuart Armstrong, Jacob Steinhardt, Owain Evans—October 2018
Investigates and motivates value learning, discussing the arguments stemming from the idea of a powerful AI system pursuing a particular utility function, using human behavior as a data source. Clearly restates the core arguments in the conclusion post.
Alex Turner—July 2019
Explores and motivates new ways to work with impact measures, a common component of various approaches in AI safety research, and how to think about scenarios where a powerful AI system makes wide-ranging decisions and actions.
Stuart Armstrong—June 2019
Clarifies and motivates a technical agenda for building specific assumptions into AI systems that would let them infer human preferences, as an instrumental goal for aligning with them.
G Gordon Worley III—March 2020
Defines a technical agenda for building a formal expression of the structure of human values, modeling them as the input of their decision process.
Vanessa Kosoy—July 2018
Details and motivates philosophically a technical agenda to ground AI alignment in statistical and computational learning theory, as well as algorithmic information theory.
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg—November 2018
DeepMind paper, which defines a more specific agenda around learning a reward function from interaction with a human user and optimizing it through reinforcement learning, in a way that scales to complex and general domains.
Geoffrey Irving, Amanda Askell—February 2019
Explores and motivates the debate approach to alignment, learning human values through experiments, asking questions and arbitrating between arguments. Tied to the AI safety via debate OpenAI paper.
Richard Ngo—October 2019
The first section, “Studying and understanding safety problems”, motivates this very project. The entire post is full of interesting problems to solve.
Victoria Krakovna—Regularly updated since August 2017
Provides a wealth of useful references, which significantly helped expand this list. Still receiving updates!
To reiterate, just above the comment box: I’m looking for insights. If your favorite reference is missing; if you spot a glaring error; if you have a strong opinion on research directions; if you share my frustrations, or disagree: do share! (Yes, the post is long, please don’t let that stop you from engaging.)