Some AI research areas and their relevance to existential safety

Introduction

This post is an overview of a variety of AI research areas in terms of how much I think contributing to and/or learning from those areas might help reduce AI x-risk. By research areas I mean “AI research topics that already have groups of people working on them and writing up their results”, as opposed to research “directions” in which I’d like to see these areas “move”.

I formed these views mostly pursuant to writing AI Research Considerations for Human Existential Safety (ARCHES). My hope is that my assessments in this post can be helpful to students and established AI researchers who are thinking about shifting into new research areas specifically with the goal of contributing to existential safety somehow. In these assessments, I find it important to distinguish between the following types of value:

  • The helpfulness of the area to existential safety, which I think of as a function of what services are likely to be provided as a result of research contributions to the area, and whether those services will be helpful to existential safety, versus

  • The educational value of the area for thinking about existential safety, which I think of as a function of how much a researcher motivated by existential safety might become more effective through the process of familiarizing with or contributing to that area, usually by focusing on ways the area could be used in service of existential safety.

  • The neglect of the area at various times, which is a function of how much technical progress has been made in the area relative to how much I think is needed.

Importantly:

  • The helpfulness to existential safety scores do not assume that your contributions to this area would be used only for projects with existential safety as their mission. This can negatively impact the helpfulness of contributing to areas that are more likely to be used in ways that harm existential safety.

  • The educational value scores are not about the value of an existential-safety-motivated researcher teaching about the topic, but rather, learning about the topic.

  • The “neglect” scores are not measuring whether there is enough “buzz” around the topic, but rather, whether there has been adequate technical progress in it.

Below is a table of all the areas I considered for this post, along with the entirely subjective “scores” I’ve given them. The rest of this post can be viewed simply as an elaboration/explanation of this table:

| Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect |
|---|---|---|---|---|---|---|
| Out of Distribution Robustness | Zero/Single | 1/10 | 4/10 | 5/10 | 3/10 | 1/10 |
| Agent Foundations | Zero/Single | 3/10 | 8/10 | 9/10 | 8/10 | 7/10 |
| Multi-agent RL | Zero/Multi | 2/10 | 6/10 | 5/10 | 4/10 | 0/10 |
| Preference Learning | Single/Single | 1/10 | 4/10 | 5/10 | 1/10 | 0/10 |
| Side-effect Minimization | Single/Single | 4/10 | 4/10 | 6/10 | 5/10 | 4/10 |
| Human-Robot Interaction | Single/Single | 6/10 | 7/10 | 5/10 | 4/10 | 3/10 |
| Interpretability in ML | Single/Single | 8/10 | 6/10 | 8/10 | 6/10 | 2/10 |
| Fairness in ML | Multi/Single | 6/10 | 5/10 | 7/10 | 3/10 | 2/10 |
| Computational Social Choice | Multi/Single | 7/10 | 7/10 | 7/10 | 5/10 | 4/10 |
| Accountability in ML | Multi/Multi | 8/10 | 3/10 | 8/10 | 7/10 | 5/10 |

The research areas are ordered from least-socially-complex to most-socially-complex. This roughly (though imperfectly) correlates with addressing existential safety problems of increasing importance and neglect, according to me. Correspondingly, the second column categorizes each area according to the simplest human/AI social structure it applies to:

Zero/Single: Zero-human / Single-AI scenarios

Zero/Multi: Zero-human / Multi-AI scenarios

Single/Single: Single-human / Single-AI scenarios

Single/Multi: Single-human / Multi-AI scenarios

Multi/Single: Multi-human / Single-AI scenarios

Multi/Multi: Multi-human / Multi-AI scenarios

Epistemic status & caveats

I developed the views in this post mostly over the course of the two years I spent writing and thinking about AI Research Considerations for Human Existential Safety (ARCHES). I make the following caveats:

  1. These views are my own, and while others may share them, I do not intend to speak in this post for any institution or group of which I am part.

  2. I am not an expert in Science, Technology, and Society (STS). Historically there hasn’t been much focus on existential risk within STS, which is why I’m not citing much in the way of sources from STS. However, from its name, STS as a discipline ought to be thinking a lot about AI x-risk. I think there’s a reasonable chance of improvement on this axis over the next 2-3 years, but we’ll see.

  3. I made this post with essentially zero deference to the judgement of other researchers. This is academically unusual, and prone to more variance in what ends up being expressed. It might even be considered rude. Nonetheless, I thought it might be valuable or at least interesting to stimulate conversation on this topic that is less filtered through patterns of deference to others. My hope is that people can become less inhibited in discussing these topics if my writing isn’t too “polished”. I might also write a more deferent and polished version of this post someday, especially if nice debates arise from this one that I want to distill into a follow-up post.

Defining our objectives

In this post, I’m going to talk about AI existential safety as distinct from both AI alignment and AI safety as technical objectives. A number of blogs seem to treat these terms as near-synonyms (e.g., LessWrong, the Alignment Forum), and I think that is a mistake, at least when it comes to guiding technical work for existential safety. First I’ll define these terms, and then I’ll elaborate on why I think it’s important not to conflate them.

AI existential safety (definition)

In this post, AI existential safety means “preventing AI technology from posing risks to humanity that are comparable to or greater than human extinction in terms of their moral significance.”

This is a bit more general than the definition in ARCHES. I believe this definition is fairly consistent with Bostrom’s usage of the term “existential risk”, and will have reasonable staying power as the term “AI existential safety” becomes more popular, because it directly addresses the question “What does this term have to do with existence?”.

AI safety (definition)

AI safety generally means getting AI systems to avoid risks, of which existential safety is an extreme special case with unique challenges. This usage is consistent with normal everyday usage of the term “safety” (dictionary.com/browse/safety), and will have reasonable staying power as the term “AI safety” becomes (even) more popular. AI safety includes safety for self-driving cars as well as for superintelligences, including issues that these topics do and do not share in common.

AI ethics (definition)

AI ethics generally refers to principles that AI developers and systems should follow. The “should” here creates a space for debate, whereby many people and institutions can try to impose their values on what principles become accepted. Often this means AI ethics discussions become debates about edge cases that people disagree about instead of collaborations on what they agree about. On the other hand, if there is a principle that all or most debates about AI ethics would agree on or take as a premise, that principle becomes somewhat easier to enforce.

AI governance (definition)

AI governance generally refers to identifying and enforcing norms for AI developers and AI systems themselves to follow. The question of which principles should be enforced often opens up debates about safety and ethics. Governance debates are a bit more action-oriented than purely ethical debates, such that more effort is focussed on enforcing agreeable norms relative to debating about disagreeable norms. Thus, AI governance, as an area of human discourse, is engaged with the problem of aligning the development and deployment of AI technologies with broadly agreeable human values. Whether AI governance is engaged with this problem well or poorly is, of course, a matter of debate.

AI alignment (definition)

AI alignment usually means “getting an AI system to {try | succeed} to do what a human person or institution wants it to do”. The inclusion of “try” or “succeed” respectively creates a distinction between intent alignment and impact alignment. This usage is consistent with normal everyday usage of the term “alignment” (dictionary.com/browse/alignment) as used to refer to alignment of values between agents, and is therefore relatively unlikely to undergo definition-drift as the term “AI alignment” becomes more popular. For instance,

  • (2002) “Alignment” was used this way in 2002 by Daniel Shapiro and Ross Shachter, in their AAAI conference paper User/Agent Value Alignment, the first paper to introduce the concept of alignment into AI research. This work was not motivated by existential safety as far as I know, and is not cited in any of the more recent literature on “AI alignment” motivated by existential safety, though I think it got off to a reasonably good start in defining user/agent value alignment.

  • (2014) “Alignment” was used this way in the technical problems described by Nate Soares and Benya Fallenstein in Aligning Superintelligence with Human Interests: A Technical Research Agenda. While the authors’ motivation is clearly to serve the interests of all humanity, the technical problems outlined are all about impact alignment in my opinion, with the possible exception of what they call “Vingean Reflection” (which is necessary for a subagent of society thinking about society).

  • (2018) “Alignment” is used this way by Paul Christiano in his post Clarifying AI Alignment, which is focussed on intent alignment.

A broader meaning of “AI alignment” that is not used here

There is another, different usage of “AI alignment”, which refers to ensuring that AI technology is used and developed in ways that are broadly aligned with human values. I think this is an important objective that is deserving of a name to call more technical attention to it, and perhaps this is the spirit in which the “AI alignment forum” is so-titled. However, the term “AI alignment” already has poor staying-power for referring to this objective in technical discourse outside of a relatively cloistered community, for two reasons:

  1. As described above, “alignment” already has a relatively clear technical meaning that AI researchers have already gravitated toward, and which is also consistent with the natural-language meaning of the term “alignment”, and

  2. AI governance, at least in democratic states, is basically already about this broader problem. If one wishes to talk about AI governance that is beneficial to most or all humans, “humanitarian AI governance” is much clearer and more likely to stick than “AI alignment”.

Perhaps “global alignment”, “civilizational alignment”, or “universal AI alignment” would make sense to distinguish this concept from the narrower meaning that alignment usually takes on in technical settings. In any case, for the duration of this post, I am using “alignment” to refer to its narrower, technically prevalent meaning.

Distinguishing our objectives

As promised, I will now elaborate on why it’s important not to conflate the objectives above. Some people might feel that these arguments are about how important these concepts are, but I’m mainly trying to argue about how importantly different they are. By analogy: while knives and forks are both important tools for dining, they are not usable interchangeably.

Safety vs existential safety (distinction)

“Safety” is not robustly usable as a synonym for “existential safety”. It is true that AI existential safety is literally a special case of AI safety, for the simple reason that avoiding existential risk is a special case of avoiding risk. And, it may seem useful for coalition-building purposes to unite people under the phrase “AI safety” as a broadly agreeable objective. However, I think we should avoid declaring to ourselves or others that “AI safety” will or should always be interpreted as meaning “AI existential safety”, for several reasons:

  1. Using these terms as synonyms will have very little staying power as AI safety research becomes (even) more popular.

  2. AI existential safety is deserving of direct attention that is not filtered through a lens of discourse that confuses it with self-driving car safety.

  3. AI safety in general is deserving of attention as a broadly agreeable principle around which people can form alliances and share ideas.

Alignment vs existential safety (distinction)

Some people tend to use these terms as near-synonyms; however, I think this usage has some important problems:

  1. Using “alignment” and “existential safety” as synonyms will have poor staying-power as the term “AI alignment” becomes more popular. Conflating them will offend both the people who want to talk about existential safety (because they think it is more important and “obviously what we should be talking about”) as well as the people who want to talk about AI alignment (because they think it is more important and “obviously what we should be talking about”).

  2. AI alignment refers to a cluster of technically well-defined problems that are important to work on for numerous reasons, and deserving of a name that does not secretly mean “preventing human extinction” or similar.

  3. AI existential safety (I claim) also refers to a technically well-definable problem that is important to work on, and deserving of a name that does not secretly mean “getting systems to do what the user is asking”.

  4. AI alignment is not trivially helpful to existential safety, and efforts to make it helpful require a certain amount of societal-scale steering to guide them. If we treat these terms as synonyms, we impoverish our collective awareness of ways in which AI alignment solutions could pose novel problems for existential safety.

This last point gets its own section.

AI alignment is inadequate for AI existential safety

Around 50% of my motivation for writing this post is my concern that progress in AI alignment, which is usually focused on “single/single” interactions (i.e., alignment for a single human stakeholder and a single AI system), is inadequate for ensuring existential safety for advancing AI technologies. Indeed, among problems I can currently see in the world that I might have some ability to influence, addressing this issue is currently one of my top priorities.

The reason for my concern here is pretty simple to state, via the following two diagrams:

Of course, understanding and designing useful and modular single/single interactions is a good first step toward understanding multi/multi interactions, and many people (including myself) who think about AI alignment are thinking about it as a stepping stone to understanding the broader societal-scale objective of ensuring existential safety.

However, this pattern mirrors the situation AI capabilities research was following before safety, ethics, and alignment began surging in popularity. Consider that most AI (construed to include ML) researchers are developing AI capabilities as stepping stones toward understanding and deploying those capabilities in safe and value-aligned applications for human users. Despite this, over the past decade there has been a growing sense among AI researchers that capabilities research has not been sufficiently forward-looking in terms of anticipating its role in society, including the need for safety, ethics, and alignment work. This general concern can be seen emanating not only from AGI-safety-oriented groups like those at DeepMind, OpenAI, MIRI, and in academia, but also from AI-ethics-oriented groups, such as the ACM Future of Computing Academy:

https://acm-fca.org/2018/03/29/negativeimpacts/

Just as folks interested in AI safety and ethics needed to start thinking beyond capabilities, folks interested in AI existential safety need to start thinking beyond alignment. The next section describes what I think this means for technical work.

Anticipating, legitimizing and fulfilling governance demands

The main way I can see present-day technical research benefitting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years. In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas. Here’s what I mean in more detail:

By governance demands, I’m referring to social and political pressures to ensure AI technologies will produce or avoid certain societal-scale effects. Governance demands include pressures like “AI technology should be fair”, “AI technology should not degrade civic integrity”, or “AI technology should not lead to human extinction.” For instance, Twitter’s recent public decision to maintain a civic integrity policy can be viewed as a response to governance demand from its own employees and surrounding civic society.

Governance demand is distinct from consumer demand, and it yields a different kind of transaction when the demand is met. In particular, when a tech company fulfills a governance demand, the company legitimizes that demand by providing evidence that it is possible to fulfill. This might require the company to break ranks with other technology companies who deny that the demand is technologically achievable.

By legitimizing governance demands, I mean making it easier to establish common knowledge that a governance demand is likely to become a legal or professional standard. But how can technical research legitimize demands from a non-technical audience?

The answer is to genuinely demonstrate in advance that the governance demands are feasible to meet. Passing a given professional standard or legislation usually requires the demands in it to be “reasonable” in terms of appearing to be technologically achievable. Thus, computer scientists can help legitimize a governance demand by anticipating the demand in advance, and beginning to publish solutions for it. My position here is not that the solutions should be exaggerated in their completeness, even if that will increase ‘legitimacy’; I argue only that we should focus energy on finding solutions that, if communicated broadly and truthfully, will genuinely raise confidence that important governance demands are feasible. (Without this ethic against exaggeration, common knowledge in the legitimacy of legitimacy itself is degraded, which is bad, so we shouldn’t exaggerate.)

This kind of work can make a big difference to the future. If the algorithmic techniques needed to meet a given governance demand are 10 years of research away from discovery—as opposed to just 1 year—then it’s easier for large companies to intentionally or inadvertently maintain a narrative that the demand is unfulfillable and therefore illegitimate. Conversely, if the algorithmic techniques to fulfill the demand already exist, it’s a bit harder (though still possible) to deny the legitimacy of the demand. Thus, CS researchers can legitimize certain demands in advance, by beginning to prepare solutions for them.

I think this is the most important kind of work a computer scientist can do in service of existential safety. For instance, I view ML fairness and interpretability research as responding to existing governance demand, which legitimizes the cause of AI governance itself, which is hugely important. Furthermore, I view computational social choice research as addressing an upcoming governance demand, which is even more important.

My hope in writing this post is that some of the readers here will start trying to anticipate AI governance demands that will arise over the next 10-30 years. In doing so, we can begin to think about technical problems and solutions that could genuinely legitimize and fulfill those demands when they arise, with a focus on demands whose fulfillment can help stabilize society in ways that mitigate existential risks.

Research Areas

Alright, let’s talk about some research!

Out of distribution robustness (OODR)

| Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect |
|---|---|---|---|---|---|---|
| Out of Distribution Robustness | Zero/Single | 1/10 | 4/10 | 5/10 | 3/10 | 1/10 |

This area of research is concerned with avoiding risks that arise from systems interacting with contexts and environments that are changing significantly over time, such as from training time to testing time, from testing time to deployment time, or from controlled deployments to uncontrolled deployments.
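
To make the train-to-deployment shift concrete, here is a minimal sketch of the kind of failure OODR research tries to prevent. It is my own illustration (not drawn from the papers discussed below), using a made-up dataset in which a spurious feature is predictive at training time and uninformative at deployment time:

```python
# A toy train/test distribution shift: a model that leans on a spuriously
# correlated feature looks great on held-out data from the training
# distribution, then degrades once that correlation disappears.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, spurious_corr):
    """Labels depend weakly on feature 0; feature 1 matches the label
    with probability `spurious_corr` (a shortcut the model can exploit)."""
    y = rng.integers(0, 2, size=n)
    signal = y + rng.normal(0, 2.0, size=n)          # weak true signal
    agree = rng.random(n) < spurious_corr
    spurious = np.where(agree, y, 1 - y) + rng.normal(0, 0.1, size=n)
    return np.column_stack([signal, spurious]), y

X_train, y_train = make_data(5000, spurious_corr=0.95)  # shortcut works in training
X_shift, y_shift = make_data(5000, spurious_corr=0.50)  # shortcut breaks at deployment

model = LogisticRegression().fit(X_train, y_train)
print("in-distribution accuracy:", model.score(*make_data(5000, 0.95)))
print("shifted accuracy:        ", model.score(X_shift, y_shift))
```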

OODR (un)helpfulness to existential safety:

Contributions to OODR research are not particularly helpful to existential safety in my opinion, for a combination of two reasons:

  1. Progress in OODR will mostly be used to help roll out more AI technologies into active deployment more quickly, and

  2. Research in this area usually does not involve deep or lengthy reflections about the structure of society and human values and interactions, which I think makes this field sort of collectively blind to the consequences of the technologies it will help build.

I think this area would be more helpful if it were more attentive to the structure of the multi-agent context that AI systems will be in. Professor Tom Dietterich has made some attempts to shift thinking on robustness to be more attentive to the structure of robust human institutions, which I think is a good step:

Unfortunately, the above paper has only 8 citations at the time of writing (very little for AI/ML), and there does not seem to be much else in the way of publications that address societal-scale or even institutional-scale robustness.

OODR educational value:

Studying and contributing to OODR research is of moderate educational value for people thinking about x-risk, in my opinion. Speaking for myself, it helps me think about how society as a whole is receiving a changing distribution of inputs from its environment (which society itself is creating). As human society changes, the inputs to AI technologies will change, and we want the existence of human society to be robust to those changes. I don’t think most researchers in this area think about it in that way, but that doesn’t mean you can’t.

OODR neglect:

Robustness to changing environments has never been a particularly neglected concept in the history of automation, and it is not likely to ever become neglected, because myopic commercial incentives push so strongly in favor of progress on it. Specifically, robustness of AI systems is essential for tech companies to be able to roll out AI-based products and services, so there is no lack of incentive for the tech industry to work on robustness. In reinforcement learning specifically, robustness has been somewhat neglected, although less so now than in 2015, partly thanks to AI safety (broadly construed) taking off. I think by 2030 this area will be even less neglected, even in RL.

OODR exemplars:

Recent exemplars of high value to existential safety, according to me:

Recent exemplars of high educational value, according to me:

Agent foundations (AF)

| Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect |
|---|---|---|---|---|---|---|
| Agent Foundations | Zero/Single | 3/10 | 8/10 | 9/10 | 8/10 | 7/10 |

This area is concerned with developing and investigating fundamental definitions and theorems pertaining to the concept of agency. This often includes work in areas such as decision theory, game theory, and bounded rationality. I’m going to write more for this section because I know more about it and think it’s pretty important to “get right”.
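
As a tiny illustration of the kind of object this area formalizes, here is a hypothetical sketch of mine (not taken from any agent foundations paper) that brute-forces the pure-strategy Nash equilibria of a two-player game:

```python
# A two-player game checked for pure-strategy Nash equilibria by brute force.
import numpy as np

# Payoff matrices for a Prisoner's Dilemma: rows = player 1's action,
# columns = player 2's action, actions are 0 = cooperate, 1 = defect.
P1 = np.array([[3, 0],
               [5, 1]])
P2 = P1.T  # symmetric game

def pure_nash_equilibria(P1, P2):
    """Return all (row, col) action pairs where neither player can gain
    by unilaterally deviating."""
    equilibria = []
    for i in range(P1.shape[0]):
        for j in range(P1.shape[1]):
            best_row = P1[i, j] >= P1[:, j].max()   # player 1 can't improve
            best_col = P2[i, j] >= P2[i, :].max()   # player 2 can't improve
            if best_row and best_col:
                equilibria.append((i, j))
    return equilibria

print(pure_nash_equilibria(P1, P2))  # [(1, 1)]: mutual defection
```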

AF (un)helpfulness to existential safety:

Contributions to agent foundations research are key to the foundations of AI safety and ethics, but are also potentially misusable. Thus, arbitrary contributions to this area are not necessarily helpful, while targeted contributions aimed at addressing real-world ethical problems could be extremely helpful. Here is why I believe this:

I view agent foundations work as looking very closely at the fundamental building blocks of society, i.e., agents and their decisions. It’s important to understand agents and their basic operations well, because we’re probably going to produce (or allow) a very large number of them to exist/occur. For instance, imagine any of the following AI-related operations happening at least 1,000,000 times (a modest number given the current world population):

  1. A human being delegates a task to an AI system to perform, thereby ceding some control over the world to the AI system.

  2. An AI system makes a decision that might yield important consequences for society, and acts on it.

  3. A company deploys an AI system into a new context where it might have important side effects.

  4. An AI system builds or upgrades another AI system (possibly itself) and deploys it.

  5. An AI system interacts with another AI system, possibly yielding externalities for society.

  6. An hour passes where AI technology is exerting more control over the state of the Earth than humans are.

Suppose there’s some class of negative outcomes (e.g. human extinction) that we want to never occur as a result of any of these operations. In order to be just 55% sure that all of these 1,000,000 operations will be safe (i.e., avoid the negative outcome class), on average (on a log scale) we need to be at least 99.99994% sure that each instance of the operation is safe (i.e., will not precipitate the negative outcome). Similarly, for any accumulable quantity of “societal destruction” (such as risk, pollution, or resource exhaustion), in order to be sure that these operations will not yield “100 units” of societal destruction, we need each operation on average to produce at most “0.0001 units” of destruction.*

(*Would-be-footnote: Incidentally, the main reason I think OODR research is educationally valuable is that it can eventually help with applying agent foundations research to societal-scale safety. Specifically: how can we know if one of the operations (1)-(6) above is safe to perform 1,000,000 times, given that it was safe the first 1,000 times we applied it in a controlled setting, but the setting is changing over time? This is a special case of an OODR question.)
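
For concreteness, here is a quick check of the arithmetic above (my own sketch, not from ARCHES):

```python
# Per-operation safety needed for 55% confidence over 1,000,000 operations,
# and the per-operation "destruction budget" implied by a 100-unit cap.
n_ops = 1_000_000

per_op_safety = 0.55 ** (1 / n_ops)      # p such that p**n_ops == 0.55
print(f"per-operation safety needed: {per_op_safety:.8%}")   # ~99.999940%

per_op_budget = 100 / n_ops              # equal share of 100 units of harm
print(f"per-operation destruction budget: {per_op_budget} units")  # 0.0001
```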

Unfortunately, understanding the building blocks of society can also allow the creation of potent societal forces that would harm society. For instance, understanding human decision-making extremely well might help advertising companies to control public opinion to an unreasonable degree (which arguably has already happened, even with today’s rudimentary agent models), or it might enable the construction of a super-decision-making system that is misaligned with human existence.

That said, I don’t think this means you have to be super careful about information security around agent foundations work, because in general it’s not easy to communicate fundamental theoretical results in research, let alone by accident.

Rather, my recommendation for maximizing the positive value of work in this area is to apply the insights you get from it to areas that make it easier to represent societal-scale moral values in AI. E.g., I think applications of agent foundations results to interpretability, fairness, computational social choice, and accountability are probably net good, whereas applications to speed up arbitrary ML capabilities are not obviously good.

AF educational value:

Studying and contributing to agent foundations research has the highest educational value for thinking about x-risk among the research areas listed here, in my opinion. The reason is that agent foundations research does the best job of questioning potentially faulty assumptions underpinning our approach to existential safety. In particular, I think our understanding of how to safely integrate AI capabilities with society is increasingly contingent on our understanding of agent foundations work as defining the building blocks of society.

AF neglect:

This area is extremely neglected in my opinion. I think around 50% of the progress in this area, worldwide, happens at MIRI, which has a relatively small staff of agent foundations researchers. While MIRI has grown over the past 5 years, agent foundations work in academia hasn’t grown much, and I don’t expect it to grow much by default (though perhaps posts like this might change that default).

AF exemplars:

Below are recent exemplars of agent foundations work that I think is of relatively high value to existential safety, mostly via their educational value to understanding the foundations of how agents work (“agent foundations”). The work is mostly from three main clusters: MIRI, Vincent Conitzer’s group at Duke, and Joe Halpern’s group at Cornell.

Multi-agent reinforcement learning (MARL)

| Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect |
|---|---|---|---|---|---|---|
| Multi-agent RL | Zero/Multi | 2/10 | 6/10 | 5/10 | 4/10 | 0/10 |

MARL is concerned with training multiple agents to interact with each other and solve problems using reinforcement learning. There are a few varieties to be aware of:

  • Cooperative vs competitive vs adversarial tasks: do the agents all share a single objective, or separate objectives that are imperfectly aligned, or completely opposed (zero-sum) objectives?

  • Centralized training vs decentralized training: is there a centralized process that observes the agents and controls how they learn, or is there a separate (private) learning process for each agent?

  • Communicative vs non-communicative: is there a special channel the agents can use to generate observations for each other that are otherwise inconsequential, or are all observations generated in the course of consequential actions?

I think the most interesting MARL research involves decentralized training for competitive objectives in communicative environments, because this set-up is the most representative of how AI systems from diverse human institutions are likely to interact.
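
As a minimal illustration of that set-up, here is a hypothetical sketch of mine (not from the MARL literature cited here) in which two independently learning agents with imperfectly aligned objectives repeatedly play a Prisoner's Dilemma, each updating a private action-value table:

```python
# A toy set-up: decentralized training, imperfectly aligned (competitive)
# objectives. Each agent keeps a private action-value table and updates it
# only from its own reward (a bandit-style update, with no access to the
# other agent's learning process).
import numpy as np

rng = np.random.default_rng(0)
PAYOFFS = {  # (action_1, action_2) -> (reward_1, reward_2); 0 = cooperate, 1 = defect
    (0, 0): (3, 3), (0, 1): (0, 5),
    (1, 0): (5, 0), (1, 1): (1, 1),
}

q = [np.zeros(2), np.zeros(2)]   # one private value table per agent
alpha, epsilon = 0.1, 0.1        # learning rate, exploration rate

for step in range(20_000):
    actions = tuple(
        int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(q[i]))
        for i in range(2)
    )
    rewards = PAYOFFS[actions]
    for i in range(2):           # decentralized: each agent sees only its own reward
        q[i][actions[i]] += alpha * (rewards[i] - q[i][actions[i]])

print("agent 0 values (cooperate, defect):", q[0])  # defection typically wins out
print("agent 1 values (cooperate, defect):", q[1])
```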

MARL (un)helpfulness to existential safety:

Contributions to MARL research are mostly not very helpful to existential safety in my opinion, because MARL’s most likely use case will be to help companies to deploy fleets of rapidly interacting machines that might pose risks to human society. The MARL projects with the greatest potential to help are probably those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment, because of its potential to minimize destructive conflicts between fleets of AI systems that cause collateral damage to humanity. That said, even this area of research risks making it easier for fleets of machines to cooperate and/or collude at the exclusion of humans, increasing the risk of humans becoming gradually disenfranchised and perhaps replaced entirely by machines that are better and faster at cooperation than humans.

MARL educational value:

I think MARL has a high educational value, because it helps researchers to observe directly how difficult it is to get multi-agent systems to behave well. I think most of the existential risk from AI over the next decades and centuries comes from the incredible complexity of behaviors possible from multi-agent systems, and from underestimating that complexity before it takes hold in the real world and produces unexpected negative side effects for humanity.

MARL neglect:

MARL was somewhat neglected 5 years ago, but has picked up a lot. I suspect MARL will keep growing in popularity because of its value as a source of curricula for learning algorithms. I don’t think it is likely to become more civic-minded, unless arguments along the lines of this post lead to a shift of thinking in the field.

MARL exemplars:

Recent exemplars of high educational value, according to me:

Preference learning (PL)

| Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect |
|---|---|---|---|---|---|---|
| Preference Learning | Single/Single | 1/10 | 4/10 | 5/10 | 1/10 | 0/10 |

This area is concerned with learning about human preferences in a form usable for guiding the policies of artificial agents. In an RL (reinforcement learning) setting, preference learning is often called reward learning, because the learned preferences take the form of a reward function for training an RL system.
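
To make the reward-learning framing concrete, here is a minimal, hypothetical sketch of mine (not from the works referenced below) that fits a linear reward function from simulated pairwise preference comparisons using a Bradley-Terry style model:

```python
# Reward learning from pairwise comparisons: fit a linear reward r(x) = w·x
# so that preferred items get higher reward, via a logistic (Bradley-Terry) model.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])            # hidden "human" reward weights

# Simulated comparison data: for each pair (x_a, x_b), the human prefers the
# item with higher true reward.
X_a = rng.normal(size=(500, 3))
X_b = rng.normal(size=(500, 3))
prefers_a = (X_a @ true_w > X_b @ true_w).astype(float)

# Gradient ascent on the Bradley-Terry log-likelihood over feature differences.
w = np.zeros(3)
diffs = X_a - X_b
for _ in range(2000):
    p_a = 1 / (1 + np.exp(-diffs @ w))          # model's P(prefer a)
    w += 0.05 * diffs.T @ (prefers_a - p_a) / len(diffs)

print("recovered reward weights (up to scale):", w / np.linalg.norm(w))
print("true weights (normalized):             ", true_w / np.linalg.norm(true_w))
```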

PL (un)helpfulness to existential safety:

Contributions to preference learning are not particularly helpful to existential safety in my opinion, because their most likely use case is for modeling human consumers just well enough to create products they want to use and/or advertisements they want to click on. Such advancements will be helpful to rolling out usable tech products and platforms more quickly, but not particularly helpful to existential safety.*

Preference learning is of course helpful to AI alignment, i.e., the problem of getting an AI system to do something a human wants. Please refer back to the sections above on Defining our objectives and Distinguishing our objectives for an elaboration of how this is not the same as AI existential safety. In any case, I see AI alignment in turn as having two main potential applications to existential safety:

  1. AI alignment is useful as a metaphor for thinking about how to align the global effects of AI technology with human existence, a major concern for AI governance at a global scale, and

  2. AI alignment solutions could be used directly to govern powerful AI technologies designed specifically to make the world safer.

While many researchers interested in AI alignment are motivated by (1) or (2), I find these pathways of impact problematic. Specifically,

  • (1) elides the complexities of multi-agent interactions I think are likely to arise in most realistic futures, and I think the most difficult to resolve existential risks arise from those interactions.

  • (2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

Moreover, I believe contributions to AI alignment are also generally unhelpful to existential safety, for the same reasons as preference learning. Specifically, progress in AI alignment hastens the pace at which high-powered AI systems will be rolled out into active deployment, shortening society’s headway for establishing international treaties governing the use of AI technologies.

Thus, the existential safety value of AI alignment research in its current technical formulations—and preference learning as a subproblem of it—remains educational in my view.*

(*Would-be-footnote: I hope no one will be too offended by this view. I did have some trepidation about expressing it on the “alignment” forum, but I think I should voice these concerns anyway, for the following reason. In 2011, after some months of reflection on a presentation by Andrew Ng, I came to believe that deep learning was probably going to take off, and that, contrary to Ng’s opinion, this would trigger a need for a lot of AI alignment work in order to make the technology safe. This feeling of worry is what triggered me to cofound CFAR and start helping to build a community that thinks more critically about the future. I currently have a similar feeling of worry toward preference learning and AI alignment, i.e., that it is going to take off and trigger a need for a lot more “AI civility” work that seems redundant or “too soon to think about” for a lot of AI alignment researchers today, the same way that AI researchers said it was “too soon to think about” AI alignment. To the extent that I think I was right to be worried about AI progress kicking off in the decade following 2011, I think I’m right to be worried again now about preference learning and AI alignment (in its narrow and socially-simplistic technical formulations) taking off in the 2020’s and 2030’s.)

PL educational value:

Studying and making contributions to preference learning is of moderate educational value for thinking about existential safety in my opinion. The reason is this: if we want machines to respect human preferences—including our preference to continue existing—we may need powerful machine intelligences to understand our preferences in a form they can act on. Of course, being understood by a powerful machine is not necessarily a good thing. But if the machine is going to do good things for you, it will probably need to understand what “good for you” means. In other words, understanding preference learning can help with AI alignment research, which can help with existential safety. And if existential safety is your goal, you can try to target your use of preference learning concepts and methods toward that goal.

PL neglect:

Preference learning has always been crucial to the advertising industry, and as such it has not been neglected in recent years. For the same reason, it’s also not likely to become neglected. Its application to reinforcement learning is somewhat new, however, because until recently there was much less active research in reinforcement learning. In other words, recent interest in reward learning is mainly a function of increased interest in reinforcement learning, rather than increased interest in preference learning. If new learning paradigms supersede reinforcement learning, preference learning for those paradigms will not be far behind.

(This is not a popular opinion; I apologize if I have offended anyone who believes that progress in preference learning will reduce existential risk, and I certainly welcome debate on the topic.)

PL exemplars:

Recent works of significant educational value, according to me:

Human-robot interaction (HRI)

| Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect |
|---|---|---|---|---|---|---|
| Human-Robot Interaction | Single/Single | 6/10 | 7/10 | 5/10 | 4/10 | 3/10 |

HRI research is concerned with designing and optimizing patterns of interaction between humans and machines—usually actual physical robots, but not always.

HRI helpfulness to existential safety:

On net, I think AI/ML would be better for the world if most of its researchers pivoted from general AI/ML into HRI, simply because it would force more AI/ML researchers to more frequently think about real-life humans and their desires, values, and vulnerabilities. Moreover, I think it reasonably likely (as in, >1% likely) that such a pivot might actually happen if, say, 100 more researchers make this their goal.

For this reason, I think contributions to this area today are pretty solidly good for existential safety, although not perfectly so: HRI research can also be used to deceive humans, which can degrade societal-scale honesty norms, and I’ve seen HRI research targeting precisely that. However, my model of readers of this blog is that they’d be unlikely to contribute to those parts of HRI research, such that I feel pretty solid about recommending contributions to HRI.

HRI educational value:

I think HRI work is of unusually high educational value for thinking about existential safety, even among other topics in this post. The reason is that, by working with robots, HRI work is forced to grapple with high-dimensional and continuous state spaces and action spaces that are too complex for the human subjects involved to consciously model. This, to me, crucially mirrors the relationship between future AI technology and human society: humanity, collectively, will likely be unable to consciously grasp the full breadth of states and actions that our AI technologies are transforming and undertaking for us. I think many AI researchers outside of robotics are mostly blind to this difficulty, which on its own is an argument in favor of more AI researchers working in robotics. The beauty of HRI is that it also explicitly and continually thinks about real human beings, which I think is an important mental skill to practice if you want to protect humanity collectively from existential disasters.

HRI neglect:

A neglect score for this area was uniquely difficult for me to specify. On one hand, HRI is a relatively established and vibrant area of research compared with some of the more nascent areas covered in this post. On the other hand, as mentioned, I’d eventually like to see the entirety of AI/ML as a field pivoting toward HRI work, which means it is still very neglected compared to where I want to see it. Furthermore, I think such a pivot is actually reasonable to achieve over the next 20-30 years. Further still, I think industrial incentives might eventually support this pivot, perhaps on a similar timescale.

So: if the main reason you care about neglect is that you are looking to produce a strong founder effect, you should probably discount my numerical neglect scores for this area, given that it’s not particularly “small” on an absolute scale compared to the other areas here. By that metric, I’d have given something more like {2015: 4/10; 2020: 3/10; 2030: 2/10}. On the other hand, if you’re an AI/ML researcher looking to “do the right thing” by switching to an area that pretty much everyone should switch into, you definitely have my “doing the right thing” assessment if you switch into this area, which is why I’ve given it somewhat higher neglect scores.

HRI exemplars:

Side-effect minimization (SEM)

| Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect |
|---|---|---|---|---|---|---|
| Side-effect Minimization | Single/Single | 4/10 | 4/10 | 6/10 | 5/10 | 4/10 |

SEM research is concerned with developing domain-general methods for making AI systems less likely to produce side effects, especially negative side effects, in the course of pursuing an objective or task.
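
As a toy illustration of the general idea behind many SEM proposals, here is a hypothetical sketch of mine (not from the papers cited below) that augments a task reward with a crude penalty for impact relative to a "do nothing" baseline:

```python
# Side-effect penalty sketch: task reward minus a penalty for impact on the
# environment relative to the state that would obtain if the agent did nothing.
import numpy as np

baseline_state = np.array([1, 1, 1, 1, 1])   # world if the agent does nothing

actions = {
    # name: (task_reward, resulting world state)
    "careful_route": (1.0, np.array([1, 1, 1, 1, 0])),   # small side effect
    "fast_route":    (1.2, np.array([0, 0, 1, 0, 0])),   # trashes the environment
    "do_nothing":    (0.0, baseline_state),
}

def penalized_value(task_reward, state, lam=0.3):
    """Task reward minus an impact penalty: here, how many state features
    differ from the no-op baseline (a deliberately crude impact measure)."""
    impact = np.sum(state != baseline_state)
    return task_reward - lam * impact

for name, (r, s) in actions.items():
    print(f"{name:14s} task reward {r:.1f}  penalized value {penalized_value(r, s):+.2f}")
# With lam = 0.3 the agent prefers the careful route (0.70) over the fast one (0.00).
```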

SEM helpfulness to existential safety:

I think this area has two obvious applications to safety-in-general:

  1. (“accidents”) preventing an AI agent from “messing up” when performing a task for its primary stakeholder(s), and

  2. (“externalities”) preventing an AI system from generating problems for persons other than its primary stakeholders, either

    1. (“unilateral externalities”) when the system generates externalities through its unilateral actions, or

    2. (“multilateral externalities”) when the externalities are generated through the interaction of an AI system with another entity, such as a non-stakeholder or another AI system.

I think the application to externalities is more important and valuable than the application to accidents, because I think externalities are (even) harder to detect and avoid than accidents. Moreover, I think multilateral externalities are (even!) harder to avoid than unilateral externalities.

Currently, SEM research is focussed mostly on accidents, which is why I’ve only given it a moderate score on the helpfulness scale. Conceptually, it does make sense to focus on accidents first, then unilateral externalities, and then multilateral externalities, because of the increasing difficulty in addressing them.

However, the need to address multilateral externalities will arise very quickly after unilateral externalities are addressed well enough to roll out legally admissible products, because most of our legal systems have an easier time defining and punishing negative outcomes that have a responsible party. I don’t believe this is a quirk of human legal systems: when two imperfectly aligned agents interact, they complexify each other’s environment in a way that consumes more cognitive resources than interacting with a non-agentic environment. (This is why MARL and self-play are seen as powerful curricula for learning.) Thus, there is less cognitive “slack” to think about non-stakeholders in a multi-agent setting than in a single-agent setting.

For this reason, I think work that makes it easy for AI systems and their designers to achieve common knowledge around how the systems should avoid producing externalities is very valuable.

SEM educational value:

I think SEM research thus far is of moderate educational value, mainly just to kickstart your thinking about side effects.

SEM neglect:

Domain-general side-effect minimization for AI is a relatively new area of research, and is still somewhat neglected. Moreover, I suspect it will remain neglected, because of the aforementioned tendency for our legal system to pay too little attention to multilateral externalities, a key source of negative side effects for society.

SEM exemplars:

Recent exemplars of value to existential safety, mostly via starting to think about the generalized concept of side effects at all:

Interpretability in ML (IntML)

| Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect |
|---|---|---|---|---|---|---|
| Interpretability in ML | Single/Single | 8/10 | 6/10 | 8/10 | 6/10 | 2/10 |

Interpretability research is concerned with making the reasoning and decisions of AI systems more interpretable to humans. Interpretability is closely related to transparency and explainability. Not all authors treat these three concepts as distinct; however, I think when a useful distinction is drawn between them, it often looks something like this:

  • a system is “transparent” if it is easy for human users or developers to observe and track important parameters of its internal state;

  • a system is “explainable” if useful explanations of its reasoning can be produced after the fact; and

  • a system is “interpretable” if its reasoning is structured in a manner that does not require additional engineering work to produce accurate human-legible explanations.

In other words, interpretable systems are systems with the property that transparency is adequate for explainability: when we look inside them, we find they are structured in a manner that does not require much additional explanation. I see Professor Cynthia Rudin as the primary advocate for this distinguished notion of interpretability, and I find it to be an important concept to distinguish.
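
As a small illustration of interpretability in this sense, here is a hypothetical sketch of mine (not from Rudin's work or the papers cited below): a depth-limited decision tree whose internal structure already is a human-legible explanation, so no post-hoc explanation step is needed:

```python
# Interpretability-by-construction: the fitted model's internals are themselves
# a readable decision procedure, so transparency is adequate for explainability.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Printing the model's internals yields a human-legible rule set directly.
print(export_text(model, feature_names=list(data.feature_names)))
```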

IntML helpfulness to existential safety:

I think interpretability research contributes to existential safety in a fairly direct way on the margin today. Specifically, progress in interpretability will

  • decrease the degree to which human AI developers will end up misjudging the properties of the systems they build,

  • increase the degree to which systems and their designers can be held accountable for the principles those systems embody, perhaps even before those principles have a chance to manifest in significant negative societal-scale consequences, and

  • potentially increase the degree to which competing institutions and nations can establish cooperation and international treaties governing AI-heavy operations.

I believe this last point may turn out to be the most important application of interpretability work. Specifically, I think institutions that use a lot of AI technology (including but not limited to powerful autonomous AI systems) could become opaque to one another in a manner that hinders cooperation between and governance of those systems. By contrast, a degree of transparency between entities can facilitate cooperative behavior, a phenomenon which has been borne out in some of the agent foundations work listed above, specifically:

In other words, I think interpretability research can enable technologies that legitimize and fulfill AI governance demands, narrowing the gap between what policy makers will wish for and what technologists will agree is possible.

IntML educational value:

I think interpretability research is of moderately high educational value for thinking about existential safety, because some research in this area is somewhat surprising in terms of showing ways to maintain interpretability without sacrificing much in the way of performance. This can change our expectations about how society can and should be structured to maintain existential safety, by changing the degree of interpretability we can and should expect from AI-heavy institutions and systems.

IntML neglect:

I think IntML is fairly neglected today relative to its value. However, over the coming decade, I think there will be opportunities for companies to speed up their development workflows by improving the interpretability of systems to their developers. In fact, I think for many companies interpretability is going to be a crucial bottleneck for advancing their product development. These developments won’t be my favorite applications of interpretability, and I might eventually become less excited about contributions to interpretability if all of the work seems oriented on commercial or militarized objectives instead of civic responsibilities. But in any case, I think getting involved with interpretability research today is a pretty robustly safe and valuable career move for any up-and-coming AI researchers, especially if they do their work with an eye toward existential safety.

IntML exemplars:

Recent exemplars of high value to existential safety:

Fairness in ML (FairML)

| Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect |
|---|---|---|---|---|---|---|
| Fairness in ML | Multi/Single | 6/10 | 5/10 | 7/10 | 3/10 | 2/10 |

Fairness research in machine learning is typically concerned with altering or constraining learning systems to make sure their decisions are “fair” according to a variety of definitions of fairness.
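
As a minimal illustration, here is a hypothetical sketch of mine (not from the FairML literature cited below) computing one common fairness diagnostic, a demographic-parity-style gap in positive-decision rates between two groups:

```python
# Compare a decision rule's positive-decision rates across two groups.
import numpy as np

rng = np.random.default_rng(0)

group = rng.integers(0, 2, size=1000)                  # protected attribute (0 or 1)
# A toy decision rule that (unfairly) favors group 1.
decision = (rng.random(1000) < np.where(group == 1, 0.6, 0.4)).astype(int)

rate_0 = decision[group == 0].mean()
rate_1 = decision[group == 1].mean()
print(f"positive-decision rate, group 0: {rate_0:.2f}")
print(f"positive-decision rate, group 1: {rate_1:.2f}")
print(f"demographic parity gap: {abs(rate_1 - rate_0):.2f}")
# Fairness-constrained training methods try to keep gaps like this small while
# preserving accuracy; which gap to constrain is itself a contested value question.
```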

FairML helpfulness to existential safety:

My hope for FairML as a field contributing to existential safety is fourfold:

  1. (societal-scale thinking) Fairness comprises one or more human values that exist in service of society as a whole, and which are currently difficult to encode algorithmically, especially in a form that will garner unchallenged consensus. Getting more researchers to think in the framing “How do I encode a value that will serve society as a whole in a broadly agreeable way?” is good for big-picture thinking and hence for society-scale safety problems.

  2. (social context awareness) FairML gets researchers to “take off their blinders” to the complexity of society surrounding them and their inventions. I think this trend is gradually giving AI/ML researchers a greater sense of social and civic responsibility, which I think reduces existential risk from AI/ML.

  3. (sensitivity to unfair uses of power) Simply put, it’s unfair to place all of humanity at risk without giving all of humanity a chance to weigh in on that risk. More focus within CS on fairness as a human value could help alleviate this risk. Specifically, fairness debates often trigger redistributions of resources in a more equitable manner, thus working against the over-centralization of power within a given group. I have some hope that fairness considerations will work against the premature deployment of powerful AI/ML systems that would lead to a hyper-centralization of power over the world (and hence would pose acute global risks by being a single point of failure).

  4. (fulfilling and legitimizing governance demands) Fairness research can be used to fulfill and legitimize AI governance demands, narrowing the gap between what policy makers wish for and what technologists agree is possible. This process makes AI as a field more amenable to governance, thereby improving existential safety.

FairML educational value:

I think FairML research is of moderate educational value for thinking about existential safety, mainly via the opportunities it creates for thinking about the points in the section on helpfulness above. If the field were more mature, I would assign it a higher educational value.

I should also flag that most work in FairML has not been done with existential safety in mind. Thus, I’m very much hoping that more people who care about existential safety will learn about FairML and begin thinking about how principles of fairness can be leveraged to ensure societal-scale safety in the not-too-distant future.

FairML neglect:

FairML is not a particularly neglected area at the moment because there is a lot of excitement about it, and I think it will continue to grow. However, it was relatively neglected 5 years ago, so there is still a lot of room for new ideas in the space. Also, as mentioned, thinking in FairML is not particularly oriented toward existential safety, so in my opinion research on fairness in service of societal-scale safety remains quite neglected.

FairML ex­em­plars:

Re­cent ex­em­plars of high value to ex­is­ten­tial safety, mostly via at­ten­tion to the prob­lem of difficult-to-cod­ify so­cietal-scale val­ues:

Com­pu­ta­tional So­cial Choice (CSC)

Existing Research Area: Computational Social Choice
Social Application: Multi/Single
Helpfulness to Existential Safety: 7/10
Educational Value: 7/10
2015 Neglect: 7/10
2020 Neglect: 5/10
2030 Neglect: 4/10

Computational social choice research is concerned with using algorithms to model and implement group-level decisions using individual-scale information and behavior as inputs. I view CSC as a natural next step in the evolution of social choice theory that is more attentive to the implementation details of both agents and their environments. In my conception, CSC subsumes topics in mechanism design and algorithmic game theory, even if researchers in those areas don't consider themselves to be working in computational social choice.
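To make the "individual inputs to group-level decisions" framing concrete, here is a minimal sketch of one classical aggregation rule, the Borda count. The ballots and option names are hypothetical; actual CSC research studies many such rules together with their incentive, fairness, and computational properties.

```python
# Illustrative sketch only: aggregating ranked individual preferences into a
# single group decision with the classic Borda rule. Names are hypothetical.
from collections import defaultdict

def borda_winner(ballots: list[list[str]]) -> str:
    """Each ballot ranks all options, best first; an option ranked i-th
    out of m (0-indexed) receives m - 1 - i points. Highest total wins."""
    scores: dict[str, int] = defaultdict(int)
    for ballot in ballots:
        m = len(ballot)
        for i, option in enumerate(ballot):
            scores[option] += m - 1 - i
    return max(scores, key=scores.get)

ballots = [
    ["policy_A", "policy_B", "policy_C"],
    ["policy_B", "policy_A", "policy_C"],
    ["policy_B", "policy_C", "policy_A"],
]
print(borda_winner(ballots))  # policy_B (totals: A=3, B=5, C=1)
```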

CSC helpfulness to existential safety:

In short, computational social choice research will be necessary to legitimize and fulfill governance demands for technology companies (automated and human-run companies alike) to ensure AI technologies are beneficial to and controllable by human society. The process of succeeding or failing to legitimize such demands will lead to improving and refining what I like to call the algorithmic social contract: whatever broadly agreeable set of principles (if any) algorithms are expected to obey in relation to human society.

In 2018, I considered writing an article drawing more attention to the importance of developing an algorithmic social contract, but found this point had already been made quite eloquently by Iyad Rahwan in the following paper, which I highly recommend:

Com­pu­ta­tional so­cial choice meth­ods in their cur­rent form are cer­tainly far from pro­vid­ing ad­e­quate and com­plete for­mu­la­tions of an al­gorith­mic so­cial con­tract. See the fol­low­ing ar­ti­cle for ar­gu­ments against tun­nel-vi­sion on com­pu­ta­tional so­cial choice as a com­plete solu­tion to so­cietal-scale AI ethics:

Notwith­stand­ing this con­cern, what fol­lows is a some­what de­tailed fore­cast of how I think com­pu­ta­tional so­cial choice re­search will still have a cru­cial role to play in de­vel­op­ing the al­gorith­mic so­cial con­tract through­out the de­vel­op­ment of in­di­vi­d­u­ally-al­ignable trans­for­ma­tive AI tech­nolo­gies, which I’ll call “the al­ign­ment rev­olu­tion”.

First, once technology companies begin to develop individually-alignable transformative AI capabilities, there will be strong economic, social, and political pressures for their developers to sell those capabilities rather than hoarding them. Specifically:

  • (economic pressure) Selling capabilities immediately garners resources in the form of money and information from the purchasers and users of the capabilities;

  • (social pressure) Hoarding capabilities could be seen as anti-social relative to distributing them more broadly through sales or free services;

  • (sociopolitical pressure) Selling capabilities allows society to become aware that those capabilities exist, enabling a smoother transition to embracing those capabilities. This creates a broadly agreeable concrete moral argument against capability hoarding, which could become politically relevant.

  • (political pressure) Political elites will be happier if technical elites "share" their capabilities with the rest of the economy rather than hoarding them.

Second, for the above reasons, I expect individually-alignable transformative AI capabilities to be distributed fairly broadly once they exist, creating an "alignment revolution" arising from those capabilities. (It's possible I'm wrong about this, and for that reason I also welcome research on how to align non-distributed alignment capabilities; that's just not where most of my chips lie, and not where the rest of this argument will focus.)

Third, un­less hu­man­ity col­lec­tively works very hard to main­tain a de­gree of sim­plic­ity and leg­i­bil­ity in the over­all struc­ture of so­ciety*, this “al­ign­ment rev­olu­tion” will greatly com­plex­ify our en­vi­ron­ment to a point of much greater in­com­pre­hen­si­bil­ity and illeg­i­bil­ity than even to­day’s world. This, in turn, will im­pov­er­ish hu­man­ity’s col­lec­tive abil­ity to keep abreast of im­por­tant in­ter­na­tional de­vel­op­ments, as well as our abil­ity to hold the in­ter­na­tional econ­omy ac­countable for main­tain­ing our hap­piness and ex­is­tence.

(*Would-be-foot­note: I have some rea­sons to be­lieve that per­haps we can and should work harder to make the global struc­ture of so­ciety more leg­ible and ac­countable to hu­man wellbe­ing, but that is a topic for an­other ar­ti­cle.)

Fourth, in such a world, algorithms will be needed to hold the aggregate global behavior of algorithms accountable to human wellbeing, because things will be happening too quickly for humans to monitor. In short, an "algorithmic government" will be needed to govern "algorithmic society". Some might argue this is not strictly necessary: in the absence of a mathematically codified algorithmic social contract, humans could in principle coordinate to cease or slow down the use of these powerful new alignment technologies, in order to give ourselves more time to adjust to and govern their use. However, for all our successes in innovating laws and governments, I do not believe current human legal norms are quite developed enough to stably manage a global economy empowered with individually-alignable transformative AI capabilities.

Fifth, I do think our cur­rent global le­gal norms are much bet­ter than what many com­puter sci­en­tists naively proffer as re­place­ments for them. My hope is that more re­sources and in­fluence will slowly flow to­ward the ar­eas of com­puter sci­ence most in touch with the nu­ances and com­plex­ities of cod­ify­ing im­por­tant so­cietal-scale val­ues. In my opinion, this work is mostly con­cen­trated in and around com­pu­ta­tional so­cial choice, to some ex­tent mechanism de­sign, and morally ad­ja­cent yet con­cep­tu­ally nascent ar­eas of ML re­search such as fair­ness and in­ter­pretabil­ity.

While there is cur­rently an in­creas­ing flurry of (well-de­served) ac­tivity in fair­ness and in­ter­pretabil­ity re­search, com­pu­ta­tional so­cial choice is some­what more ma­ture, and has a lot for these younger fields to learn from. This is why I think CSC work is cru­cial to ex­is­ten­tial safety: it is the area of com­puter sci­ence most tai­lored to evoke re­flec­tion on the global struc­ture of so­ciety, and the most ma­ture in do­ing so.

So what does all this have to do with existential safety? Unfortunately, while CSC is significantly more mature as a field than interpretable ML or fair ML, it is still far from ready to fulfill governance demands at the ever-increasing speed and scale needed to ensure existential safety in the wake of individually-alignable transformative AI technologies. Moreover, I think punting these questions to future AI systems to solve for us is a terrible idea, because doing so impoverishes our ability to sanity-check whether those AI systems are giving us reasonable answers to our questions about social choice. So, on the margin I think contributions to CSC theory are highly valuable, especially by persons thinking about existential safety as the objective of their research.

CSC ed­u­ca­tional value:

Learn­ing about CSC is nec­es­sary for con­tri­bu­tions to CSC, which I think are cur­rently needed to en­sure ex­is­ten­tially safe so­cietal-scale norms for al­igned AI sys­tems to fol­low af­ter “the al­ign­ment rev­olu­tion” if it hap­pens. So, I think CSC is highly valuable to learn about, with the caveat that most work in CSC has not been done with ex­is­ten­tial safety in mind. Thus, I’m very much hop­ing that more peo­ple who care about ex­is­ten­tial safety will learn about and be­gin con­tribut­ing to CSC in ways that steer CSC to­ward is­sues of so­cietal-scale safety.

CSC ne­glect:

As men­tioned above, I think CSC is still far from ready to fulfill gov­er­nance de­mands at the ever-in­creas­ing speed and scale that will be needed to en­sure ex­is­ten­tial safety in the wake of “the al­ign­ment rev­olu­tion”. That said, I do think over the next 10 years CSC will be­come both more im­mi­nently nec­es­sary and more pop­u­lar, as more pres­sure falls upon tech­nol­ogy com­pa­nies to make so­cietal-scale de­ci­sions. CSC will be­come still more nec­es­sary and pop­u­lar as more hu­mans and hu­man in­sti­tu­tions be­come aug­mented with pow­er­ful al­igned AI ca­pa­bil­ities that might “change the game” that our civ­i­liza­tion is play­ing. I ex­pect such ad­vance­ments to raise in­creas­ingly deep and ur­gent ques­tions about the prin­ci­ples on which our civ­i­liza­tion is built, that will need tech­ni­cal an­swers in or­der to be fully re­solved in ways that main­tain ex­is­ten­tial safety.

CSC ex­em­plars:

CSC ex­em­plars of par­tic­u­lar value and rele­vance to ex­is­ten­tial safety, mostly via their at­ten­tion to for­mal­isms for how to struc­ture so­cietal-scale de­ci­sions:

Ac­countabil­ity in ML (Ac­cML)

Existing Research Area: Accountability in ML
Social Application: Multi/Multi
Helpfulness to Existential Safety: 8/10
Educational Value: 3/10
2015 Neglect: 8/10
2020 Neglect: 7/10
2030 Neglect: 5/10

Accountability in ML (AccML) is aimed at making it easier to hold persons or institutions accountable for the effects of ML systems. Accountability depends on transparency and explainability for evaluating the principles by which a harm or mistake occurs, but it is not subsumed by these objectives.
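As one deliberately simple illustration (an illustrative sketch, not a specific proposal from the accountability literature), the snippet below keeps a hash-chained log of model decisions so that a third party can later audit who decided what and detect after-the-fact tampering. All class, field, and model names are hypothetical.

```python
# Illustrative sketch only: a hash-chained audit trail of model decisions.
# Tampering with any past entry breaks the chain of hashes, which supports
# third-party review. All names here are hypothetical.
import hashlib
import json
import time

class AuditLog:
    def __init__(self) -> None:
        self.entries: list[dict] = []
        self._prev_hash = "0" * 64  # sentinel hash for the first entry

    def record(self, model_id: str, inputs: dict, decision: str) -> None:
        entry = {
            "time": time.time(),
            "model_id": model_id,
            "inputs": inputs,
            "decision": decision,
            "prev_hash": self._prev_hash,
        }
        # Each entry's hash covers its contents plus the previous entry's hash.
        self._prev_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        entry["hash"] = self._prev_hash
        self.entries.append(entry)

log = AuditLog()
log.record("loan_model_v3", {"income": 40_000}, "deny")
log.record("loan_model_v3", {"income": 90_000}, "approve")
print(len(log.entries), log.entries[-1]["hash"][:12])
```

Of course, logging decisions is only a small technical ingredient of accountability; the harder questions are institutional, namely who is entitled to inspect such records and what consequences follow from them.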

Ac­cML helpful­ness to ex­is­ten­tial safety:

The relevance of accountability to existential safety is mainly via the principle of accountability gaining more traction in governing the technology industry. In summary, the high-level points I believe in this area are the following, which are argued for in more detail after the list:

  1. Tech com­pa­nies are cur­rently “black boxes” to out­side so­ciety, in that they can de­velop and im­ple­ment (al­most) what­ever they want within the con­fines of pri­vately owned lab­o­ra­to­ries (and other “se­cure” sys­tems), and some of the things they de­velop or im­ple­ment in pri­vate set­tings could pose sig­nifi­cant harms to so­ciety.

  2. Soon (or already), so­ciety needs to be­come less per­mis­sive of tech com­pa­nies de­vel­op­ing highly po­tent al­gorithms, even in set­tings that would cur­rently be con­sid­ered “pri­vate”, similar to the way we treat phar­ma­ceu­ti­cal com­pa­nies de­vel­op­ing highly po­tent biolog­i­cal spec­i­mens.

  3. Points #1 and #2 mir­ror the way in which ML sys­tems them­selves are black boxes even to their cre­ators, which for­tu­nately is mak­ing some ML re­searchers un­com­fortable enough to start hold­ing con­fer­ences on ac­countabil­ity in ML.

  4. More re­searchers get­ting in­volved in the task of defin­ing and mon­i­tor­ing ac­countabil­ity can help tech com­pany em­ploy­ees and reg­u­la­tors to re­flect on the prin­ci­ple of ac­countabil­ity and whether tech com­pa­nies them­selves should be more sub­ject to it at var­i­ous scales (e.g., their soft­ware should be more ac­countable to its users and de­vel­op­ers, their de­vel­op­ers and users should be more ac­countable to the pub­lic, their ex­ec­u­tives should be more ac­countable to gov­ern­ments and civic so­ciety, etc.).

  5. In fu­tures where trans­for­ma­tive AI tech­nol­ogy is used to provide wide­spread ser­vices to many agents si­mul­ta­neously (e.g., “Com­pre­hen­sive AI ser­vices” sce­nar­ios), progress on defin­ing and mon­i­tor­ing ac­countabil­ity can help “in­fuse” those ser­vices with a greater de­gree of ac­countabil­ity and hence safety to the rest of the world.

What fol­lows is my nar­ra­tive for how and why I be­lieve the five points above.

At pre­sent, so­ciety is struc­tured such that it is pos­si­ble for a tech­nol­ogy com­pany to amass a huge amount of data and com­put­ing re­sources, and as long as their ac­tivi­ties are kept “pri­vate”, they are free to use those re­sources to ex­per­i­ment with de­vel­op­ing po­ten­tially mis­al­igned and highly po­tent AI tech­nolo­gies. For in­stance, if a tech com­pany to­mor­row de­vel­ops any of the fol­low­ing po­ten­tially highly po­tent tech­nolo­gies within a pri­vately owned ML lab, there are no pub­li­cly man­dated reg­u­la­tions re­gard­ing how they should han­dle or ex­per­i­ment with them:

  • mis­al­igned superintelligences

  • fake news generators

  • pow­er­ful hu­man be­hav­ior pre­dic­tion and con­trol tools

  • … any al­gorithm whatsoever

Moreover, there are virtually no publicly mandated regulations against knowingly or intentionally developing any of these artifacts within the confines of a privately owned lab, despite the fact that the mere existence of such an artifact poses a threat to society. This is the sense in which tech companies are "black boxes" to society, and potentially harmful as such.

(That’s point #1.)

Con­trast this situ­a­tion with the strict guidelines that phar­ma­ceu­ti­cal com­pa­nies are re­quired to ad­here to in their man­age­ment of pathogens. First, it is sim­ply ille­gal for most com­pa­nies to know­ingly de­velop syn­thetic viruses, un­less they are cer­tified to do so by demon­strat­ing a cer­tain ca­pac­ity for safe han­dling of the re­sult­ing ar­ti­facts. Se­cond, con­di­tional on hav­ing been au­tho­rized to de­velop viruses, com­pa­nies are re­quired to fol­low stan­dard­ized safety pro­to­cols. Third, com­pa­nies are sub­ject to third-party au­dits to en­sure com­pli­ance with these safety pro­to­cols, and are not sim­ply trusted to fol­low them with­out ques­tion.

Noth­ing like this is true in the tech in­dus­try, be­cause his­tor­i­cally, al­gorithms have been viewed as less po­tent so­cietal-scale risks than viruses. In­deed, pre­sent-day ac­countabil­ity norms in tech would al­low an ar­bi­trary level of dis­par­ity to de­velop between

  • the po­tency (in terms of po­ten­tial im­pact) of al­gorithms de­vel­oped in pri­vately owned lab­o­ra­to­ries, and

  • the pre­pared­ness of the rest of so­ciety to han­dle those im­pacts if the al­gorithms were re­leased (such as by ac­ci­dent, harm­ful in­tent, or poor judge­ment).

This is a mistake, and an increasingly untenable position as the power of AI and ML technology increases. In particular, a number of technology companies are intentionally trying to build artificial general intelligence, an artifact which, if released, would be much more potent than most viruses. These companies do in fact have safety researchers working internally to think about how to be safe and whether to release things. But contrast this again with pharmaceuticals. It just won't fly for a pharmaceutical company to say, "Don't worry, we don't plan to release it; we'll just make up our own rules for how to be privately safe with it." Eventually, we should probably stop accepting this position from tech companies as well.

(That’s point #2.)

For­tu­nately, even some re­searchers and de­vel­op­ers are start­ing to be­come un­com­fortable with “black boxes” play­ing im­por­tant and con­se­quen­tial roles in so­ciety, as ev­i­denced by the re­cent in­crease in at­ten­tion on both ac­countabil­ity and in­ter­pretabil­ity in ser­vice of it, for in­stance:

This kind of dis­com­fort both fuels and is fueled by de­creas­ing lev­els of blind faith in the benefits of tech­nol­ogy in gen­eral. Signs of this broader trend in­clude:

To­gether, these trends in­di­cate a de­creas­ing level of blind faith in the ad­di­tion of novel tech­nolo­gies to so­ciety, both in the form of black-box tech prod­ucts, and black-box tech com­pa­nies.

(That’s point #3.)

The Euro­pean Gen­eral Data Pro­tec­tion Reg­u­la­tion (GDPR) is a very good step for reg­u­lat­ing how tech com­pa­nies re­late with the pub­lic. I say this know­ing that GDPR is far from perfect. The rea­son it’s still ex­tremely valuable is that it has ini­tial­ized the vari­able defin­ing hu­man­ity’s col­lec­tive bar­gain­ing po­si­tion (at least within Europe, and repli­cated to some ex­tent by the CCPA) for con­trol­ling how tech com­pa­nies use data. That vari­able can now be amended and hence im­proved upon with­out first hav­ing to ask the ques­tion “Are we even go­ing to try to reg­u­late how tech com­pa­nies use data?” For a while, it wasn’t clear any ac­tion would ever be taken on this front, out­side of spe­cific do­mains like health­care and fi­nance.

How­ever, while GDPR has defined a slope for reg­u­lat­ing the use of data, we also need ac­countabil­ity for pri­vate uses of com­put­ing. As AlphaZero demon­strates, data-free com­put­ing alone is suffi­cient to de­velop su­per-hu­man strate­gic com­pe­tence in a well-speci­fied do­main.

When will it be time to disallow arbitrary private uses of computing resources, irrespective of their data sources? Is it time already? My opinions on this are outside the scope of what I intend to argue for in this post. But whenever the time comes to develop and enforce such accountability, it will probably be easier to do that if researchers and developers have spent more time thinking about what accountability is, what purposes are served by various versions of accountability, and how to achieve those kinds of accountability in both fully-automated and semi-automated systems. In other words, optimistically, more technical research on accountability in ML might result in more ML researchers transferring their awareness that «black box tech products are insufficiently accountable» to become more aware/convinced that «black box tech companies are insufficiently accountable».

(That’s point #4.)

But even if that trans­fer of aware­ness doesn’t hap­pen, au­to­mated ap­proaches to ac­countabil­ity will still have a role to play if we end up in a fu­ture with large num­bers of agents mak­ing use of AI-me­di­ated ser­vices, such as in the “Com­pre­hen­sive AI Ser­vices” model of the fu­ture. Speci­fi­cally,

  • in­di­vi­d­ual ac­tors in a CAIS econ­omy should be ac­countable to the prin­ci­ple of not pri­vately de­vel­op­ing highly po­tent tech­nolo­gies with­out ad­her­ing to pub­li­cly le­gi­t­imized and au­ditable safety pro­ce­dures, and

  • sys­tems for re­flect­ing on and up­dat­ing ac­countabil­ity struc­tures can be used to de­tect and re­me­di­ate prob­le­matic be­hav­iors in multi-agent sys­tems, in­clud­ing be­hav­iors that could yield ex­is­ten­tial risks from dis­tributed sys­tems (e.g., ex­treme re­source con­sump­tion or pol­lu­tion effects).

(That’s point #5)

Ac­cML ed­u­ca­tional value:

Un­for­tu­nately, tech­ni­cal work in this area is highly un­de­vel­oped, which is why I have as­signed this area a rel­a­tively low ed­u­ca­tional value. I hope this does not trig­ger peo­ple to avoid con­tribut­ing to it.

Ac­cML ne­glect:

Cor­re­spond­ingly, this area is highly ne­glected rel­a­tive to where I’d like it to be, on top of be­ing very small in terms of the amount of tech­ni­cal work at its core.

Ac­cML ex­em­plars:

Re­cent ex­am­ples of writ­ing in Ac­cML that I think are of par­tic­u­lar value to ex­is­ten­tial safety in­clude:

Conclusion

Thanks for read­ing! I hope this post has been helpful to your think­ing about the value of a va­ri­ety of re­search ar­eas for ex­is­ten­tial safety, or at the very least, your model of my think­ing. As a re­minder, these opinions are my own, and are not in­tended to rep­re­sent any in­sti­tu­tion of which I am a part.

Reflec­tions on scope & omissions

This post has been about:

  • Re­search, not in­di­vi­d­u­als. Some read­ers might be in­ter­ested in the ques­tion “What about so-and-so’s work at such-and-such in­sti­tu­tion?” I think that’s a fair ques­tion, but I pre­fer this post to be about ideas, not in­di­vi­d­ual peo­ple. The rea­son is that I want to say both pos­i­tive and nega­tive things about each area, whereas I’m not pre­pared to write up pub­lic state­ments of pos­i­tive and nega­tive judge­ments about peo­ple (e.g., “Such-and-such is not go­ing to suc­ceed in their ap­proach”, or “So-and-so seems fun­da­men­tally mis­guided about X”.)

  • Areas, not di­rec­tions. This post is an ap­praisal of ac­tive ar­eas of re­search—top­ics with groups of peo­ple already work­ing on them writ­ing up their find­ings. It’s pri­mar­ily not an ap­praisal of po­ten­tial di­rec­tions—ways I think ar­eas of re­search could change or be sig­nifi­cantly im­proved (al­though I do some­times com­ment on di­rec­tions I’d like to see each area tak­ing). For in­stance, I think in­tent al­ign­ment is an in­ter­est­ing topic, but the cur­rent paucity of pub­li­cly available tech­ni­cal writ­ing on it makes it difficult to cri­tique. As such, I think of in­tent al­ign­ment as a “di­rec­tion” that AI al­ign­ment re­search could be taken in, rather than an “area”.