AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

Just a year ago we re­leased a two part epi­sode ti­tled An Overview of Tech­ni­cal AI Align­ment with Ro­hin Shah. That con­ver­sa­tion pro­vided de­tails on the views of cen­tral AI al­ign­ment re­search or­ga­ni­za­tions and many of the on­go­ing re­search efforts for de­sign­ing safe and al­igned sys­tems. Much has hap­pened in the past twelve months, so we’ve in­vited Ro­hin — along with fel­low re­searcher Buck Sh­legeris — back for a fol­low-up con­ver­sa­tion. To­day’s epi­sode fo­cuses es­pe­cially on the state of cur­rent re­search efforts for benefi­cial AI, as well as Buck’s and Ro­hin’s thoughts about the vary­ing ap­proaches and the difficul­ties we still face. This pod­cast thus serves as a non-ex­haus­tive overview of how the field of AI al­ign­ment has up­dated and how think­ing is pro­gress­ing.

Topics dis­cussed in this epi­sode in­clude:

  • Rohin’s and Buck’s optimism and pessimism about different approaches to aligned AI

  • Tra­di­tional ar­gu­ments for AI as an x-risk

  • Model­ing agents as ex­pected util­ity maximizers

  • Am­bi­tious value learn­ing and speci­fi­ca­tion learn­ing/​nar­row value learning

  • Agency and optimization

  • Robustness

  • Scal­ing to su­per­hu­man abilities

  • Universality

  • Im­pact regularization

  • Causal mod­els, or­a­cles, and de­ci­sion theory

  • Dis­con­tin­u­ous and con­tin­u­ous take­off scenarios

  • Prob­a­bil­ity of AI-in­duced ex­is­ten­tial risk

  • Timelines for AGI

  • In­for­ma­tion hazards

You can find the page for this podcast here: https://futureoflife.org/2020/04/15/an-overview-of-technical-ai-alignment-in-2018-and-2019-with-buck-shlegeris-and-rohin-shah/

Transcript

Note: The fol­low­ing tran­script has been ed­ited for style and clar­ity.


Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today we have a special episode with Buck Shlegeris and Rohin Shah that serves as a review of progress in technical AI alignment over 2018 and 2019. This episode serves as an awesome bird’s-eye view of the varying focus areas of technical AI alignment research and also helps to develop a sense of the field. I found this conversation to be super valuable for helping me to better understand the state and current trajectory of technical AI alignment research. This podcast covers traditional arguments for AI as an x-risk, what AI alignment is, the modeling of agents as expected utility maximizers, iterated distillation and amplification, AI safety via debate, agency and optimization, value learning, robustness, scaling to superhuman abilities, and more. The structure of this podcast is based on Rohin’s AI Alignment Forum post titled AI Alignment 2018-19 Review. That post is an excellent resource to take a look at in addition to this podcast. Rohin also had a conversation with us just about a year ago titled An Overview of Technical AI Alignment with Rohin Shah. This episode serves as a follow-up to that overview and as an update on what’s been going on in the field. You can find a link for it on the page for this episode.

Buck Sh­legeris is a re­searcher at the Ma­chine In­tel­li­gence Re­search In­sti­tute. He tries to work to make the fu­ture good for sen­tient be­ings and cur­rently be­lieves that work­ing on ex­is­ten­tial risk from ar­tifi­cial in­tel­li­gence is the best way of do­ing this. Buck worked as a soft­ware en­g­ineer at PayPal be­fore join­ing MIRI, and was the first em­ployee at Triple­byte. He pre­vi­ously stud­ied at the Aus­tralian Na­tional Univer­sity, ma­jor­ing in CS and minor­ing in math and physics, and he has pre­sented work on data struc­ture syn­the­sis at in­dus­try con­fer­ences.

Ro­hin Shah is a 6th year PhD stu­dent in Com­puter Science at the Cen­ter for Hu­man-Com­pat­i­ble AI at UC Berkeley. He is in­volved in Effec­tive Altru­ism and was the co-pres­i­dent of EA UC Berkeley for 2015-16 and ran EA UW dur­ing 2016-2017. Out of con­cern for an­i­mal welfare, Ro­hin is al­most ve­gan be­cause of the in­tense suffer­ing on fac­tory farms. He is in­ter­ested in AI, ma­chine learn­ing, pro­gram­ming lan­guages, com­plex­ity the­ory, al­gorithms, se­cu­rity, and quan­tum com­put­ing to name a few. Ro­hin’s re­search fo­cuses on build­ing safe and al­igned AI sys­tems that pur­sue the ob­jec­tives their users in­tend them to pur­sue, rather than the ob­jec­tives that were liter­ally speci­fied. He also pub­lishes the Align­ment Newslet­ter, which sum­ma­rizes work rele­vant to AI al­ign­ment. The Align­ment Newslet­ter is some­thing I highly recom­mend that you fol­low in ad­di­tion to this pod­cast.

And with that, let’s get into our re­view of AI al­ign­ment with Ro­hin Shah and Buck Sh­legeris.

To get things started here, the plan is to go through Ro­hin’s post on the Align­ment Fo­rum about AI Align­ment 2018 and 2019 In Re­view. We’ll be us­ing this as a way of struc­tur­ing this con­ver­sa­tion and as a way of mov­ing me­thod­i­cally through things that have changed or up­dated in 2018 and 2019, and to use those as a place for con­ver­sa­tion. So then, Ro­hin, you can start us off by go­ing through this doc­u­ment. Let’s start at the be­gin­ning, and we’ll move through se­quen­tially and jump in where nec­es­sary or where there is in­ter­est.

Ro­hin Shah: Sure, that sounds good. I think I started this out by talk­ing about this ba­sic anal­y­sis of AI risk that’s been hap­pen­ing for the last cou­ple of years. In par­tic­u­lar, you have these tra­di­tional ar­gu­ments, so maybe I’ll just talk about the tra­di­tional ar­gu­ment first, which ba­si­cally says that the AI sys­tems that we’re go­ing to build are go­ing to be pow­er­ful op­ti­miz­ers. When you op­ti­mize some­thing, you tend to get these sort of edge case out­comes, these ex­treme out­comes that are a lit­tle hard to pre­dict ahead of time.

You can’t just rely on tests with less powerful systems in order to predict what will happen, and so you can’t rely on your normal common sense reasoning in order to deal with this. In particular, powerful AI systems are probably going to look like expected utility maximizers due to various coherence arguments, like the von Neumann–Morgenstern utility theorem, and these expected utility maximizers have convergent instrumental sub-goals, like not wanting to be switched off, because then they can’t achieve their goal, and wanting to accumulate a lot of power and resources.

The stan­dard ar­gu­ment goes, be­cause AI sys­tems are go­ing to be built this way, they will have these con­ver­gent in­stru­men­tal sub-goals. This makes them dan­ger­ous be­cause they will be pur­su­ing goals that we don’t want.

Lu­cas Perry: Be­fore we con­tinue too much deeper into this, I’d want to ac­tu­ally start off with a re­ally sim­ple ques­tion for both of you. What is AI al­ign­ment?

Ro­hin Shah: Differ­ent peo­ple mean differ­ent things by it. When I use the word al­ign­ment, I’m usu­ally talk­ing about what has been more speci­fi­cally called in­tent al­ign­ment, which is ba­si­cally aiming for the prop­erty that the AI sys­tem is try­ing to do what you want. It’s try­ing to help you. Pos­si­bly it doesn’t know ex­actly how to best help you, and it might make some mis­takes in the pro­cess of try­ing to help you, but re­ally what it’s try­ing to do is to help you.

Buck Shlegeris: The way I would say what I mean by AI alignment, I guess I would step back a little bit and think about why it is that I care about this question at all. I think that the fundamental fact which has me interested in anything about powerful AI systems of the future is that I think they’ll be a big deal in some way or another. And when I ask myself the question “what are the kinds of things that could be problems about how these really powerful AI systems work or affect the world”, one of the things which feels like a problem is that we might not know how to apply these systems reliably to the kinds of problems which we care about, and so by default humanity will end up applying them in ways that lead to really bad outcomes. And so I guess, from that perspective, when I think about AI alignment, I think about trying to make ways of building AI systems such that we can apply them to tasks that are valuable, such that they’ll reliably pursue those tasks instead of doing something else which is really dangerous and bad.

I’m fine with intent alignment as the focus. I kind of agree with, for instance, Paul Christiano, that it’s not my problem if my AI system incompetently kills everyone; that’s the capabilities people’s problem. I just want to make the system so it’s trying to cause good outcomes.

Lu­cas Perry: Both of these un­der­stand­ings of what it means to build benefi­cial AI or al­igned AI sys­tems can take us back to what Ro­hin was just talk­ing about, where there’s this ba­sic anal­y­sis of AI risk, about AI as pow­er­ful op­ti­miz­ers and the as­so­ci­ated risks there. With that fram­ing and those defi­ni­tions, Ro­hin, can you take us back into this ba­sic anal­y­sis of AI risk?

Ro­hin Shah: Sure. The tra­di­tional ar­gu­ment looks like AI sys­tems are go­ing to be goal-di­rected. If you ex­pect that your AI sys­tem is go­ing to be goal-di­rected, and that goal is not the one that hu­mans care about, then it’s go­ing to be dan­ger­ous be­cause it’s go­ing to try to gain power and re­sources with which to achieve its goal.

If the hu­mans tried to turn it off, it’s go­ing to say, “No, don’t do that,” and it’s go­ing to try to take ac­tions that avoid that. So it pits the AI and the hu­mans in an ad­ver­sar­ial game with each other, and you ideally don’t want to be fight­ing against a su­per­in­tel­li­gent AI sys­tem. That seems bad.

Buck Sh­legeris: I feel like Ro­hin is to some ex­tent set­ting this up in a way that he’s then go­ing to ar­gue is wrong, which I think is kind of un­fair. In par­tic­u­lar, Ro­hin, I think you’re mak­ing these points about VNM the­o­rems and stuff to set up the fact that it seems like these ar­gu­ments don’t ac­tu­ally work. I feel that this makes it kind of un­fairly sound like the ear­lier AI al­ign­ment ar­gu­ments are wrong. I think this is an in­cred­ibly im­por­tant ques­tion, of whether early ar­gu­ments about the im­por­tance of AI safety were quite flawed. My im­pres­sion is that over­all the early ar­gu­ments about AI safety were pretty good. And I think it’s a very in­ter­est­ing ques­tion whether this is in fact true. And I’d be in­ter­ested in ar­gu­ing about it, but I think it’s the kind of thing that ought to be ar­gued about ex­plic­itly.

Ro­hin Shah: Yeah, sure.

Buck Sh­legeris: And I get that you were kind of say­ing it nar­ra­tively, so this is only a minor com­plaint. It’s a thing I wanted to note.

Rohin Shah: I think my position on that question of “how good were the early AI risk arguments” is that people’s internal beliefs about why AI was supposed to be risky were probably good, but the things they wrote down were not very good. Some things were good and some things weren’t. I think Intelligence Explosion Microeconomics was good. I think AI Alignment: Why It’s Hard and Where to Start was misleading.

Buck Sh­legeris: I think I agree with your sense that peo­ple prob­a­bly had a lot of rea­son­able be­liefs but that the writ­ten ar­gu­ments seem flawed. I think an­other thing that’s true is that ran­dom peo­ple like me who were on LessWrong in 2012 or some­thing, ended up hav­ing a lot of re­ally stupid be­liefs about AI al­ign­ment, which I think isn’t re­ally the fault of the peo­ple who were think­ing about it the best, but is maybe so­ciolog­i­cally in­ter­est­ing.

Ro­hin Shah: Yes, that seems plau­si­ble to me. Don’t have a strong opinion on it.

Lu­cas Perry: To provide a lit­tle bit of fram­ing here and bet­ter anal­y­sis of ba­sic AI x-risk ar­gu­ments, can you list what the start­ing ar­gu­ments for AI risk were?

Ro­hin Shah: I think I am rea­son­ably well por­tray­ing what the writ­ten ar­gu­ments were. Un­der­ly­ing ar­gu­ments that peo­ple prob­a­bly had would be some­thing more like, “Well, it sure seems like if you want to do use­ful things in the world, you need to have AI sys­tems that are pur­su­ing goals.” If you have some­thing that’s more like tool AI, like Google Maps, that sys­tem is go­ing to be good at the one thing it was de­signed to do, but it’s not go­ing to be able to learn and then ap­ply its knowl­edge to new tasks au­tonomously. It sure seems like if you want to do re­ally pow­er­ful things in the world, like run com­pa­nies or make poli­cies, you prob­a­bly do need AI sys­tems that are con­stantly learn­ing about their world and ap­ply­ing their knowl­edge in or­der to come up with new ways to do things.

In the his­tory of hu­man thought, we just don’t seem to know of a way to cause that to hap­pen ex­cept by putting goals in sys­tems, and so prob­a­bly AI sys­tems are go­ing to be goal-di­rected. And one way you can for­mal­ize goal-di­rect­ed­ness is by think­ing about ex­pected util­ity max­i­miz­ers, and peo­ple did a bunch of for­mal anal­y­sis of that. Mostly go­ing to ig­nore it be­cause I think you can just say all the same thing with the idea of pur­su­ing goals and it’s all fine.

Buck Sh­legeris: I think one im­por­tant clar­ifi­ca­tion to that, is you were say­ing the rea­son that tool AIs aren’t just the whole story of what hap­pens with AI is that you can’t ap­ply it to all prob­lems. I think an­other im­por­tant el­e­ment is that peo­ple back then, and I now, be­lieve that if you want to build a re­ally good tool, you’re prob­a­bly go­ing to end up want­ing to struc­ture that as an agent in­ter­nally. And even if you aren’t try­ing to struc­ture it as an agent, if you’re just search­ing over lots of differ­ent pro­grams im­plic­itly, per­haps by train­ing a re­ally large re­cur­rent policy, you’re go­ing to end up find­ing some­thing agent shaped.

Ro­hin Shah: I don’t dis­agree with any of that. I think we were us­ing the words tool AI differ­ently.

Buck Sh­legeris: Okay.

Ro­hin Shah: In my mind, if we’re talk­ing about tool AI, we’re imag­in­ing a pretty re­stricted ac­tion space where no mat­ter what ac­tions in this ac­tion space are taken, with high prob­a­bil­ity, noth­ing bad is go­ing to hap­pen. And you’ll search within that ac­tion space, but you don’t go to ar­bi­trary ac­tion in the real world or some­thing like that. This is what makes tool AI hard to ap­ply to all prob­lems.

Buck Sh­legeris: I would have thought that’s a pretty non-stan­dard use of the term tool AI.

Ro­hin Shah: Pos­si­bly.

Buck Sh­legeris: In par­tic­u­lar, I would have thought that re­strict­ing the ac­tion space enough that you’re safe, re­gard­less of how much it wants to hurt you, seems kind of non-stan­dard.

Ro­hin Shah: Yes. I have never re­ally liked the con­cept of tool AI very much, so I kind of just want to move on.

Lucas Perry: Hey, it’s post-podcast Lucas here. I just want to highlight a little bit of clarification that Rohin was interested in adding, which is that he thinks that tool AI evokes a sense of many different properties, and he doesn’t know which properties most people are usually thinking about; as a result he prefers not to use the phrase tool AI, and would instead like to use more precise terminology. He doesn’t necessarily feel, though, that the concepts underlying tool AI are useless. So let’s tie things a bit back to these basic arguments for x-risk that many people are familiar with, that have to do with convergent instrumental sub-goals and the difficulty of specifying and aligning systems with our goals and what we actually care about in our preference hierarchies.

One of the things here that Buck was seem­ing to bring up, he was say­ing that you may have been nar­ra­tively set­ting up the Von Neu­mann–Mor­gen­stern the­o­rem, which sets up AIs as ex­pected util­ity max­i­miz­ers, and that you are go­ing to ar­gue that that ar­gu­ment, which is sort of the for­mal­iza­tion of these ear­lier AI risk ar­gu­ments, that that is less con­vinc­ing to you now than it was be­fore, but Buck still thinks that these ar­gu­ments are strong. Could you un­pack this a lit­tle bit more or am I get­ting this right?

Ro­hin Shah: To be clear, I also agree with Buck, that the spirit of the origi­nal ar­gu­ments does seem cor­rect, though, there are peo­ple who dis­agree with both of us about that. Ba­si­cally, the VNM the­o­rem roughly says, if you have prefer­ences over a set of out­comes, and you satisfy some pretty in­tu­itive ax­ioms about how you make de­ci­sions, then you can rep­re­sent your prefer­ences us­ing a util­ity func­tion such that your de­ci­sions will always be, choose the ac­tion that max­i­mizes the ex­pected util­ity. This is, at least in writ­ing, given as a rea­son to ex­pect that AI sys­tems would be max­i­miz­ing ex­pected util­ity. The thing is, when you talk about AI sys­tems that are act­ing in the real world, they’re just se­lect­ing a uni­verse his­tory, if you will. Any ob­served be­hav­ior is com­pat­i­ble with the max­i­miza­tion of some util­ity func­tion. Utility func­tions are a re­ally, re­ally broad class of things when you ap­ply it to choos­ing from uni­verse his­to­ries.

Buck Sh­legeris: An in­tu­itive ex­am­ple of this: sup­pose that you see that ev­ery day I walk home from work in a re­ally in­effi­cient way. It’s im­pos­si­ble to know whether I’m do­ing that be­cause I hap­pened to re­ally like that path. For any se­quence of ac­tions that I take, there’s some util­ity func­tions such that that was the op­ti­mal se­quence of ac­tions. And so we don’t ac­tu­ally learn any­thing about how my policy is con­strained based on the fact that I’m an ex­pected util­ity max­i­mizer.

Lu­cas Perry: Right. If I only had ac­cess to your be­hav­ior and not your in­sides.

Ro­hin Shah: Yeah, ex­actly. If you have a robot twitch­ing for­ever, that’s all it does, there is a util­ity func­tion over a uni­verse his­tory that says that is the op­ti­mal thing to do. Every time the robot twitches to the right, it’s like, yeah, the thing that was op­ti­mal to do at that mo­ment in time was twitch­ing to the right. If at some point some­body takes a ham­mer and smashes the robot and it breaks, then the util­ity func­tion that cor­re­sponds to that be­ing op­ti­mal is like, yeah, that was the ex­act right mo­ment to break down.

If you have these patholog­i­cally com­plex util­ity func­tions as pos­si­bil­ities, ev­ery be­hav­ior is com­pat­i­ble with max­i­miz­ing ex­pected util­ity, you might want to say some­thing like, prob­a­bly we’ll have the sim­ple util­ity max­i­miz­ers, but that’s a pretty strong as­sump­tion, and you’d need to jus­tify it some­how. And the VNM the­o­rem wouldn’t let you do that.
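
To make Rohin’s point concrete, here is a minimal sketch (the construction and names are illustrative, not from the episode) of how any observed behavior can be rationalized as maximizing some utility function over universe histories:

```python
# Sketch: any observed behavior can be "rationalized" as expected utility
# maximization over universe histories. Illustrative construction only.

def rationalizing_utility(observed_history):
    """Return a utility function over complete histories under which the
    observed behavior is optimal: utility 1 for the history that actually
    happened, 0 for every other history."""
    def utility(history):
        return 1.0 if history == observed_history else 0.0
    return utility

# The twitching robot: whatever it did, that exact history gets utility 1.
twitch_history = ("twitch_right", "twitch_left", "twitch_right", "get_smashed")
u = rationalizing_utility(twitch_history)

assert u(twitch_history) == 1.0          # the observed behavior is "optimal"
assert u(("build_paperclips",)) == 0.0   # every alternative history is worse

# Nothing about this utility function predicts power-seeking or anything
# else; the VNM theorem alone does not constrain what the agent does next.
```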

Lu­cas Perry: So is the prob­lem here that you’re un­able to fully ex­tract hu­man prefer­ence hi­er­ar­chies from hu­man be­hav­ior?

Ro­hin Shah: Well, you’re un­able to ex­tract agent prefer­ences from agent be­hav­ior. You can see any agent be­hav­ior and you can ra­tio­nal­ize it as ex­pected util­ity max­i­miza­tion, but it’s not very use­ful. Doesn’t give you pre­dic­tive power.

Buck Shlegeris: I just want to have my go at saying this argument in three sentences. Once upon a time, people said that because all rational systems act like they’re maximizing an expected utility function, we should expect them to have various behaviors like trying to maximize the amount of power they have. But every set of actions that you could take is consistent with being an expected utility maximizer; therefore you can’t use the fact that something is an expected utility maximizer in order to argue that it will have a particular set of behaviors, without making a bunch of additional arguments. And I basically think that I was wrong to be persuaded by the naive argument that Rohin was describing, which just goes directly from rational things are expected utility maximizers, to therefore rational things are power maximizing.

Ro­hin Shah: To be clear, this was the thing I also be­lieved. The main rea­son I wrote the post that ar­gued against it was be­cause I spent half a year un­der the delu­sion that this was a valid ar­gu­ment.

Lu­cas Perry: Just for my un­der­stand­ing here, the view is that be­cause any be­hav­ior, any agent from the out­side can be un­der­stood as be­ing an ex­pected util­ity max­i­mizer, that there are be­hav­iors that clearly do not do in­stru­men­tal sub-goal things, like max­i­mize power and re­sources, yet those things can still be viewed as ex­pected util­ity max­i­miz­ers from the out­side. So ad­di­tional ar­gu­ments are re­quired for why ex­pected util­ity max­i­miz­ers do in­stru­men­tal sub-goal things, which are AI risky.

Ro­hin Shah: Yeah, that’s ex­actly right.

Lucas Perry: Okay. What else is on offer other than expected utility maximizers? You guys talked about how comprehensive AI services might be one. Are there other formal agentive classes of ‘thing that is not an expected utility maximizer but still has goals?’

Rohin Shah: A formalism for that? Some people, like John Wentworth for example, are thinking about markets as a model of agency. Some people like to think of multi-agent groups together leading to an emergent agency and want to model human minds this way. How formal are these? Not that formal yet.

Buck Sh­legeris: I don’t think there’s any­thing which is com­pet­i­tively pop­u­lar with ex­pected util­ity max­i­miza­tion as the frame­work for think­ing about this stuff.

Ro­hin Shah: Oh yes, cer­tainly not. Ex­pected util­ity max­i­miza­tion is used ev­ery­where. Noth­ing else comes any­where close.

Lu­cas Perry: So there’s been this com­plete fo­cus on util­ity func­tions and rep­re­sent­ing the hu­man util­ity func­tion, what­ever that means. Do you guys think that this is go­ing to con­tinue to be the pri­mary way of think­ing about and mod­el­ing hu­man prefer­ence hi­er­ar­chies? How much does it ac­tu­ally re­late to hu­man prefer­ence hi­er­ar­chies? I’m won­der­ing if it might just be sub­stan­tially differ­ent in some way.

Buck Sh­legeris: Me and Ro­hin are go­ing to dis­agree about this. I think that try­ing to model hu­man prefer­ences as a util­ity func­tion is re­ally dumb and bad and will not help you do things that are use­ful. I don’t know; If I want to make an AI that’s in­cred­ibly good at recom­mend­ing me movies that I’m go­ing to like, some kind of value learn­ing thing where it tries to learn my util­ity func­tion over movies is plau­si­bly a good idea. Even things where I’m try­ing to use an AI sys­tem as a re­cep­tion­ist, I can imag­ine value learn­ing be­ing a good idea.

But I feel ex­tremely pes­simistic about more am­bi­tious value learn­ing kinds of things, where I try to, for ex­am­ple, have an AI sys­tem which learns hu­man prefer­ences and then acts in large scale ways in the world. I ba­si­cally feel pretty pes­simistic about ev­ery al­ign­ment strat­egy which goes via that kind of a route. I feel much bet­ter about ei­ther try­ing to not use AI sys­tems for prob­lems where you have to think about large scale hu­man prefer­ences, or hav­ing an AI sys­tem which does some­thing more like mod­el­ing what hu­mans would say in re­sponse to var­i­ous ques­tions and then us­ing that di­rectly in­stead of try­ing to get a value func­tion out of it.

Ro­hin Shah: Yeah. Fun­nily enough, I was go­ing to start off by say­ing I think Buck and I are go­ing to agree on this.

Buck Sh­legeris: Oh.

Ro­hin Shah: And I think I mostly agree with the things that you said. The thing I was go­ing to say was I feel pretty pes­simistic about try­ing to model the nor­ma­tive un­der­ly­ing hu­man val­ues, where you have to get things like pop­u­la­tion ethics right, and what to do with the pos­si­bil­ity of in­finite value. How do you deal with fa­nat­i­cism? What’s up with moral un­cer­tainty? I feel pretty pes­simistic about any sort of scheme that in­volves figur­ing that out be­fore de­vel­op­ing hu­man-level AI sys­tems.

There’s a re­lated con­cept which is also called value learn­ing, which I would pre­fer to be called some­thing else, but I feel like the name’s locked in now. In my se­quence, I called it nar­row value learn­ing, but even that feels bad. Maybe at least for this pod­cast we could call it speci­fi­ca­tion learn­ing, which is sort of more like the tasks Buck men­tioned, like if you want to learn prefer­ences over movies, rep­re­sent­ing that us­ing a util­ity func­tion seems fine.

Lu­cas Perry: Like su­perfi­cial prefer­ences?

Ro­hin Shah: Sure. I usu­ally think of it as you have in mind a task that you want your AI sys­tem to do, and now you have to get your AI sys­tem to re­li­ably do it. It’s un­clear whether this should even be called a value learn­ing at this point. Maybe it’s just the en­tire al­ign­ment prob­lem. But tech­niques like in­verse re­in­force­ment learn­ing, prefer­ence learn­ing, learn­ing from cor­rec­tions, in­verse re­ward de­sign where you learn from a proxy re­ward, all of these are more try­ing to do the thing where you have a set of be­hav­iors in mind, and you want to com­mu­ni­cate that to the agent.
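
As a concrete illustration of the specification-learning techniques Rohin lists, here is a minimal sketch of preference learning with a Bradley–Terry model over trajectory features; the features, data, and hyperparameters are made up for illustration and are not from any particular paper discussed here:

```python
import numpy as np

# Fit a linear reward over trajectory features so that preferred trajectories
# score higher, using a Bradley-Terry / logistic model of human comparisons.
rng = np.random.default_rng(0)
n_features = 4
true_w = np.array([1.0, -0.5, 0.0, 2.0])   # hidden "true" reward weights

# Each comparison: the human says trajectory A is preferred over trajectory B.
comparisons = []
for _ in range(200):
    a, b = rng.normal(size=n_features), rng.normal(size=n_features)
    p_prefer_a = 1 / (1 + np.exp(-(true_w @ a - true_w @ b)))
    comparisons.append((a, b) if rng.random() < p_prefer_a else (b, a))

# Logistic regression on reward differences: P(A > B) = sigmoid(w . (phi_A - phi_B)).
w = np.zeros(n_features)
for _ in range(500):
    grad = np.zeros(n_features)
    for preferred, rejected in comparisons:
        diff = preferred - rejected
        grad += (1 - 1 / (1 + np.exp(-w @ diff))) * diff   # d(log-likelihood)/dw
    w += 0.1 * grad / len(comparisons)

print("recovered reward weights:", np.round(w, 2))   # approaches the true weights
```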

Buck Sh­legeris: The way that I’ve been think­ing about how op­ti­mistic I should be about value learn­ing or speci­fi­ca­tion learn­ing re­cently has been that I sus­pect that at the point where AI is hu­man level, by de­fault we’ll have value learn­ing which is about at hu­man level. We’re about as good at giv­ing AI sys­tems in­for­ma­tion about our prefer­ences that it can do stuff with as we are giv­ing other hu­mans in­for­ma­tion about our prefer­ences that we can do stuff with. And when I imag­ine hiring some­one to recom­mend mu­sic to me, I feel like there are prob­a­bly mu­sic nerds who could do a pretty good job of look­ing at my Spo­tify his­tory, and recom­mend­ing bands that I’d like if they spent a week on it. I feel a lot more pes­simistic about be­ing able to talk to a philoso­pher for a week, and then them an­swer hard ques­tions about my prefer­ences, es­pe­cially if they didn’t have the ad­van­tage of already be­ing hu­mans them­selves.

Ro­hin Shah: Yep. That seems right.

Buck Sh­legeris: So maybe that’s how I would sep­a­rate out the speci­fi­ca­tion learn­ing stuff that I feel op­ti­mistic about from the more am­bi­tious value learn­ing stuff that I feel pretty pes­simistic about.

Rohin Shah: I do want to note that I collated a bunch of stuff arguing against ambitious value learning. If I had to make a case for optimism about even that approach, it would look more like: under the value learning approach, it seems possible, with uncertainty over rewards, values, preferences, whatever you want to call them, to get an AI system that you actually are able to change, because it would reason that if you’re trying to change it, that means something about it is currently not good for helping you, and so it would be better to let itself be changed. I’m not very convinced by this argument.

Buck Sh­legeris: I feel like if you try to write down four differ­ent util­ity func­tions that the agent is un­cer­tain be­tween, I think it’s just ac­tu­ally re­ally hard for me to imag­ine con­crete sce­nar­ios where the AI is cor­rigible as a re­sult of its un­cer­tainty over util­ity func­tions. Imag­ine the AI sys­tem thinks that you’re go­ing to switch it off and re­place it with an AI sys­tem which has a differ­ent method of in­fer­ring val­ues from your ac­tions and your words. It’s not go­ing to want to let you do that, be­cause its util­ity func­tion is to have the world be the way that is ex­pressed by your util­ity func­tion as es­ti­mated the way that it ap­prox­i­mates util­ity func­tions. And so be­ing re­placed by a thing which es­ti­mates util­ity func­tions or in­fers util­ity func­tions some other way means that it’s very un­likely to get what it ac­tu­ally wants, and other ar­gu­ments like this. I’m not sure if these are su­per old ar­gu­ments that you’re five lev­els of counter-ar­gu­ments to.

Ro­hin Shah: I definitely know this ar­gu­ment. I think the prob­lem of fully up­dated defer­ence is what I would nor­mally point to as rep­re­sent­ing this gen­eral class of claims and I think it’s a good counter ar­gu­ment. When I ac­tu­ally think about this, I sort of start get­ting con­fused about what it means for an AI sys­tem to ter­mi­nally value the fi­nal out­put of what its value learn­ing sys­tem would do. It feels like some ad­di­tional no­tion of how the AI chooses ac­tions has been posited, that hasn’t ac­tu­ally been cap­tured in the model and so I feel fairly un­cer­tain about all of these ar­gu­ments and kind of want to defer to the fu­ture.

Buck Sh­legeris: I think the thing that I’m de­scribing is just what hap­pens if you read the al­gorithm liter­ally. Like, if you read the value learn­ing al­gorithm liter­ally, it has this no­tion of the AI sys­tem wants to max­i­mize the hu­man’s ac­tual util­ity func­tion.

Ro­hin Shah: For an op­ti­mal agent play­ing a CIRL (co­op­er­a­tive in­verse re­in­force­ment learn­ing) game, I agree with your ar­gu­ment. If you take op­ti­mal­ity as defined in the co­op­er­a­tive in­verse re­in­force­ment learn­ing pa­per and it’s play­ing over a long pe­riod of time, then yes, it’s definitely go­ing to pre­fer to keep it­self in charge rather than a differ­ent AI sys­tem that would in­fer val­ues in a differ­ent way.
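
The corrigibility-from-uncertainty argument and Buck’s pushback can be made concrete with a toy version of the off-switch game from Hadfield-Menell et al.; the numbers and the model of the human below are made up for illustration:

```python
# Toy off-switch game: the robot's proposed action is good for the human
# (utility +1) with probability p_good, and bad (utility -1) otherwise.

def value_of_acting(p_good):
    # Act immediately, ignoring the human.
    return p_good * 1.0 + (1 - p_good) * (-1.0)

def value_of_deferring(p_good, p_human_correct=1.0):
    # Let the human decide; the human allows the action only when it is good,
    # but may err with probability 1 - p_human_correct (modeled crudely here
    # as equivalent to the robot just acting).
    value_if_correct = p_good * 1.0   # bad actions get blocked (utility 0)
    return p_human_correct * value_if_correct + (1 - p_human_correct) * value_of_acting(p_good)

for p in (0.9, 0.6, 0.3):
    print(p, value_of_acting(p), value_of_deferring(p))

# With a perfectly reliable human, deferring is never worse than acting, so
# reward uncertainty makes the robot happy to leave the off-switch alone.
# Once the robot has "fully updated", or models the human as unreliable,
# that advantage shrinks, which is the worry Buck is pointing at.
```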

Lu­cas Perry: It seems like so far util­ity func­tions are the best way of try­ing to get an un­der­stand­ing of what hu­man be­ings care about and value and have prefer­ences over, you guys are bring­ing up all of the difficult in­tri­ca­cies with try­ing to un­der­stand and model hu­man prefer­ences as util­ity func­tions. One of the things that you also bring up here, Ro­hin, in your re­view, is the risk of lock-in, which may re­quire us to solve hard philo­soph­i­cal prob­lems be­fore the de­vel­op­ment of AGI. That has some­thing to do with am­bi­tious value learn­ing, which would be like learn­ing the one true hu­man util­ity func­tion which prob­a­bly just doesn’t ex­ist.

Buck Sh­legeris: I think I want to ob­ject to a lit­tle bit of your fram­ing there. My stance on util­ity func­tions of hu­mans isn’t that there are a bunch of com­pli­cated sub­tleties on top, it’s that mod­el­ing hu­mans with util­ity func­tions is just a re­ally sad state to be in. If your al­ign­ment strat­egy in­volves posit­ing that hu­mans be­have as ex­pected util­ity max­i­miz­ers, I am very pes­simistic about it work­ing in the short term, and I just think that we should be try­ing to com­pletely avoid any­thing which does that. It’s not like there’s a bunch of com­pli­cated sub-prob­lems that we need to work out about how to de­scribe us as ex­pected util­ity max­i­miz­ers, my best guess is that we would just not end up do­ing that be­cause it’s not a good idea.

Lu­cas Perry: For the am­bi­tious value learn­ing?

Buck Sh­legeris: Yeah, that’s right.

Lu­cas Perry: Okay, do you have some­thing that’s on offer?

Buck Shlegeris: The two options instead of that, which seem attractive to me? As I said earlier, one is you just convince everyone to not use AI systems for things where you need to have an understanding of large scale human preferences. The other one is the kind of thing that Paul Christiano’s iterated distillation and amplification, or a variety of his other ideas, are trying to get at. The idea is, I think, that if you make a really powerful AI system, it’s actually going to have an excellent model of human values in whatever representation is best for actually making predictions about humans, because for a really excellent AGI, like a really excellent paperclip maximizer, it’s really important for it to really get how humans work so that it can manipulate them into letting it build lots of paperclip factories or whatever.

So I think that if you think that we have AGI, then by assumption I think we have a system which is able to reason about human values if it wants. And so if we can apply these really powerful AI systems to tasks such that the things that they do display their good understanding of human values, then we’re fine, and it’s just okay that there was no way that we could represent a utility function directly. So for instance, the idea in IDA is that if we could have this system which is just trying to answer questions the same way that humans would, but enormously more cheaply because it can run faster than humans and a few other tricks, then we don’t have to worry about writing down a utility function of humans directly, because we can just make the system do things that are kind of similar to the things humans would have done, and so it implicitly has this human utility function built into it. That’s option two. Option one is don’t use anything that requires a complex human utility function; option two is have your systems learn human values implicitly, by giving them a task such that this is beneficial for them and such that their good understanding of human values comes out in their actions.

Ro­hin Shah: One way I might con­dense that point, is that you’re ask­ing for a nice for­mal­ism for hu­man prefer­ences and I just point to all the hu­mans out there in the world who don’t know any­thing about util­ity func­tions, which is 99% of them and nonethe­less still seem pretty good at in­fer­ring hu­man prefer­ences.
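
For readers who want the shape of the iterated distillation and amplification loop Buck is gesturing at, here is a very rough toy sketch; the “human”, the “model”, and the question decomposition are stand-ins invented for illustration, not Paul Christiano’s actual proposal:

```python
# Toy IDA loop: amplify (human + current assistant answer a question by
# decomposing it), then distill (train a cheap model to imitate that).

def human_answer(question, assistant):
    """The amplified system: a human who consults the assistant on
    subquestions. Here 'questions' are just nested sums for concreteness."""
    if isinstance(question, int):
        return question
    left, right = question                      # decompose into subquestions
    return assistant(left) + assistant(right)   # human combines sub-answers

def distill(amplified, training_questions):
    """'Train' a cheap model to imitate the amplified system (here: memoize)."""
    table = {q: amplified(q) for q in training_questions}
    return lambda q: table.get(q, 0)

model = lambda q: 0                             # initial, useless assistant
questions = [3, 5, (3, 5), ((3, 5), 5)]
for _ in range(3):                              # iterate amplification + distillation
    amplified = lambda q, m=model: human_answer(q, m)
    model = distill(amplified, questions)

print(model(((3, 5), 5)))   # 13: behavior the human-plus-assistant combination endorses
```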

Lu­cas Perry: On this part about AGI, if it is AGI it should be able to rea­son about hu­man prefer­ences, then why would it not be able to con­struct some­thing that was more ex­plicit and thus was able to do more am­bi­tious value learn­ing?

Buck Sh­legeris: So it can to­tally do that, it­self. But we can’t force that struc­ture from the out­side with our own al­gorithms.

Ro­hin Shah: Image clas­sifi­ca­tion is a good anal­ogy. Like, in the past we were us­ing hand en­g­ineered fea­tures, namely SIFT and HOG and then train­ing clas­sifiers over these hand en­g­ineered fea­tures in or­der to do image clas­sifi­ca­tion. And then we came to the era of deep learn­ing and we just said, yeah, throw away all those fea­tures and just do ev­ery­thing end to end with a con­volu­tional neu­ral net and it worked way bet­ter. The point was that, in fact there are good rep­re­sen­ta­tions for most tasks and hu­mans try­ing to write them down ahead of time just doesn’t work very well at that. It tends to work bet­ter if you let the AI sys­tem dis­cover its own rep­re­sen­ta­tions that best cap­ture the thing you wanted to cap­ture.

Lu­cas Perry: Can you un­pack this point a lit­tle bit more? I’m not sure that I’m com­pletely un­der­stand­ing it. Buck is re­ject­ing this mod­el­ing hu­man be­ings ex­plic­itly as ex­pected util­ity max­i­miz­ers and try­ing to ex­plic­itly come up with util­ity func­tions in our AI sys­tems. The first was to con­vince peo­ple not to use these kinds of things. And the sec­ond is to make it so that the be­hav­ior and out­put of the AI sys­tems has some im­plicit un­der­stand­ing of hu­man be­hav­ior. Can you un­pack this a bit more for me or give me an­other ex­am­ple?

Ro­hin Shah: So here’s an­other ex­am­ple. Let’s say I was teach­ing my kid that I don’t have, how to catch a ball. It seems that the for­mal­ism that’s available to me for learn­ing how to catch a ball is, well, you can go all the way down to look at our best mod­els of physics, we could use New­to­nian me­chan­ics let’s say, like here are these equa­tions, es­ti­mate the ve­loc­ity and the dis­tance of the ball and the an­gle at which it’s thrown plug that into these equa­tions and then pre­dict that the ball’s go­ing to come here and then just put your hand there and then mag­i­cally catch it. We won’t even talk about the catch­ing part. That seems like a pretty shitty way to teach a kid how to catch a ball.

Prob­a­bly it’s just a lot bet­ter to just play catch with the kid for a while and let the kid’s brain figure out this is how to pre­dict where the ball is go­ing to go such that I can pre­dict where it’s go­ing to be and then catch it.

I’m ba­si­cally 100% con­fi­dent that the thing that the brain is do­ing is not New­to­nian me­chan­ics. It’s do­ing some­thing else that’s just way more effi­cient at pre­dict­ing where the ball is go­ing to be so that I can catch it and if I forced the brain to use New­to­nian me­chan­ics, I bet it would not do very well at this task.

Buck Sh­legeris: I feel like that still isn’t quite say­ing the key thing here. I don’t know how to say this off the top of my head ei­ther, but I think there’s this key point about: just be­cause your neu­ral net can learn a par­tic­u­lar fea­ture of the world doesn’t mean that you can back out some other prop­erty of the world by forc­ing the neu­ral net to have a par­tic­u­lar shape. Does that make any sense, Ro­hin?

Ro­hin Shah: Yeah, vaguely. I mean, well, no, maybe not.

Buck Sh­legeris: The prob­lem isn’t just the ca­pa­bil­ities prob­lem. There’s this way you can try and in­fer a hu­man util­ity func­tion by ask­ing, ac­cord­ing to this model, what’s the max­i­mum like­li­hood util­ity func­tion given all these things the hu­man did. If you have a good enough model, you will in fact end up mak­ing very good pre­dic­tions about the hu­man, it’s just that the de­com­po­si­tion into their plan­ning func­tion and their util­ity func­tion is not go­ing to re­sult in a util­ity func­tion which is any­thing like a thing that I would want max­i­mized if this pro­cess was done on me. There is go­ing to be some de­com­po­si­tion like this, which is to­tally fine, but the util­ity func­tion part just isn’t go­ing to cor­re­spond to the thing that I want.

Ro­hin Shah: Yeah, that is also a prob­lem, but I agree that is not the thing I was de­scribing.

Lucas Perry: Is the point there that there’s a lack of alignment between the utility function and the planning function, given that the planning function imperfectly optimizes the utility function?

Rohin Shah: It’s more like there are just infinitely many possible pairs of planning functions and utility functions that exactly predict human behavior. Even if it were true that humans were expected utility maximizers, which Buck is arguing we’re not, and I agree with him. There is a planning function that says humans are perfectly anti-rational, and if you ask what utility function works with that planner to predict human behavior, well, the literal negative of the true utility function when combined with the anti-rational planner produces the same behavior as the true utility function with the perfect planner. There’s no information that lets you distinguish between these two possibilities.

You have to build it in as an as­sump­tion. I think Buck’s point is that build­ing things in as as­sump­tions is prob­a­bly not go­ing to work.

Buck Sh­legeris: Yeah.

Ro­hin Shah: A point I agree with. In philos­o­phy this is called the is-ought prob­lem, right? What you can train your AI sys­tem on is a bunch of “is” facts and then you have to add in some as­sump­tions in or­der to jump to “ought” facts, which is what the util­ity func­tion is try­ing to do. The util­ity func­tion is try­ing to tell you how you ought to be­have in new situ­a­tions and the point of the is-ought dis­tinc­tion is that you need some bridg­ing as­sump­tions in or­der to get from is to ought.

Buck Sh­legeris: And I guess an im­por­tant part here is your sys­tem will do an amaz­ing job of an­swer­ing “is” ques­tions about what hu­mans would say about “ought” ques­tions. And so I guess maybe you could phrase the sec­ond part as: to get your sys­tem to do things that match hu­man prefer­ences, use the fact that it knows how to make ac­cu­rate “is” state­ments about hu­mans’ ought state­ments?
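
Rohin’s point about the anti-rational planner can be shown in a few lines; this toy construction is mine, not from the episode:

```python
# Unidentifiability of (planner, utility) pairs from behavior alone.
actions = ["a", "b", "c"]
true_utility = {"a": 0.0, "b": 1.0, "c": 2.0}
negated_utility = {a: -v for a, v in true_utility.items()}

rational_planner = lambda u: max(actions, key=lambda a: u[a])
anti_rational_planner = lambda u: min(actions, key=lambda a: u[a])

# The rational planner with the true utility and the anti-rational planner
# with the negated utility choose exactly the same action...
assert rational_planner(true_utility) == anti_rational_planner(negated_utility)

# ...so no amount of behavioral data distinguishes the two decompositions.
# The "ought" content has to come in as an assumption, not from the data.
```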

Lu­cas Perry: It seems like we’re strictly talk­ing about in­fer­ring the hu­man util­ity func­tion or prefer­ences via look­ing at be­hav­ior. What if you also had more ac­cess to the ac­tual struc­ture of the hu­man’s brain?

Ro­hin Shah: This is like the ap­proach that Stu­art Arm­strong likes to talk about. The same things still ap­ply. You still have the is-ought prob­lem where the facts about the brain are “is” facts and how you trans­late that into “ought” facts is go­ing to in­volve some as­sump­tions. Maybe you can break down such as­sump­tions that ev­ery­one would agree with. Maybe it’s like if this par­tic­u­lar neu­ron in a hu­man brain spikes, that’s a good thing and we want more of it and if this other one spikes, that’s a bad thing. We don’t want it. Maybe that as­sump­tion is fine.

Lu­cas Perry: I guess I’m just point­ing out, if you could find the places in the hu­man brain that gen­er­ate the state­ments about Ought ques­tions.

Ro­hin Shah: As Buck said, that lets you pre­dict what hu­mans would say about ought state­ments, which your as­sump­tion could then be, what­ever hu­mans say about ought state­ments, that’s what you ought to do. And that’s still an as­sump­tion. Maybe it’s a very rea­son­able as­sump­tion that we’re happy to put it into our AI sys­tem.

Lu­cas Perry: If we’re not will­ing to ac­cept some hu­mans’ “is” state­ments about “ought” ques­tions then we have to do some meta-eth­i­cal moral polic­ing in our as­sump­tions around get­ting “is” state­ments from “ought” ques­tions.

Ro­hin Shah: Yes, that seems right to me. I don’t know how you would do such a thing, but you would have to do some­thing along those lines.

Buck Sh­legeris: I would ad­di­tion­ally say that I feel pretty great about try­ing to do things which use the fact that we can trust our AI to have good “is” an­swers to “ought” ques­tions, but there’s a bunch of prob­lems with this. I think it’s a good start­ing point but try­ing to use that to do ar­bi­trar­ily com­pli­cated things in the world has a lot of prob­lems. For in­stance, sup­pose I’m try­ing to de­cide whether we should de­sign a city this way or that way. It’s hard to know how to go from the abil­ity to know how hu­mans would an­swer ques­tions about prefer­ences to know­ing what you should do to de­sign the city. And this is for a bunch of rea­sons, one of them is that the hu­man might not be able to figure out from your city build­ing plans what the city’s go­ing to ac­tu­ally be like. And an­other is that the hu­man might give in­con­sis­tent an­swers about what de­sign is good, de­pend­ing on how you phrase the ques­tion, such that if you try to figure out a good city plan by op­ti­miz­ing for the thing that the hu­man is go­ing to be most en­thu­si­as­tic about, then you might end up with a bad city plan. Paul Chris­ti­ano has writ­ten in a lot of de­tail about a lot of this.

Lucas Perry: That also reminds me of what Stuart Armstrong wrote about how the framing of the questions changes the output on the preferences.

Ro­hin Shah: Yep.

Buck Sh­legeris: Sorry, to be clear other peo­ple than Paul Chris­ti­ano have also writ­ten a lot about this stuff, (in­clud­ing Ro­hin). My fa­vorite writ­ing about this stuff is by Paul.

Lucas Perry: Yeah, those do seem problematic, but it would also seem that there would be further “is” statements: if you queried people’s meta-preferences about those things, you would get more “is” statements, but then that just pushes the “ought” assumptions that you need to make further back. We’re getting into very philosophically weedy territory. Do you think that this kind of thing could be pushed to the long reflection as is talked about by William MacAskill and Toby Ord, or how much of this do you actually think needs to be solved in order to have safe and aligned AGI?

Buck Sh­legeris: I think there are kind of two differ­ent ways that you could hope to have good out­comes from AGI. One is: set up a world such that you never needed to make an AGI which can make large scale de­ci­sions about the world. And two is: solve the full al­ign­ment prob­lem.

I’m cur­rently pretty pes­simistic about the sec­ond of those be­ing tech­ni­cally fea­si­ble. And I’m kind of pretty pes­simistic about the first of those be­ing a plan that will work. But in the world where you can have ev­ery­one only ap­ply pow­er­ful and dan­ger­ous AI sys­tems in ways that don’t re­quire an un­der­stand­ing of hu­man val­ues, then you can push all of these prob­lems onto the long re­flec­tion. In wor­lds where you can do ar­bi­trar­ily com­pli­cated things in ways that hu­mans would ap­prove of, you don’t re­ally need to long re­flect this stuff be­cause of the fact that these pow­er­ful AI sys­tems already have the ca­pac­ity of do­ing por­tions of the long re­flec­tion work in­side them­selves as needed. (Quotes about the long re­flec­tion)

Rohin Shah: Yeah, so I think my take, it’s not exactly disagreeing with Buck; it’s more like it comes from a different frame than Buck’s. If you just got AI systems that did the things that humans do now, this does not seem to me to obviously require solving hard problems in philosophy. That’s the lower bound on what you can do before having to do long reflection type stuff. Eventually you do want to do a longer reflection. I feel relatively optimistic about having a technical solution to alignment that allows us to do the long reflection after building AI systems. So the long reflection would include both humans and AI systems thinking hard, reflecting on difficult problems and so on.

Buck Sh­legeris: To be clear, I’m su­per en­thu­si­as­tic about there be­ing a long re­flec­tion or some­thing along those lines.

Lu­cas Perry: I always find it use­ful re­flect­ing on just how hu­man be­ings do many of these things be­cause I think that when think­ing about things in the strict AI al­ign­ment sense, it can seem al­most im­pos­si­ble, but hu­man be­ings are able to do so many of these things with­out solv­ing all of these difficult prob­lems. It seems like in the very least, we’ll be able to get AI sys­tems that very, very ap­prox­i­mately do what is good or what is ap­proved of by hu­man be­ings be­cause we can already do that.

Buck Sh­legeris: That ar­gu­ment doesn’t re­ally make sense to me. It also didn’t make sense when Ro­hin referred to it a minute ago.

Ro­hin Shah: It’s not an ar­gu­ment for we tech­ni­cally know how to do this. It is more an ar­gu­ment for this as at least within the space of pos­si­bil­ities.

Lu­cas Perry: Yeah, I guess that’s how I was also think­ing of it. It is within the space of pos­si­bil­ities. So util­ity func­tions are good be­cause they can be op­ti­mized for, and there seem to be risks with op­ti­miza­tion. Is there any­thing here that you guys would like to say about bet­ter un­der­stand­ing agency? I know this is one of the things that is im­por­tant within the MIRI agenda.

Buck Sh­legeris: I am a bad MIRI em­ployee. I don’t re­ally get that part of the MIRI agenda, and so I’m not go­ing to defend it. I have cer­tainly learned some in­ter­est­ing things from talk­ing to Scott Garrabrant and other MIRI peo­ple who have lots of in­ter­est­ing thoughts about this stuff. I don’t quite see the path from there to good al­ign­ment strate­gies. But I also haven’t spent a su­per long time think­ing about it be­cause I, in gen­eral, don’t try to think about all of the differ­ent AI al­ign­ment things that I could pos­si­bly think about.

Ro­hin Shah: Yeah. I also am not a good per­son to ask about this. Most of my knowl­edge comes from read­ing things and MIRI has stopped writ­ing things very much re­cently, so I don’t know what their ideas are. I, like Buck, don’t re­ally see a good al­ign­ment strat­egy that starts with, first we un­der­stand op­ti­miza­tion and so that’s the main rea­son why I haven’t looked into it very much.

Buck Shlegeris: I think I don’t actually agree with the thing you said there, Rohin. I feel like understanding optimization could plausibly be really nice. Basically the story there is, it’s a real bummer if we have to make really powerful AI systems via searching over large recurrent policies for things that implement optimizers. If it turned out that we could figure out some way of coding up optimizer stuff directly, then this could maybe mean you didn’t need to make mesa-optimizers. And maybe this means that your inner alignment problems go away, which could be really nice. The thing that I was saying I haven’t thought that much about is the relevance of thinking about, for instance, the various weirdnesses that happen when you consider embedded agency or decision theory, and things like that.

Ro­hin Shah: Oh, got it. Yeah. I think I agree that un­der­stand­ing op­ti­miza­tion would be great if we suc­ceeded at it and I’m mostly pes­simistic about us suc­ceed­ing at it, but also there are peo­ple who are op­ti­mistic about it and I don’t know why they’re op­ti­mistic about it.

Lu­cas Perry: Hey it’s post-pod­cast Lu­cas here again. So, I just want to add a lit­tle more de­tail here again on be­half of Ro­hin. Here he feels pes­simistic about us un­der­stand­ing op­ti­miza­tion well enough and in a short enough time pe­riod that we are able to cre­ate pow­er­ful op­ti­miz­ers that we un­der­stand that ri­val the perfor­mance of the AI sys­tems we’re already build­ing and will build in the near fu­ture. Back to the epi­sode.

Buck Shlegeris: The arguments that MIRI has made about this… they think that there are a bunch of questions about what optimization is that are plausibly just not that hard compared to other problems which small groups of people have occasionally solved, like coming up with foundations of mathematics: kind of a big conceptual deal, but also a relatively small group of people. And before we had formalizations of math, I think it might’ve seemed as impossible to progress on as formalizing optimization or coming up with a better picture of that. So maybe that’s my argument for some optimism.

Rohin Shah: Yeah, I think pointing to some examples of great success does not imply much… Like there are probably many similar things that didn’t work out, and we don’t know about them because nobody bothered to tell us about them, because they failed. Seems plausible, maybe.

Lucas Perry: So, exploring more deeply this point of agency: can either, or both, of you give us a little bit of a picture about the relevance or non-relevance of decision theory here to AI alignment? And I think, Buck, you mentioned the trickiness of embedded decision theory.

Ro­hin Shah: If you go back to our tra­di­tional ar­gu­ment for AI risk, it’s ba­si­cally pow­er­ful AI sys­tems will be very strong op­ti­miz­ers. They will pos­si­bly be mis­al­igned with us and this is bad. And in par­tic­u­lar one spe­cific way that you might imag­ine this go­ing wrong is this idea of mesa op­ti­miza­tion where we don’t know how to build op­ti­miz­ers right now. And so what we end up do­ing is ba­si­cally search across a huge num­ber of pro­grams look­ing for ones that do well at op­ti­miza­tion and use that as our AGI sys­tem. And in this world, if you buy that as a model of what’s hap­pen­ing, then you’ll ba­si­cally have al­most no con­trol over what ex­actly that sys­tem is op­ti­miz­ing for. And that seems like a recipe for mis­al­ign­ment. It sure would be bet­ter if we could build the op­ti­mizer di­rectly and know what it is op­ti­miz­ing for. And in or­der to do that, we need to know how to do op­ti­miza­tion well.

Lu­cas Perry: What are the kinds of places that we use mesa op­ti­miz­ers to­day?

Ro­hin Shah: It’s not used very much yet. The field of meta learn­ing is the clos­est ex­am­ple. In the field of meta learn­ing you have a dis­tri­bu­tion over tasks and you use gra­di­ent de­scent or some other AI tech­nique in or­der to find an AI sys­tem that it­self, once given a new task, learns how to perform that task well.

Existing meta learning systems are more like learning how to do all the tasks well, and then when they see a new task they just figure out, ah, it’s this task, and then they roll out the policy that they already learned. But the eventual goal for meta learning is to get something that, online, learns how to do the task without having previously figured out how to do that task.
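
As a rough picture of the structure Rohin is describing, here is a toy meta-learning loop in the style of Reptile: an outer loop searches for an initialization such that a small amount of inner-loop learning solves each new task. All details are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_loop(theta, task_slope, steps=10, lr=0.1):
    """Adapt to one task (predict y = task_slope * x) by gradient descent,
    starting from the meta-learned initialization theta."""
    w = theta
    for _ in range(steps):
        x = rng.normal(size=20)
        grad = np.mean(2 * (w * x - task_slope * x) * x)   # d/dw of squared error
        w -= lr * grad
    return w

theta = 0.0                            # meta-parameters: the learned initialization
for _ in range(200):                   # outer loop over sampled tasks
    slope = rng.uniform(2.0, 4.0)      # a new task drawn from the task distribution
    adapted = inner_loop(theta, slope)
    theta += 0.1 * (adapted - theta)   # Reptile-style update toward the adapted weights

print("meta-learned initialization:", round(theta, 2))   # ends up near the middle of the task range
```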

Lucas Perry: Okay, so Rohin, did what you say cover embedded decision theory?

Ro­hin Shah: No, not re­ally. I think em­bed­ded de­ci­sion the­ory is just, we want to un­der­stand op­ti­miza­tion. Our cur­rent no­tion of op­ti­miza­tion, one way you could for­mal­ize it is to say my AI agent is go­ing to have Bayesian be­lief over all the pos­si­ble ways that the en­vi­ron­ment could be. It’s go­ing to up­date that be­lief over time as it gets ob­ser­va­tions and then it’s go­ing to act op­ti­mally with re­spect to that be­lief, by max­i­miz­ing its ex­pected util­ity. And em­bed­ded de­ci­sion the­ory ba­si­cally calls into ques­tion the idea that there’s a sep­a­ra­tion be­tween the agent and the en­vi­ron­ment. In par­tic­u­lar I, as a hu­man, couldn’t pos­si­bly have a Bayesian be­lief about the en­tire earth be­cause the en­tire Earth con­tains me. I can’t have a Bayesian be­lief over my­self so this means that our ex­ist­ing for­mal­iza­tion of agency is flawed. It can’t cap­ture these things that af­fect real agents. And em­bed­ded de­ci­sion the­ory, em­bed­ded agency, more broadly, is try­ing to deal with this fact and have a new for­mal­iza­tion that works even in these situ­a­tions.

Buck Sh­legeris: I want to give my un­der­stand­ing of the pitch for it. One part is that if you don’t un­der­stand em­bed­ded agency, then if you try to make an AI sys­tem in a hard coded way, like mak­ing a hard coded op­ti­mizer, tra­di­tional phras­ings of what an op­ti­mizer is, are just liter­ally wrong in that, for ex­am­ple, they’re as­sum­ing that you have these mas­sive be­liefs over world states that you can’t re­ally have. And plau­si­bly, it is re­ally bad to try to make sys­tems by hard­cod­ing as­sump­tions that are just clearly false. And so if we want to hard­code agents with par­tic­u­lar prop­er­ties, it would be good if we knew a way of cod­ing the agent that isn’t im­plic­itly mak­ing clearly false as­sump­tions.

And the sec­ond pitch for it is some­thing like when you want to un­der­stand a topic, some­times it’s worth look­ing at some­thing about the topic which you’re definitely wrong about, and try­ing to think about that part un­til you are less con­fused about it. When I’m study­ing physics or some­thing, a thing that I love do­ing is look­ing for the eas­iest ques­tion whose an­swer I don’t know, and then try­ing to just dive in un­til I have satis­fac­to­rily an­swered that ques­tion, hop­ing that the prac­tice that I get about think­ing about physics from an­swer­ing a ques­tion cor­rectly will gen­er­al­ize to much harder ques­tions. I think that’s part of the pitch here. Here is a prob­lem that we would need to an­swer, if we wanted to un­der­stand how su­per­in­tel­li­gent AI sys­tems work, so we should try an­swer­ing it be­cause it seems eas­ier than some of the other prob­lems.

Lu­cas Perry: Okay. I think I feel satis­fied. The next thing here Ro­hin in your AI al­ign­ment 2018-19 re­view is value learn­ing. I feel like we’ve talked a bunch about this already. Is there any­thing here that you want to say or do you want to skip this?

Ro­hin Shah: One thing we didn’t cover is, if you have un­cer­tainty over what you’re sup­posed to op­ti­mize, this turns into an in­ter­ac­tive sort of game be­tween the hu­man and the AI agent, which seems pretty good. A pri­ori you should ex­pect that there’s go­ing to need to be a lot of in­ter­ac­tion be­tween the hu­man and the AI sys­tem in or­der for the AI sys­tem to ac­tu­ally be able to do the things that the hu­man wants it to do. And so hav­ing for­mal­isms and ideas of where this in­ter­ac­tion nat­u­rally falls out seems like a good thing.

Buck Shlegeris: I've said a lot of things about how I am very pessimistic about value learning as a strategy. Nevertheless it seems like it might be really good for there to be people who are researching this, and trying to get as good as we can at improving sample efficiency, so that you can have your AI systems understand your preferences over music with as little human interaction as possible, just in case it turns out to be possible to solve the hard version of value learning. Because a lot of the engineering effort required to make ambitious value learning work will plausibly be in common with the kinds of stuff you have to do to make these more simple specification learning tasks work out. That's a reason for me to be enthusiastic about people researching value learning even if I'm pessimistic about the overall thing working.

Lu­cas Perry: All right, so what is ro­bust­ness and why does it mat­ter?

Rohin Shah: Robustness is one of those words that doesn't have a super clear definition and people use it differently. Robust agents don't fail catastrophically in situations slightly different from the ones they were designed for. One example of a case where we see a failure of robustness currently is in adversarial examples for image classifiers, where it is possible to take an image, make a slight perturbation to it, and then the resulting image is completely misclassified. You take a correctly classified image of a panda, slightly perturb it such that a human can't tell what the difference is, and then it's classified as a gibbon with 99% confidence. Admittedly this was with an older image classifier; I think you need to make the perturbations a bit larger now in order to get them.
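As a concrete illustration of the kind of attack Rohin is describing, here is a rough sketch of the fast gradient sign method, assuming a PyTorch-style image classifier; `model`, `image`, and `label` are hypothetical placeholders rather than anything from the conversation.

```python
# Rough sketch of the fast gradient sign method (Goodfellow et al.) against a
# PyTorch image classifier. `model`, `image`, and `label` are placeholders:
# `image` is a batched float tensor in [0, 1], `label` a tensor of class indices.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.01):
    """Return a slightly perturbed copy of `image` that the model may misclassify."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Nudge every pixel a little in the direction that increases the loss.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```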

Lucas Perry: This is because the relevant information that it uses is very local to infer panda-ness rather than global properties of the panda?

Rohin Shah: It's more like they're high frequency features or imperceptible features. There's a lot of controversy about this, but there is a pretty popular recent paper that I believe, but not everyone believes, that claims that this was because they're picking up on real imperceptible features that do generalize to the test set, that humans can't detect. That's an example of robustness. Recently people have been applying this to reinforcement learning, both by adversarially modifying the observations that agents get and also by training agents that act in the environment adversarially towards the original agent. One paper out of CHAI showed that there's this kick-and-defend environment where you've got two MuJoCo robots. One of them is kicking a soccer ball. The other one's a goalie that's trying to prevent the kicker from successfully shooting a goal. They showed that if you do self play in order to get kickers and defenders, and then you take the kicker, you freeze it, you don't train it anymore, and you retrain a new defender against this kicker.

What is the strategy that this new defender learns? It just sort of falls to the ground and flaps about in a random-looking way, and the kicker gets so confused that it usually fails to even touch the ball. This is sort of an adversarial example for RL agents, showing that they're not very robust either.

There was also a paper out of DeepMind that did the same sort of thing. For their adversarial attack they learned what sorts of mistakes the agent would make early on in training and then just tried to replicate those mistakes once the agent was fully trained, and they found that this helped them uncover a lot of bad behaviors, even at the end of training.

From the perspective of alignment, it's clear that we want robustness. It's not exactly clear what we want robustness to. This robustness to adversarial perturbations is a bit weird as a threat model. If there is an adversary in the environment, they're probably not going to be restricted to small perturbations, and they're probably not going to get white box access to your AI system; even if they did, this doesn't seem to really connect with the story of the AI system adversarially optimizing against humans, which is how we get to the x-risk part, so it's not totally clear.

I think on the in­tent al­ign­ment case, which is the thing that I usu­ally think about, you mostly want to en­sure that what­ever is driv­ing the “mo­ti­va­tion” of the AI sys­tem, you want that to be very ro­bust. You want it to agree with what hu­mans would want in all situ­a­tions or at least all situ­a­tions that are go­ing to come up or some­thing like that. Paul Chris­ti­ano has writ­ten a few blog posts about this that talk about what tech­niques he’s ex­cited about solv­ing that prob­lem, which boil down to in­ter­pretabil­ity, ad­ver­sar­ial train­ing, and im­prov­ing ad­ver­sar­ial train­ing through re­lax­ations of the prob­lem.

Buck Shlegeris: I'm pretty confused about this, and so it's possible what I'm going to say is dumb. When I look at problems with robustness, or problems that Rohin put in this robustness category here, I want to divide it into two parts. One of the parts is things that I think of as capability problems, which I kind of expect the rest of the world will need to solve on its own. For instance, things about safe exploration, how do I get my system to learn to do good things without ever doing really bad things, this just doesn't seem very related to the AI alignment problem to me. And I also feel reasonably optimistic that you can solve it by doing dumb techniques which don't have anything too difficult to them, like you can set up your system so that it has a good model of the world that it got from unsupervised learning somehow, and then it never does anything too dumb. And also I don't really see that kind of robustness problem leading to existential catastrophes. And the other half of robustness is the half that I care about a lot, which in my mind is mostly trying to make sure that you succeeded at inner alignment. That is, that the mesa optimizers you've found through gradient descent have goals that actually match your goals.

This is like ro­bust­ness in the sense that you’re try­ing to guaran­tee that in ev­ery situ­a­tion, your AI sys­tem, as Ro­hin was say­ing, is in­tent al­igned with you. It’s try­ing to do the kind of thing that you want. And I worry that, by de­fault, we’re go­ing to end up with AI sys­tems not in­tent al­igned, so there ex­ist a bunch of situ­a­tions they can be put in such that they do things that are very much not what you’d want, and there­fore they fail at ro­bust­ness. I think this is a re­ally im­por­tant prob­lem, it’s like half of the AI safety prob­lem or more, in my mind, and I’m not very op­ti­mistic about be­ing able to solve it with pro­saic tech­niques.

Ro­hin Shah: That sounds roughly similar to what I was say­ing. Yes.

Buck Sh­legeris: I don’t think we dis­agree about this su­per much ex­cept for the fact that I think you seem to care more about safe ex­plo­ra­tion and similar stuff than I think I do.

Rohin Shah: I think safe exploration's a bad example. I don't know what safe exploration is even trying to solve, but for the other stuff, I agree, I do care about it more. One place where I somewhat disagree with you is, you sort of have this point that all these robustness problems are things that the rest of the world has incentives to figure out, and will probably figure out. That seems true for alignment too; it sure seems like you want your system to be aligned in order to do the things that you actually want. Everyone has an incentive for this to happen. I totally expect people who aren't EAs or rationalists or weird longtermists to be working on AI alignment in the future, and to some extent even now. I think that's one thing.

Buck Sh­legeris: You should say your other thing, but then I want to get back to that point.

Ro­hin Shah: The other thing is I think I agree with you that it’s not clear to me how failures of the ro­bust­ness of things other than mo­ti­va­tion lead to x-risk, but I’m more op­ti­mistic than you are that our solu­tions to those kinds of ro­bust­ness will help with the solu­tions to “mo­ti­va­tion ro­bust­ness” or how to make your mesa op­ti­mizer al­igned.

Buck Shlegeris: Yeah, sorry, I guess I actually do agree with that last point. I am very interested in trying to figure out how to have aligned mesa optimizers, and I think that a reasonable strategy to pursue in order to get aligned mesa optimizers is trying to figure out how to make your image classifiers robust to adversarial examples. I think you probably won't succeed even if you succeed with the image classifiers, but it seems like the image classifiers are still probably where you should start. And I guess if we can't figure out how to make image classifiers robust to adversarial examples in like 10 years, I'm going to be super pessimistic about the harder robustness problem, and that would be great to know.

Ro­hin Shah: For what it’s worth, my take on the ad­ver­sar­ial ex­am­ples of image clas­sifiers is, we’re go­ing to train image clas­sifiers on more data with big­ger nets, it’s just go­ing to mostly go away. Pre­dic­tion. I’m lay­ing my cards on the table.

Buck Sh­legeris: That’s also some­thing like my guess.

Ro­hin Shah: Okay.

Buck Shlegeris: My prediction is: to get image classifiers that are robust to epsilon-ball perturbations or whatever, some combination of larger models, adversarial training, and a couple of other clever things will probably mean that we have robust image classifiers in 5 or 10 years at the latest.
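For reference, the adversarial training Buck is pointing at can be sketched roughly as follows: each training batch is replaced by worst-case perturbations within the epsilon ball before taking a gradient step. This is a simplified single-step variant (real work typically uses stronger multi-step attacks like PGD); `model`, `loader`, `optimizer`, and the reuse of the `fgsm_perturb` sketch above are all illustrative.

```python
# Simplified adversarial training loop: train on perturbed inputs generated by
# the fgsm_perturb sketch above. `model`, `loader`, and `optimizer` are
# placeholders; stronger multi-step attacks (e.g. PGD) are typically used.
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=0.03):
    for images, labels in loader:
        adv_images = fgsm_perturb(model, images, labels, epsilon)  # worst-case-ish inputs
        optimizer.zero_grad()
        loss = F.cross_entropy(model(adv_images), labels)
        loss.backward()
        optimizer.step()
```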

Ro­hin Shah: Cool. And you wanted to re­turn to the other point about the world hav­ing in­cen­tives to do al­ign­ment.

Buck Shlegeris: So I don't quite know how to express this, but I think it's really important, which is going to make this a really fun experience for everyone involved. You know how Airbnb… Or sorry, I guess a better example of this is actually Uber drivers. I give basically every Uber driver a five star rating, even though some Uber drivers are just clearly more pleasant for me than others, and Uber doesn't seem to try very hard to get around these problems, even though I think that if Uber caused there to be a 30% difference in pay between the drivers who I think of as 75th percentile and the drivers I think of as 25th percentile, this would make the service probably noticeably better for me. I guess it seems to me that a lot of the time the world just doesn't try to do kind of complicated things to make systems actually aligned, and it just does hack jobs, and then everyone deals with the fact that everything is unaligned as a result.

To draw this analogy back, I think that we're likely to have the kind of alignment techniques that solve problems that are as simple and obvious as: we should have a way to rate your hosts on Airbnb. But I'm worried that we won't ever get around to solving the problems that are like: but what if your hosts are incentivized to tell you sob stories such that you give them good ratings, even though actually they were worse than some other hosts. And this is never a big enough deal that people are unilaterally individually incentivized to solve the harder version of the alignment problem, and then everyone ends up using these systems that actually aren't aligned in the strong sense, and then we end up in a doomy world. I'm curious if any of that made any sense.

Lucas Perry: Is a simple way to put it that we fall into an inadequate or suboptimal equilibrium, and then there's tragedy of the commons and bad game theory stuff that happens that keeps us locked in, and that the same story could apply to alignment?

Buck Sh­legeris: Yeah, that’s not quite what I mean.

Lu­cas Perry: Okay.

Rohin Shah: I think Buck's point is that Uber or Airbnb could unilaterally, with no coordination needed, make their system better, and this would be an improvement for them and everyone else, and they don't do it. There's nothing about equilibrium here; it's a failure of Uber to do this thing that seems so obviously good.

Buck Sh­legeris: I’m not ac­tu­ally claiming that it’s bet­ter for Uber, I’m just claiming that there is a mis­al­ign­ment there. Plau­si­bly, an Uber exec, if they were listen­ing to this they’d just be like, “LOL, that’s a re­ally stupid idea. Peo­ple would hate it.” And then they would say more com­pli­cated things like “most rid­ers are rel­a­tively price sen­si­tive and so this doesn’t mat­ter.” And plau­si­bly they’re com­pletely right.

Ro­hin Shah: That’s what I was go­ing to say.

Buck Shlegeris: But the thing which feels important to me is something like: a lot of the time it's not worth solving the alignment problems at any given moment, because something else is a bigger problem for how things are going locally. And this can continue being the case for a long time, and then you end up with everyone being locked into this system where they never solved the alignment problems. And it's really hard to make people understand this, and then you get locked into this bad world.

Ro­hin Shah: So if I were to try and put that in the con­text of AI al­ign­ment, I think this is a le­gi­t­i­mate rea­son for be­ing more pes­simistic. And the way that I would make that ar­gu­ment is: it sure seems like we are go­ing to de­cide on what method or path we’re go­ing to use to build AGI. Maybe we’ll do a bunch of re­search and de­cide we’re just go­ing to scale up lan­guage mod­els or some­thing like this. I don’t know. And we will do that be­fore we have any idea of which tech­nique would be eas­iest to al­ign and as a re­sult, we will be forced to try to al­ign this ex­oge­nously cho­sen AGI tech­nique and that would be harder than if we got to de­sign our al­ign­ment tech­niques and our AGI tech­niques si­mul­ta­neously.

Buck Sh­legeris: I’m imag­in­ing some pretty slow take off here, and I don’t imag­ine this as ever hav­ing a phase where we built this AGI and now we need to al­ign it. It’s more like we’re con­tin­u­ously build­ing and de­ploy­ing these sys­tems that are grad­u­ally more and more pow­er­ful, and ev­ery time we want to de­ploy a sys­tem, it has to be do­ing some­thing which is use­ful to some­one. And many of the things which are use­ful, re­quire things that are kind of like al­ign­ment. “I want to make a lot of money from my sys­tem that will give ad­vice,” and if it wants to give good gen­er­al­ist ad­vice over email, it’s go­ing to need to have at least some im­plicit un­der­stand­ing of hu­man prefer­ences. Maybe we just use gi­ant lan­guage mod­els and ev­ery­thing’s just to­tally fine here. A re­ally good lan­guage model isn’t able to give ar­bi­trar­ily good al­igned ad­vice, but you can get ad­vice that sounds re­ally good from a lan­guage model, and I’m wor­ried that the de­fault path is go­ing to in­volve the most pop­u­lar AI ad­vice ser­vices be­ing kind of mis­al­igned, and just never both­er­ing to fix that. Does that make any more sense?

Rohin Shah: Yeah, I think I totally buy that that will happen. But I'm more like: as you get AI systems doing more and more important things in the world, it becomes more and more important that they are really truly aligned, and investment in alignment increases correspondingly.

Buck Sh­legeris: What’s the mechanism by which peo­ple re­al­ize that they need to put more work into al­ign­ment here?

Rohin Shah: I think there are multiple. One is that I expect people are aware, like even in the Uber case, I expect people are aware of the misalignment that exists, but decide that it's not worth their time to fix it. So as a continuation of that, people will be aware of it and then they will decide that they should fix it.

Buck Sh­legeris: If I’m try­ing to sell to city gov­ern­ments this lan­guage model based sys­tem which will give them ad­vice on city plan­ning, it’s not clear to me that at any point the city gov­ern­ments are go­ing to start de­mand­ing bet­ter al­ign­ment fea­tures. Maybe that’s the way that it goes but it doesn’t seem ob­vi­ous that city gov­ern­ments would think to ask that, and --

Ro­hin Shah: I wasn’t imag­in­ing this from the user side. I was imag­in­ing this from the en­g­ineers or de­sign­ers side.

Buck Sh­legeris: Yeah.

Rohin Shah: I think from the user side I would speak more to warning shots. You know, you have your cashier AI system or your waiter AIs, and they're optimizing for tips more than for actually collecting money, and so they offer free meals in order to get more tips. At some point one of these AI systems passes all of the internal checks and makes it out into the world, and only then does the problem arise, and everyone's like, "Oh my God, this is terrible. What the hell are you doing? Make this better."

Buck Shlegeris: There's two mechanisms via which that alignment might be okay. One of them is that researchers might realize that they want to put more effort into alignment and then solve these problems. The other mechanism is that users might demand better alignment because of warning shots. I think that I don't buy that either of these is sufficient. I don't buy that it's sufficient for researchers to decide to do it, because in a competitive world, the researchers who realize this is important, if they try to only make aligned products, are not going to be able to sell them, because their products will be much less good than the unaligned ones. So you have to argue that there is demand for the things which are actually aligned well. But for this to work, your users have to be able to distinguish between things that have good alignment properties and those which don't, and this seems really hard for users to do. And I guess, when I try to imagine analogies, I just don't see many examples of people successfully solving problems like this, like businesses making products with different levels of dangerousness, and then users successfully buying the safe ones.

Ro­hin Shah: I think usu­ally what hap­pens is you get reg­u­la­tion that forces ev­ery­one to be safe. I don’t know if it was reg­u­la­tion, but like air­planes are in­cred­ibly safe. Cars are in­cred­ibly safe.

Buck Shlegeris: Yeah, but in this case what would happen is doing the unsafe thing allows you to make enormous amounts of money, and so the countries which don't put in the regulations are going to be massively advantaged compared to ones which do.

Ro­hin Shah: Why doesn’t that ap­ply for cars and air­planes?

Buck Shlegeris: So to start with, cars in poor countries are a lot less safe. Another thing is that a lot of the effort in making safer cars and airplanes comes from designing them. Once you've done the work of designing it, it's not that much more expensive to put your formally-verified 747 software into more planes, and because of weird features of the fact that there are only like two big plane manufacturers, everyone gets the safer planes.

Lucas Perry: So tying this into robustness: the fundamental concern here is about the incentives to make aligned systems whose safety and alignment are robust in the real world.

Ro­hin Shah: I think that’s ba­si­cally right. I sort of see these in­cen­tives as ex­ist­ing and the world gen­er­ally be­ing rea­son­ably good at deal­ing with high stakes prob­lems.

Buck Sh­legeris: What’s an ex­am­ple of the world be­ing good at deal­ing with a high stakes prob­lem?

Rohin Shah: I feel like biotech seems reasonably well handled, relatively speaking.

Buck Sh­legeris: Like bio-se­cu­rity?

Ro­hin Shah: Yeah.

Buck Sh­legeris: Okay, if the world han­dles AI as well as bio-se­cu­rity, there’s no way we’re okay.

Ro­hin Shah: Really? I’m aware of ways in which we’re not do­ing bio-se­cu­rity well, but there seem to be ways in which we’re do­ing it well too.

Buck Sh­legeris: The nice thing about bio-se­cu­rity is that very few peo­ple are in­cen­tivized to kill ev­ery­one, and this means that it’s okay if you’re slop­pier about your reg­u­la­tions, but my un­der­stand­ing is that lots of reg­u­la­tions are pretty weak.

Rohin Shah: I guess I was more imagining the research community's coordination on this. It's been surprisingly good.

Buck Sh­legeris: I wouldn’t de­scribe it that way.

Ro­hin Shah: It seems like the vast ma­jor­ity of the re­search com­mu­nity is on­board with the right thing and like 1% isn’t. Yeah. Plau­si­bly we need to have reg­u­la­tions for that last 1%.

Buck Sh­legeris: I think that 99% of the syn­thetic biol­ogy re­search com­mu­nity is on board with “it would be bad if ev­ery­one died.” I think that some very small pro­por­tion is on­board with things like “we shouldn’t do re­search if it’s very dan­ger­ous and will make the world a lot worse.” I would say like way less than half of syn­thetic biol­o­gists seem to agree with state­ments like “it’s bad to do re­ally dan­ger­ous re­search.” Or like, “when you’re con­sid­er­ing do­ing re­search, you con­sider differ­en­tial tech­nolog­i­cal de­vel­op­ment.” I think this is just not a thing biol­o­gists think about, from my ex­pe­rience talk­ing to biol­o­gists.

Ro­hin Shah: I’d be in­ter­ested in bet­ting with you on this af­ter­wards.

Buck Sh­legeris: Me too.

Lucas Perry: So it seems like it's going to be difficult to come down to a concrete understanding or agreement here on the incentive structures in the world and whether they lead to the proliferation of unaligned or semi-aligned AI systems versus fully aligned AI systems, and whether that poses a kind of lock-in, right? Would you say that that fairly summarizes your concern, Buck?

Buck Sh­legeris: Yeah. I ex­pect that Ro­hin and I agree mostly on the size of the co­or­di­na­tion prob­lem re­quired, or the costs that would be re­quired by try­ing to do things the safer way. And I think Ro­hin is just a lot more op­ti­mistic about those costs be­ing paid.

Ro­hin Shah: I think I’m op­ti­mistic both about peo­ple’s abil­ity to co­or­di­nate pay­ing those costs and about in­cen­tives point­ing to­wards pay­ing those costs.

Buck Sh­legeris: I think that Ro­hin is right that I dis­agree with him about the sec­ond of those as well.

Lu­cas Perry: Are you in­ter­ested in un­pack­ing this any­more? Are you happy to move on?

Buck Shlegeris: I actually do want to talk about this for two more minutes. I am really surprised by the claim that humans have solved coordination problems as hard as this one. I think the example you gave is humans doing radically nowhere near well enough. What are examples of coordination-problem-type things… There was a bunch of stuff with nuclear weapons, where I feel like humans did badly enough that we definitely wouldn't have been okay in an AI situation. There are a bunch of examples of the US secretly threatening people with nuclear strikes, which I think is an example of some kind of coordination failure. I don't think that the world has successfully coordinated on never threatening first nuclear strikes. If we had successfully coordinated on that, I would consider nuclear weapons to be less of a failure, but as it is, the US has actually, according to Daniel Ellsberg, threatened a bunch of people with first strikes.

Rohin Shah: Yeah, I think I update less on specific scenarios and update quite a lot more on "it just never happened." The sheer amount of coincidence that would be required, given the level of "oh my God, there were close calls multiple times a year for many decades," seems just totally implausible, and it just means that our understanding of what's happening is wrong.

Buck Sh­legeris: Again, also the thing I’m imag­in­ing is this very grad­ual take­off world where peo­ple, ev­ery year, they re­lease their new most pow­er­ful AI sys­tems. And if, in a par­tic­u­lar year, AI Corp de­cided to not re­lease its thing, then AI Corps two and three and four would rise to be­ing one, two and three in to­tal prof­its in­stead of two, three and four. In that kind of a world, I feel a lot more pes­simistic.

Ro­hin Shah: I’m definitely imag­in­ing more of the case where they co­or­di­nate to all not do things. Either by in­ter­na­tional reg­u­la­tion or via the com­pa­nies them­selves co­or­di­nat­ing amongst each other. Even with­out that, it’s plau­si­ble that AI Corp one does this. One ex­am­ple I’d give is, Waymo has just been very slow to de­ploy self driv­ing cars rel­a­tive to all the other self driv­ing car com­pa­nies, and my im­pres­sion is that this is mostly be­cause of safety con­cerns.

Buck Sh­legeris: In­ter­est­ing and slightly per­sua­sive ex­am­ple. I would love to talk through this more at some point. I think this is re­ally im­por­tant and I think I haven’t heard a re­ally good con­ver­sa­tion about this.

Apologies for describing what I think is going wrong inside your mind or something, which is generally a bad way of saying things, but it kind of sounds to me like you're implicitly assuming more concentrated advantage and fewer actors than I think are actually implied by gradual takeoff scenarios.

Ro­hin Shah: I’m usu­ally imag­in­ing some­thing like a 100+ com­pa­nies try­ing to build the next best AI sys­tem, and 10 or 20 of them be­ing clear front run­ners or some­thing.

Buck Sh­legeris: That makes sense. I guess I don’t quite see how the co­or­di­na­tion suc­cesses you were de­scribing arise in that kind of a world. But I am happy to move on.

Lucas Perry: So before we move on from this point, is there anything you would suggest as obvious solutions, should Buck's model of the risks here be correct? It seems like it would demand more centralized institutions, which would help to mitigate some of the lock-in here.

Ro­hin Shah: Yeah. So there’s a lot of work in policy and gov­er­nance about this. Not much of which is pub­lic un­for­tu­nately. But I think the thing to say is that peo­ple are think­ing about it and it does sort of look like try­ing to figure out how to get the world to ac­tu­ally co­or­di­nate on things. But as Buck has pointed out, we have tried to do this be­fore and so there’s prob­a­bly a lot to learn from past cases as well. But I am not an ex­pert on this and don’t re­ally want to talk as though I were one.

Lu­cas Perry: All right. So there’s lots of gov­er­nance and co­or­di­na­tion thought that kind of needs to go into solv­ing many of these co­or­di­na­tion is­sues around de­vel­op­ing benefi­cial AI. So I think with that we can move along now to scal­ing to su­per­hu­man abil­ities. So Ro­hin, what do you have to say about this topic area?

Rohin Shah: I think this is in some sense related to what we were talking about before: you can predict what a human would say, but it's hard to back out the true underlying values beneath that. Here the problem is, suppose you are learning from some sort of human feedback about what you're supposed to be doing. The information contained in that feedback tells you how to do whatever the human can do; it doesn't really tell you how to exceed what the human can do without some additional assumptions.

Now, depending on how the human feedback is structured, this might lead to different things. If the human is demonstrating how to do the task for you, then this would suggest that it would be hard to do the task any better than the human can, but if the human is evaluating how well you did the task, then you can do the task better in a way that the human wouldn't be able to tell was better. Ideally, at some point we would like to have AI systems that can actually do just really powerful, great things that we are unable to understand all the details of, and so we would be able neither to demonstrate nor to evaluate them.

How do we get to those sorts of AI systems? The main proposals in this bucket are iterated amplification, debate, and recursive reward modeling. In iterated amplification, we start with an initial policy, and we alternate between amplification and distillation, which increase capabilities and efficiency respectively. This can encode a bunch of different algorithms, but usually amplification is done by decomposing questions into easier subquestions and then using the agent to answer those subquestions, while distillation can be done using supervised learning or reinforcement learning: you get these answers that are created by these amplified systems that take a long time to run, and you just train a neural net to very quickly predict the answers without having to do this whole big decomposition thing. In debate, we train an agent through self play in a zero sum game where the agent's goal is to win a question-answering debate as evaluated by a human judge. The hope here is that since both sides of the debate can point out flaws in the other side's arguments (they're both very powerful AI systems), such a setup can use a human judge to train far more capable agents while still incentivizing the agents to provide honest, true information. With recursive reward modeling, you can think of it as an instantiation of the general alternate-between-amplification-and-distillation framework, but it works sort of bottom up instead of top down. You'll start by building AI systems that can help you evaluate simple, easy tasks, then use those AI systems to help you evaluate more complex tasks, and you keep iterating this process until eventually you have AI systems that help you with very complex tasks, like how to design the city. And this lets you then train an AI agent that can design the city effectively, even though you don't totally understand why it's doing the things it's doing or why they're even good.
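A schematic sketch of the amplification/distillation loop Rohin describes might look like the following; `decompose`, `combine`, `train_supervised`, and the initial `agent` are hypothetical stand-ins for what would really be humans assisted by machine learning models.

```python
# Schematic of the iterated amplification loop. `decompose`, `combine`,
# `train_supervised`, and the initial `agent` are hypothetical stand-ins for
# what would really be humans assisted by machine learning models.

def amplify(agent, question):
    """Answer a question slowly but more capably, by decomposition."""
    subquestions = decompose(question)
    subanswers = [agent(q) for q in subquestions]
    return combine(question, subanswers)

def iterated_amplification(agent, questions, num_rounds):
    for _ in range(num_rounds):
        # Amplification: use the current agent on easier subquestions.
        targets = [(q, amplify(agent, q)) for q in questions]
        # Distillation: train a fast model to predict the amplified answers.
        agent = train_supervised(targets)
    return agent
```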

Lu­cas Perry: Do ei­ther of you guys have any high level thoughts on any of these ap­proaches to scal­ing to su­per­hu­man abil­ities?

Buck Sh­legeris: I have some.

Lu­cas Perry: Go for it.

Buck Sh­legeris: So to start with, I think it’s worth not­ing that an­other ap­proach would be am­bi­tious value learn­ing, in the sense that I would phrase these not as ap­proaches for scal­ing to su­per­hu­man abil­ities, but they’re like ap­proaches for scal­ing to su­per­hu­man abil­ities while only do­ing tasks that re­late to the ac­tual be­hav­ior of hu­mans rather than try­ing to back out their val­ues ex­plic­itly. Does that match your thing Ro­hin?

Ro­hin Shah: Yeah, I agree. I of­ten phrase that as with am­bi­tious value learn­ing, there’s not a clear ground truth to be fo­cus­ing on, whereas with all three of these meth­ods, the ground truth is what a hu­man would do if they got a very, very long time to think or at least that is what they’re try­ing to ap­prox­i­mate. It’s a lit­tle tricky to see why ex­actly they’re ap­prox­i­mat­ing that, but there are some good posts about this. The key differ­ence be­tween these tech­niques and am­bi­tious value learn­ing is that there is in some sense a ground truth that you are try­ing to ap­prox­i­mate.

Buck Shlegeris: I think these are all kind of exciting ideas. I think they're all kind of better ideas than I expected to exist for this problem a few years ago. Which probably means we should update against my ability to correctly judge how hard AI safety problems are, which is great news, in as much as I think that a lot of these problems are really hard. Nevertheless, I don't feel super optimistic that any of them are actually going to work. One thing which isn't in the elevator pitch for IDA, which is iterated distillation and amplification (and debate), is that you get to hire the humans who are going to be providing the feedback, or the humans whose answers AI systems are going to be trained with. And this is actually really great. Because for instance, you could have this program where you hire a bunch of people and you put them through your one-month-long training-an-AGI course, and then you only take the top 50% of them. I feel a lot more optimistic about these proposals given you're allowed to think really hard about how to set it up such that the humans have the easiest time possible. And this is one reason why I'm optimistic about people doing research in factored cognition and stuff, which I'm sure Rohin's going to explain in a bit.

One comment about recursive reward modeling: it seems like it has a lot in common with IDA. The main downside that it seems to have to me is that the human is in charge of figuring out how to decompose the task into evaluations at a variety of levels, whereas with IDA, your system itself is able to naturally decompose the task into a variety of levels, and for this reason I feel a bit more optimistic about IDA.

Ro­hin Shah: With re­cur­sive re­ward mod­el­ing, one agent that you can train is just an agent that’s good at do­ing de­com­po­si­tions. That is a thing you can do with it. It’s a thing that the peo­ple at Deep­Mind are think­ing about.

Buck Sh­legeris: Yep, that’s a re­ally good point.

Ro­hin Shah: I also strongly like the fact that you can train your hu­mans to be good at pro­vid­ing feed­back. This is also true about speci­fi­ca­tion learn­ing. It’s less clear if it’s true about am­bi­tious value learn­ing. No one’s re­ally pro­posed how you could do am­bi­tious value learn­ing re­ally. Maybe ar­guably Stu­art Rus­sell’s book is kind of a pro­posal, but it doesn’t have that many de­tails.

Buck Sh­legeris: And, for ex­am­ple, it doesn’t ad­dress any of my con­cerns in ways that I find per­sua­sive.

Ro­hin Shah: Right. But for speci­fi­ca­tion learn­ing also you definitely want to train the hu­mans who are go­ing to be pro­vid­ing feed­back to the AI sys­tem. That is an im­por­tant part of why you should ex­pect this to work.

Buck Shlegeris: I often give talks where I try to give an introduction to IDA and debate as a proposal for AI alignment. I'm giving these talks to people with computer science backgrounds, and they're almost always incredibly skeptical that it's actually possible to decompose thought in this kind of a way. And with debate, they're very skeptical that truth wins, or that the Nash equilibrium is accuracy. For this reason I'm super enthusiastic about research into the factored cognition hypothesis of the type that Ought is doing some of.

I’m kind of in­ter­ested in your over­all take for how likely it is that the fac­tored cog­ni­tion hy­poth­e­sis holds and that it’s ac­tu­ally pos­si­ble to do any of this stuff, Ro­hin. You could also ex­plain what that is.

Rohin Shah: I'll do that. So basically iterated amplification, debate, and recursive reward modeling all hinge on this idea of being able to decompose questions. Maybe it's not so obvious why that's true for debate, but it is. Go listen to the podcast about debate if you want to get more details on that.

This hypothesis is basically that for any task we care about, it is possible to decompose it into a bunch of subtasks that are all easier to do, such that if you're able to do the subtasks, then you can do the overall top-level task. In particular you can iterate this downward, building a tree of smaller and smaller tasks until you get to the level of tasks that a human could do in a day, or, if you're pushing it very far, maybe tasks that a human can do in a couple of minutes. Whether or not you can actually decompose the task "be an effective CEO" into a bunch of subtasks that eventually bottom out into things humans can do in a few minutes is totally unclear. Some people are optimistic, some people are pessimistic. It's called the factored cognition hypothesis, and Ought is an organization that's studying it.

It sounds very controversial at first, and I, like many other people, had the intuitive reaction of "oh my God, this is never going to work and it's not true." The thing that actually makes me optimistic about it is that you don't have to do what you might call a direct decomposition. You can do things like: if your task is to be an effective CEO, your first subquestion could be "what are the important things to think about when being a CEO?" or something like this, as opposed to the decompositions I would usually think of: first I need to deal with hiring, maybe I need to understand HR, maybe I need to understand all of the metrics that the company is optimizing. Those are very object-level concerns, but the decompositions are totally allowed to also be meta level, where you'll spin off a bunch of computation that is just trying to answer the meta-level question of how you should best think about this question at all.

Another im­por­tant rea­son for op­ti­mism is that based on the struc­ture of iter­ated am­plifi­ca­tion, de­bate and re­cur­sive re­ward mod­el­ing, this tree can be gi­gan­tic. It can be ex­po­nen­tially large. Some­thing that we couldn’t run even if we had all of the hu­mans on Earth col­lab­o­rat­ing to do this. That’s okay. Given how the train­ing pro­cess is struc­tured, con­sid­er­ing the fact that you can do the equiv­a­lent of mil­len­nia of per­son years of effort in this de­com­posed tree, I think that also gives me more of a, ‘okay, maybe this is pos­si­ble’ and that’s also why you’re able to do all of this meta level think­ing be­cause you have a com­pu­ta­tional bud­get for it. When you take all of those to­gether, I sort of come up with “seems pos­si­ble. I don’t re­ally know.”
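The recursive decomposition at the heart of the factored cognition hypothesis can be written down as a toy sketch like this; `fits_budget`, `solve_directly`, `decompose`, and `combine` are hypothetical placeholders for what a human (or a model trained to imitate one) would do, and the resulting tree can be exponentially large, which is why it is only ever meant to be run implicitly, via distillation.

```python
# Toy rendering of the factored cognition hypothesis: recursively break a task
# into easier subtasks until each fits a small budget, then combine the results.
# `fits_budget`, `solve_directly`, `decompose`, and `combine` are hypothetical
# stand-ins for what a human (or a model trained to imitate one) would do.

def factored_solve(task, depth=0, max_depth=20):
    if fits_budget(task) or depth >= max_depth:
        return solve_directly(task)  # e.g. something a person can do in a few minutes
    # Subtasks may be object level ("deal with hiring") or meta level
    # ("how should I think about this task at all?").
    subtasks = decompose(task)
    subresults = [factored_solve(t, depth + 1, max_depth) for t in subtasks]
    return combine(task, subresults)
```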

Buck Sh­legeris: I think I’m cur­rently at 30-to-50% on the fac­tored cog­ni­tion thing ba­si­cally work­ing out. Which isn’t noth­ing.

Ro­hin Shah: Yeah, that seems like a perfectly rea­son­able thing. I think I could imag­ine putting a day of thought into it and com­ing up with num­bers any­where be­tween 20 and 80.

Buck Sh­legeris: For what it’s worth, in con­ver­sa­tion at some point in the last few years, Paul Chris­ti­ano gave num­bers that were not wildly more op­ti­mistic than me. I don’t think that the peo­ple who are work­ing on this think it’s ob­vi­ously fine. And it would be great if this stuff works, so I’m re­ally in fa­vor of peo­ple look­ing into it.

Rohin Shah: Yeah, I should mention another key intuition against it. We have all these examples of human geniuses like Ramanujan, who were posed very difficult math problems and just immediately got the answer, and then you ask them how they did it and they say, well, I asked myself what the answer should be, and I decided the answer should be a continued fraction, and then I asked myself which continued fraction, and then I got the answer. And you're like, that does not sound very decomposable. It seems like you need these magic flashes of intuition. Those would be the hard cases for factored cognition. It still seems possible that you could do it, both by this exponential trying of a bunch of possibilities, and also by being able to discover intuitions that work in practice, believing them because they work in practice, and then applying them to the problem at hand. You could imagine that with enough computation you'd be able to discover such intuitions.

Buck Shlegeris: You can't answer a math problem by searching exponentially much through the search tree. The only exponential power you get from IDA is that it lets you specify the output of your cognitive process in such a way that it's going to match some exponentially sized human process. As long as that exponentially sized human process was only exponentially sized because it's really inefficient, but is fundamentally not an exponentially sized problem, then your machine learning should be able to speed it up a bunch. But the thing where you search over search strategies is not valid. If that's all you can do, that's not good enough.

Rohin Shah: Searching over search strategies, I agree you can't do. But if you have an exponential search that could be implemented by humans, then we know by hypothesis that, if you can solve it with a flash of intuition, there is in fact some more efficient way to do it. Whether or not the distillation steps will actually be enough to get to the point where you can do those flashes of intuition, that's an open question.

Buck Shlegeris: This is one of my favorite areas of AI safety research, and I would love for there to be more of it. Something I have been floating for a little while is that I kind of wish there was another Ought. It just seems like it would be so good if we had definitive information about the factored cognition hypothesis. And it also seems like the kind of thing which is potentially parallelizable. I feel like I know a lot of people who love talking about how thinking works; a lot of rationalists are really into this. I would just be super excited for some of them to form teams of four and go off on their own and build an Ought competitor. I feel like this is the kind of thing where, plausibly, a bunch of enthusiastic people could make progress on their own.

Ro­hin Shah: Yeah, I agree with that. Definitely seems like one of the higher value things but I might be more ex­cited about uni­ver­sal­ity.

Lu­cas Perry: All right, well let’s get started with uni­ver­sal­ity then. What is uni­ver­sal­ity and why are you op­ti­mistic about it?

Rohin Shah: So universality is hard to explain well in a single sentence. For whatever supervisor is training our agent, you want that supervisor to "know everything the agent knows." In particular, if the agent comes up with some deceptive strategy to look like it's achieving the goal when actually it isn't, the supervisor should know that it was using this deceptive strategy for the reason of trying to trick the supervisor, and so the supervisor can then penalize it. The classic example of why this is important and hard, also due to Paul Christiano, is plagiarism. Suppose you are training an AI system to produce novel works of literature, and as part of its training data, the AI system gets to read this library of a million books.

It's possible that this AI system decides, hey, the best way I can make a book that seems like a great novel is to just take these five books, take plot points and passages from each of them, and put them together, and then this new book will look totally novel and will be very good because I used all of the best Shakespearean writing or whatever. If your supervisor doesn't know that the agent has done this, the only way the supervisor can really check is to go read the entire million books, even if the agent only read 10 books, and so the supervision becomes way more costly than running the agent, which is not a great state to be in. What you really want is that if the agent does this, the supervisor is able to say: I see that you just copied this stuff over from these other books in order to trick me into thinking that you had written something novel that was good.

That's bad. I'm penalizing you. Stop doing that in the future. Now, this sort of property is very nice in the abstract, but who knows whether or not we can actually build it in practice. There's some reason for optimism that I don't think I can adequately convey; I wrote a newsletter summarizing some of it some time ago, and reading through the posts I became more optimistic that it was an achievable property than when I first heard what the property was. The reason I'm optimistic about it is that it just sort of seems to capture the thing that we actually care about. It's not everything; it doesn't solve the robustness problem. Universality only tells you about what the agent's currently doing; you know all the facts about that. Whereas for robustness you want to say that even in these hypothetical situations that the agent hasn't encountered yet and doesn't know stuff about, even when it encounters those situations, it's going to stay aligned with you. So universality doesn't get you all the way there, but it definitely feels like it's getting you quite a bit.

Buck Sh­legeris: That’s re­ally in­ter­est­ing to hear you phrase it that way. I guess I would have thought of uni­ver­sal­ity as a sub­set of ro­bust­ness. I’m cu­ri­ous what you think of that first.

Ro­hin Shah: I definitely think you could use uni­ver­sal­ity to achieve a sub­set of ro­bust­ness. Maybe I would say uni­ver­sal­ity is a sub­set of in­ter­pretabil­ity.

Buck Sh­legeris: Yeah, and I care about in­ter­pretabil­ity as a sub­set of ro­bust­ness ba­si­cally, or as a sub­set of in­ner al­ign­ment, which is pretty close to ro­bust­ness in my mind. The other thing I would say is you were say­ing there that one differ­ence be­tween uni­ver­sal­ity and ro­bust­ness is that uni­ver­sal­ity only tells you why the agent did the thing it cur­rently did, and this doesn’t suffice to tell us about the situ­a­tions that the agent isn’t cur­rently in. One re­ally nice thing though is that if the agent is only act­ing a par­tic­u­lar way be­cause it wants you to trust it, that’s a fact about its cur­rent be­hav­ior that you will know, and so if you have the uni­ver­sal­ity prop­erty, your over­seer just knows your agent is try­ing to de­ceive it. Which seems like it would be in­cred­ibly great and would re­solve like half of my prob­lem with safety if you had it.

Ro­hin Shah: Yeah, that seems right. The case that uni­ver­sal­ity doesn’t cover is when your AI sys­tem is ini­tially not de­cep­tive, but then at some point in the fu­ture it’s like, ‘Oh my God, now it’s pos­si­ble to go and build Dyson spheres or some­thing, but wait, in this situ­a­tion prob­a­bly I should be do­ing this other thing and hu­mans won’t like that. Now I bet­ter de­ceive hu­mans’. The tran­si­tion into de­cep­tion would have to be a sur­prise in some sense even to the AI sys­tem.

Buck Shlegeris: Yeah, I guess I'm just not worried about that. Suppose I have this system which is as smart as a reasonably smart human, or 10 reasonably smart humans, but it's not as smart as the whole world. What if I can just ask it for its best sense of how aligned it is, and if I can trust its answer? I don't know man, I'm pretty okay with systems that think they're aligned answering that question honestly.

Ro­hin Shah: I think I some­what agree. I like this re­ver­sal where I’m the pes­simistic one.

Buck Sh­legeris: Yeah me too. I’m like, “look, sys­tem, I want you to think as hard as you can to come up with the best ar­gu­ments you can come up with for why you are mis­al­igned, and the prob­lems with you.” And if I just ac­tu­ally trust the sys­tem to get this right, then the bad out­comes I get here are just pure ac­ci­dents. I just had this ter­rible ini­tial­iza­tion of my neu­ral net pa­ram­e­ters, such that I had this sys­tem that hon­estly be­lieved that it was go­ing to be al­igned. And then as it got trained more, this sud­denly changed and I couldn’t do any­thing about it. I don’t quite see the story for how this goes su­per wrong. It seems a lot less bad than the de­fault situ­a­tion.

Ro­hin Shah: Yeah. I think the story I would tell is some­thing like, well, if you look at hu­mans, they’re pretty wrong about what their prefer­ences will be in the fu­ture. For ex­am­ple, there’s this trope of how teenagers fall in love and then fall out of love, but when they’re in love, they swear undy­ing oaths to each other or some­thing. To the ex­tent that is true, that seems like the sort of failure that could lead to x-risk if it also hap­pened with AI sys­tems.

Buck Sh­legeris: I feel pretty op­ti­mistic about all the gar­den-va­ri­ety ap­proaches to solv­ing this. Teenagers were not se­lected very hard on ac­cu­racy of their undy­ing oaths. And if you in­stead had ac­cu­racy of self-model as a key fea­ture you were se­lect­ing for in your AI sys­tem, plau­si­bly you’ll just be way more okay.

Ro­hin Shah: Yeah. Maybe peo­ple could co­or­di­nate well on this. I feel less good about peo­ple co­or­di­nat­ing on this sort of prob­lem.

Buck Sh­legeris: For what it’s worth, I think there are co­or­di­na­tion prob­lems here and I feel like my pre­vi­ous ar­gu­ment about why co­or­di­na­tion is hard and won’t hap­pen by de­fault also prob­a­bly ap­plies to us not be­ing okay. I’m not sure how this all plays out. I’d have to think about it more.

Ro­hin Shah: Yeah. I think it’s more like this is a sub­tle and non-ob­vi­ous prob­lem, which by hy­poth­e­sis doesn’t hap­pen in the sys­tems you ac­tu­ally have and only hap­pens later and those are the sorts of prob­lems I’m like, Ooh, not sure if we can deal with those ones, but I agree that there’s a good chance that there’s just not a prob­lem at all in the world where we already have uni­ver­sal­ity and checked all the ob­vi­ous stuff.

Buck Sh­legeris: Yeah. I would like to say uni­ver­sal­ity is one of my other fa­vorite ar­eas of AI al­ign­ment re­search, in terms of how happy I’d be if it worked out re­ally well.

Lu­cas Perry: All right, so let’s see if we can slightly pick up the pace here. Mov­ing for­ward and start­ing with in­ter­pretabil­ity.

Ro­hin Shah: Yeah, so I mean I think we’ve ba­si­cally dis­cussed in­ter­pretabil­ity already. Univer­sal­ity is a spe­cific kind of in­ter­pretabil­ity, but the case for in­ter­pretabil­ity is just like, sure seems like it would be good if you could un­der­stand what your AI sys­tems are do­ing. You could then no­tice when they’re not al­igned, and fix that some­how. It’s a pretty clear cut case for a thing that would be good if we achieved it and it’s still pretty un­cer­tain how likely we are to be able to achieve it.

Lu­cas Perry: All right, so let’s keep it mov­ing and let’s hit im­pact reg­u­lariza­tion now.

Ro­hin Shah: Yeah, im­pact reg­u­lariza­tion in par­tic­u­lar is one of the ideas that are not try­ing to al­ign the AI sys­tem but are in­stead try­ing to say, well, what­ever AI sys­tem we build, let’s make sure it doesn’t cause a catas­tro­phe. It doesn’t lead to ex­tinc­tion or ex­is­ten­tial risk. What it hopes to do is say, all right, AI sys­tem, do what­ever it is you wanted to do. I don’t care about that. Just make sure that you don’t have a huge im­pact upon the world.

Whatever you do, keep your impact not too high. There's been a lot of work on this in recent years: there's been relative reachability, attainable utility preservation, and I think in general the sense is like, wow, it's gone quite a bit further than people expected it to go. It definitely does prevent you from doing very, very powerful things; if you wanted to stop all competing AI projects from ever being able to build AGI, that doesn't seem like the sort of thing you can do with an impact-regularized AI system. But it seems plausible that you could prevent convergent instrumental subgoals using impact regularization. For AI systems that are trying to steal resources and power from humans, you could imagine saying: hey, don't have that level of impact; you can still have the level of impact of, say, running a company or something like that.
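One common way the attainable utility preservation idea Rohin mentions gets written down is roughly the following: the agent's effective reward is its task reward minus a penalty for how much each action changes its attainable utility on a set of auxiliary reward functions, compared with doing nothing. This is a hedged sketch; the auxiliary reward functions and the attainable-utility estimator `q_aux` are hypothetical placeholders, not the exact formulation from any particular paper.

```python
# Rough sketch of an attainable utility preservation style penalty: task reward
# minus a penalty for how much the action changes attainable utility on a set
# of auxiliary reward functions, compared with a no-op. `q_aux(state, action,
# reward_fn)` and the auxiliary reward functions are hypothetical.

def aup_reward(state, action, noop, task_reward, aux_reward_fns, q_aux, scale=0.1):
    penalty = sum(abs(q_aux(state, action, r) - q_aux(state, noop, r))
                  for r in aux_reward_fns)
    return task_reward(state, action) - scale * penalty / len(aux_reward_fns)
```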

Buck Shlegeris: My take on all this is that I'm pretty pessimistic about all of it working. I think that impact regularization or whatever is a non-optimal point on the capabilities/alignment trade-off or something, in terms of how much safety you're getting for how much capability you're sacrificing. My basic problem here is basically analogous to my problem with value learning, where I think we're trying to take these extremely, essentially fuzzy concepts and then factor our agent through these fuzzy concepts like impact. Basically the thing that I imagine happening is: any impact regularization strategy you try to employ, if your AI is usable, will end up not helping with its alignment. For any definition of impact you come up with, it'll end up doing something which gets around that, or it'll make your AI system completely useless, is my basic guess as to what happens.

Ro­hin Shah: Yeah, so I think again in this set­ting, if you for­mal­ize it and then say, con­sider the op­ti­mal agent. Yeah, that can to­tally get around your im­pact penalty, but in prac­tice it sure seems like, what you want to do is say this con­ver­gent in­stru­men­tal sub­goal stuff, don’t do any of that. Con­tinue to do things that are nor­mal in reg­u­lar life. And those seem like pretty dis­tinct cat­e­gories. Such that I would not be shocked if we could ac­tu­ally dis­t­in­guish be­tween the two.

Buck Shlegeris: It sounds like the main benefit you're going for is trying to make your AI system not do insane, convergent, instrumental-subgoal-style stuff. So another approach I can imagine taking here would be some kind of value learning or something, where you're asking humans for feedback on whether plans are insanely convergent-instrumental-subgoal-style, and just not doing the things which the humans rate as sufficiently sketchy when asked to rate how sketchy the plans are. That seems like about as good a plan. I'm curious what you think.

Rohin Shah: The idea of power as your attainable utility across a wide variety of utility functions seems like a pretty good formalization to me. I think in the worlds where I actually buy a formalization, I tend to expect the formalization to work better. I do think the formalization is not perfect. Most notably, with the current formalization of power, your power never changes if you have extremely good beliefs: you're just like, I always have the same power, because I'm always able to do the same things and I never get surprised. So maybe I agree with you, because I think the current formalization is not good enough. Yeah, I think I agree with you, but I could see it going either way.

Buck Sh­legeris: I could be to­tally wrong about this, and cor­rect me if I’m wrong, my sense is that you have to be able to back out the agent’s util­ity func­tion or its mod­els of the world. Which seems like it’s as­sum­ing a par­tic­u­lar path for AI de­vel­op­ment which doesn’t seem to me par­tic­u­larly likely.

Ro­hin Shah: I definitely agree with that for all the cur­rent meth­ods too.

Buck Sh­legeris: So it’s like: as­sume that we have already perfectly solved our prob­lems with uni­ver­sal­ity and ro­bust­ness and trans­parency and what­ever else. I feel like you kind of have to have solved all of those prob­lems be­fore you can do this, and then you don’t need it or some­thing.

Ro­hin Shah: I don’t think I agree with that. I definitely agree that the cur­rent al­gorithms that peo­ple have writ­ten as­sume that you can just make a change to the AI’s util­ity func­tion. I don’t think that’s what even their pro­po­nents would sug­gest as the ac­tual plan.

Buck Sh­legeris: What is the ac­tual plan?

Ro­hin Shah: I don’t ac­tu­ally know what their ac­tual plan would be, but one plan I could imag­ine is figure out what ex­actly the con­cep­tual things we have to do with im­pact mea­sure­ment are, and then what­ever method we have for build­ing AGI, prob­a­bly there’s go­ing to be some part which is spec­ify the goal and then in the spec­ify goal part, in­stead of just say­ing pur­sue X, we want to say pur­sue X with­out chang­ing your abil­ity to pur­sue Y, and Z and W, and P, and Q.

Buck Sh­legeris: I think that that does not sound like a good plan. I don’t think that we should ex­pect our AI sys­tems to be struc­tured that way in the fu­ture.

Ro­hin Shah: Plau­si­bly we have to do this with nat­u­ral lan­guage or some­thing.

Buck Shlegeris: It seems very likely to me that the thing you do is reinforcement learning where, at the start of the episode, you get a sentence of English which is telling you what your goal is, and then blah, blah, blah, blah, blah, and this seems like a pretty reasonable strategy for making powerful and sort of aligned AI. Aligned enough to be usable for things that aren't very hard. But you just fundamentally don't have access to the internal representations that the AI is using for its beliefs, and stuff like that. And that seems like a really big problem.

Rohin Shah: I definitely see this as more of an outer alignment thing, or an easier-to-specify outer alignment type thing than, say, IDA is.

Buck Sh­legeris: Okay, I guess that makes sense. So we’re just like as­sum­ing we’ve solved all the in­ner al­ign­ment prob­lems?

Rohin Shah: In the story so far, yeah. I think all of the researchers who actually work on this haven't thought much about inner alignment.

Buck Sh­legeris: My over­all sum­mary is that I re­ally don’t like this plan. I feel like it’s not ro­bust to scale. As you were say­ing Ro­hin, if your sys­tem gets more and more ac­cu­rate be­liefs, stuff breaks. It just feels like the kind of thing that doesn’t work.

Ro­hin Shah: I mean, it’s definitely not con­cep­tu­ally neat and el­e­gant in the sense of it’s not at­tack­ing the un­der­ly­ing prob­lem. And in a prob­lem set­ting where you ex­pect ad­ver­sar­ial op­ti­miza­tion type dy­nam­ics, con­cep­tual el­e­gance ac­tu­ally does count for quite a lot in whether or not you be­lieve your solu­tion will work.

Buck Shlegeris: I feel like it's trying to add edge detectors to your image classifiers to make them more adversarially robust or something, which is backwards.

Ro­hin Shah: Yeah, I think I agree with that gen­eral per­spec­tive. I don’t ac­tu­ally know if I’m more op­ti­mistic than you. Maybe I just don’t say… Maybe we’d have the same un­cer­tainty dis­tri­bu­tions and you just say yours more strongly or some­thing.

Lu­cas Perry: All right, so then let’s just move a lit­tle quickly through the next three, which are causal mod­el­ing, or­a­cles, and de­ci­sion the­ory.

Rohin Shah: Yeah, I mean, well, decision theory: MIRI did some work on it. I am not the person to ask about it, so I'm going to skip that one. Even if you look at the long version, I'm just like, here are some posts, good luck. So, causal modeling: I don't fully understand what the overall story is here, but the actual work that's been published is basically that we can take potential plans or training processes for AI systems, write down causal models that tell us how the various pieces of the training system interact with each other, and then, using algorithms developed for causal models, tell when an AI system would have an incentive to either observe or intervene on an underlying variable.
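
As a toy illustration of the kind of check Rohin is describing (a deliberately crude sketch that uses graph reachability as a stand-in for the published incentive criteria; the variable names and setup are made up):

```python
import networkx as nx

# A toy causal model of a training setup: the agent's decision and other
# variables feed into the reward the agent eventually receives.
training_setup = nx.DiGraph([
    ("user_preferences", "feedback"),
    ("agent_decision", "feedback"),
    ("agent_decision", "user_preferences"),  # the agent can influence the user
    ("feedback", "reward"),
])

def has_intervention_incentive(graph, variable, utility_node="reward"):
    """Very rough proxy: the agent could benefit from influencing `variable`
    if that variable sits causally upstream of its utility node."""
    return variable in nx.ancestors(graph, utility_node)

print(has_intervention_incentive(training_setup, "user_preferences"))  # True
```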

One thing that came out of this was that you can build a model-based reinforcement learner that doesn't have any incentive to wirehead, as long as when it makes its plans, the plans are evaluated by the current reward function, as opposed to whatever future reward function it would have. And that was explained using this framework of causal modeling.
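
A minimal sketch of that wireheading-avoidance idea (illustrative only; the `model.simulate` method and the reward functions here are hypothetical stand-ins):

```python
def evaluate_plan(plan, state, model, current_reward_fn):
    """Score an imagined trajectory using the agent's *current* reward function,
    rather than whatever (possibly tampered-with) reward function it would have
    by the end of the plan; this removes the incentive to wirehead."""
    total = 0.0
    for action in plan:
        next_state = model.simulate(state, action)
        total += current_reward_fn(state, action, next_state)
        state = next_state
    return total

def choose_plan(candidate_plans, state, model, current_reward_fn):
    # Pick the plan that looks best according to the current reward function.
    return max(candidate_plans,
               key=lambda plan: evaluate_plan(plan, state, model, current_reward_fn))
```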

Oracles are basically the idea that we can just train an AI system to answer questions: give it a question and it tries to figure out the best answer it can to that question, prioritizing accuracy. One worry that people have recently been talking about is that the predictions the Oracle makes then affect the world, which can affect whether or not the prediction was correct. Like, maybe if I predict that I will go to bed at 11, then I'm more likely to actually go to bed at 11, because I want my prediction to come true or something. So the Oracles can still "choose" between different self-confirming predictions, and that gives them a source of agency. One way that people want to avoid this is using what are called counterfactual Oracles, where you set up the training such that the Oracles are basically making predictions under the assumption that their predictions are not going to influence the future.
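
A rough sketch of the counterfactual-Oracle training setup Rohin describes (illustrative only; the `oracle` and `run_world` interfaces and the erasure probability are assumptions made for the example):

```python
import random

def counterfactual_oracle_episode(oracle, question, run_world,
                                  erasure_probability=0.1):
    """Only train the oracle on 'erasure' episodes, where its prediction is
    never shown to anyone, so it learns to predict what happens in worlds
    its prediction cannot influence."""
    prediction = oracle.predict(question)
    erased = random.random() < erasure_probability
    # If erased, the world runs without anyone ever seeing the prediction.
    outcome = run_world(None if erased else prediction)
    if erased:
        # The training signal only ever comes from episodes where the
        # prediction was hidden, so self-confirming predictions earn nothing.
        oracle.update(question, prediction, outcome)
    return prediction
```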

Lucas Perry: Yeah, okay. Oracles seem like they just won't happen. There'll be incentives to make things other than Oracles, and Oracles would even be able to exert influence upon the world in other ways.

Ro­hin Shah: Yeah, I think I agree that Or­a­cles do not seem very com­pet­i­tive.

Lu­cas Perry: Let’s do fore­cast­ing now then.

Rohin Shah: So, the main sub-things within forecasting: one, there's just been a lot of work recently on actually building good forecasting technology. There has been an AI-specific version of Metaculus that's been going on for a while now. There's been some work at the Future of Humanity Institute on building better tools for working with probability distributions and for recording and evaluating forecasts. There was an AI resolution council, where basically now you can make forecasts about what this particular group of people will think in five years or something like that, which is much easier to operationalize than most other kinds of forecasts. So this helps with constructing good questions. On the actual object level, I think there are two main things. One is that it became increasingly more obvious in the past two years that AI progress currently is being driven by larger and larger amounts of compute.

It totally could be driven by other things as well, but at the very least, compute is a pretty important factor. And then takeoff speeds. So there's been this long debate in the AI safety community over whether, to take the extremes, we should expect that AI capabilities will see a very sharp spike. So initially your AI capabilities are improving by, like, one unit a year; maybe then with some improvements it gets to two units a year; and then for whatever reason, suddenly they're at 20 units a year or a hundred units a year, and they just swoop way past what you would get by extrapolating past trends. That's what we might call a discontinuous takeoff. If you predict that that won't happen, instead you'll get AI that's initially improving at one unit per year, then maybe two units per year, maybe three units per year, then five units per year, and the rate of progress continually increases. The world still gets very, very crazy, but in a sort of gradual, continuous way. That would be called a continuous takeoff.

Ba­si­cally there were two posts that ar­gued pretty force­fully for con­tin­u­ous take­off back in, I want to say Fe­bru­ary of 2018, and this at least made me be­lieve that con­tin­u­ous take­off was more likely. Sadly, we just haven’t ac­tu­ally seen much defense of the other side of the view since then. Even though we do know that there definitely are peo­ple who still be­lieve the other side, that there will be a dis­con­tin­u­ous take­off.

Lu­cas Perry: Yeah so what are both you guys’ views on them?

Buck Shlegeris: Here are a couple of things. One is that I really love the operationalization of slow takeoff or continuous takeoff that Paul provided in his post, which was one of the ones Rohin was referring to from February 2018. He says, "by slow takeoff, I mean that there is a four year doubling of the economy before there is a one year doubling of the economy." As in, there's a complete period of four years over which world GDP doubles, before there's any period of one year over which it doubles, as opposed to a situation where the first one-year doubling happens out of nowhere. Currently, doubling times for the economy are on the order of like 20 years, and so a one year doubling would be a really big deal. The way that I would phrase why we care about this is because worlds where we have widespread, human-level AI feel like they have incredibly fast economic growth. And if it's true that we expect AI progress to increase gradually and continuously, then one important consequence of this is that by the time we have human-level AI systems, the world is already totally insane. A four year doubling would just be crazy. That would be economic growth drastically higher than economic growth currently is.
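
For concreteness, here is a back-of-the-envelope version of those numbers (the growth rates are illustrative, not forecasts):

```python
import math

def doubling_time_years(annual_growth_rate):
    """Years for the economy to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)

print(round(doubling_time_years(0.035), 1))  # ~3.5% growth: roughly a 20-year doubling
print(round(doubling_time_years(0.19), 1))   # ~19% growth: roughly a 4-year doubling
print(round(doubling_time_years(1.00), 1))   # 100% growth: a 1-year doubling
```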

This means it would be ob­vi­ous to ev­ery­one who’s pay­ing at­ten­tion that some­thing is up and the world is rad­i­cally chang­ing in a rapid fash­ion. Another way I’ve been think­ing about this re­cently is peo­ple talk about trans­for­ma­tive AI, by which they mean AI which would have at least as much of an im­pact on the world as the in­dus­trial rev­olu­tion had. And it seems plau­si­ble to me that oc­to­pus level AI would be trans­for­ma­tive. Like sup­pose that AI could just never get bet­ter than oc­to­pus brains. This would be way smaller of a deal than I ex­pect AI to ac­tu­ally be, but it would still be a mas­sive deal, and would still pos­si­bly lead to a change in the world that I would call trans­for­ma­tive. And if you think this is true, and if you think that we’re go­ing to have oc­to­pus level AI be­fore we have hu­man level AI, then you should ex­pect that rad­i­cal changes that you might call trans­for­ma­tive have hap­pened by the time that we get to the AI al­ign­ment prob­lems that we’ve been wor­ry­ing about. And if so, this is re­ally big news.

When I was read­ing about this stuff when I was 18, I was ca­su­ally imag­in­ing that the al­ign­ment prob­lem is a thing that some peo­ple have to solve while they’re build­ing an AGI in their lab while the rest of the world’s ig­nor­ing them. But if the thing which is ac­tu­ally hap­pen­ing is the world is go­ing in­sane around ev­ery­one, that’s a re­ally im­por­tant differ­ence.

Rohin Shah: I would say that this is probably the most important contested question in AI alignment right now. Some consequences of it: in a gradual or continuous takeoff world, you expect that by the time we get to systems that can pose an existential risk, you've already had pretty smart systems that have been deployed in the real world. They probably had some failure modes. Whether or not we call them alignment failure modes is maybe not that important. The point is, people will be aware that AI systems can fail in weird ways. Depending on what sorts of failures you expect, you might expect this to lead to more coordination, more involvement in safety work. You might also be more optimistic about using testing and engineering styles of approaches to the problem, which rely a bit more on trial-and-error types of reasoning, because you actually will get a chance to see errors before they happen with a superintelligent, existential-risk-causing system. There are lots of implications of this form that pretty radically change which alignment plans you think are feasible.

Buck Shlegeris: Also, it pretty radically changes how optimistic you are about this whole AI alignment situation. At the very least, for people who are very optimistic about AI alignment, expecting relatively small amounts of existential risk, a lot of the reason for this seems to be that they think we're going to get these warning shots where, before we have superintelligent AI, we have sub-human-level intelligent AI with alignment failures like the cashier Rohin was talking about earlier, and then people start caring about AI alignment a lot more. So optimism is also greatly affected by what you think about this.

I’ve ac­tu­ally been want­ing to ar­gue with peo­ple about this re­cently. I wrote a doc last night where I was ar­gu­ing that even in grad­ual take­off wor­lds, we should ex­pect a rea­son­ably high prob­a­bil­ity of doom if we can’t solve the AI al­ign­ment prob­lem. And I’m in­ter­ested to have this con­ver­sa­tion in more de­tail with peo­ple at some point. But yeah, I agree with what Ro­hin said.

Overall on takeoff speeds, I guess I still feel pretty uncertain. It seems to me that currently, what we can do with AI, like AI capabilities, is increasing consistently, and a lot of this comes from applying relatively non-mindblowing algorithmic ideas to larger amounts of compute and data. And I would be kind of surprised if you can't basically ride this wave all the way until you have transformative AI. And so if I want to argue that we're going to have fast takeoffs, I kind of have to argue that there's some other approach you can take which lets you build AI without having to go along that slow path, and which also will happen first. And I guess I think it's kind of plausible that that is what's going to happen. I think that's what you'd have to argue for if you want to argue for a fast takeoff.

Rohin Shah: That all seems right to me. I'd be surprised if, out of nowhere, we saw a new AI approach suddenly start working and overtake deep learning. You also have to argue that it then very quickly reaches human-level AI, which would be quite surprising, right? In some sense, it would have to be something completely novel that we failed to think about in the last 60 years. We're putting in way more effort now than we were in the last 60 years, but then the counter-counterpoint is that all of that extra effort is going straight into deep learning. It's not really searching for completely new, paradigm-shifting ways to get to AGI.

Buck Shlegeris: So here's how I'd make that argument. Perhaps a really important input into a field like AI is the number of really smart kids who have been wanting to be AI researchers since they were 16 because they thought that it's the most important thing in the world. I think that in physics, a lot of the people who turn into physicists have actually wanted to be physicists forever. I think the number of really smart kids who wanted to be AI researchers forever has possibly gone up by a factor of 10 over the last 10 years; it might even be more. And there just are problems, sometimes, that are bottlenecked on that kind of a thing, probably. And so it wouldn't be totally shocking to me if, as a result of this particular input to AI radically increasing, we end up in kind of a different situation. I haven't quite thought through this argument fully.

Rohin Shah: Yeah, the argument seems plausible. There's a large space of arguments like this. I think even after that, then I start questioning, "Okay, we get a new paradigm. Don't the same arguments apply to that paradigm?" Not as strongly. I guess not the arguments you were saying about compute going up over time, but the arguments given in the original slow takeoff posts, which were that people quickly start taking the low-hanging fruit and then move on. When there's a lot of effort being put into getting some property, you should expect that easy, low-hanging fruit is usually just already taken, and that's why you don't expect discontinuities. Unless the new idea just immediately rockets you to human-level AGI, or x-risk-causing AGI, I think the same argument would pretty quickly start applying to that as well.

Buck Sh­legeris: I think it’s plau­si­ble that you do get rock­eted pretty quickly to hu­man-level AI. And I agree that this is an in­sane sound­ing claim.

Ro­hin Shah: Great. As long as we agree on that.

Buck Shlegeris: Something which has been on my to-do list for a while, and something I've been doing a bit of and I'd be excited for someone else doing more of, is reading the history of science and getting more of a sense of what kinds of things are bottlenecked by what, and where. It could lead me to be a bit less confused about a bunch of this stuff. AI Impacts has done a lot of great work cataloging all of the things that aren't discontinuous changes, which certainly is strong evidence to me against my claim here.

Lu­cas Perry: All right. What is the prob­a­bil­ity of AI-in­duced ex­is­ten­tial risk?

Ro­hin Shah: Un­con­di­tional on any­thing? I might give it 1 in 20. 5%.

Buck Sh­legeris: I’d give 50%.

Rohin Shah: I had a conversation with AI Impacts that went into this in more detail, and partially just anchored on the number I gave there, which was 10% conditional on no intervention from longtermists. I think the broad argument is really just the one that Buck and I were disagreeing about earlier, which is: to what extent will society be incentivized to solve the problem? There's some chance that the first thing we try just works and we don't even need to solve any sort of alignment problem. It might just be fine. This is not implausible to me. Maybe that's 30% or something.

Most of the remaining probability comes from, "Okay, the alignment problem is a real problem. We need to deal with it." It might be very easy, in which case we can just solve it straight away. That might be the case; it doesn't seem that likely to me, if it was a problem at all. But what we will get is a lot of these warning shots, and people understanding the risks a lot more as we get more powerful AI systems. This estimate is also conditional on gradual takeoff. I keep forgetting to say that, mostly because I don't know what probability I should put on discontinuous takeoff.

Lu­cas Perry: So is 5% with longter­mist in­ter­ven­tion, in­creas­ing to 10% if fast take­off?

Rohin Shah: Yes, but still with longtermist intervention. I'm pretty pessimistic conditional on fast takeoff, but my probability assigned to fast takeoff is not very high. In a gradual takeoff world, you get a lot of warning shots. There will just generally be awareness of the fact that the alignment problem is a real thing, and you won't have the situation you have right now, where people say that this worry about superintelligent AI systems not doing what we want is totally bullshit. That won't be a thing. Almost everyone will not be saying that anymore, in the version where we're right and there is a problem. As a result, people will not want to build AI systems that are going to kill them. People tend to be pretty risk averse in my estimation of the world, which Buck will probably disagree with. And as a result, you'll get a lot of people trying to actually work on solving the alignment problem. There'll be some amount of global coordination, which will give us more time to solve the alignment problem than we may otherwise have had. And together, these forces mean that probably we'll be okay.

Buck Sh­legeris: So I think my dis­agree­ments with Ro­hin are ba­si­cally that I think fast take­offs are more likely. I ba­si­cally think there is al­most surely a prob­lem. I think that al­ign­ment might be difficult, and I’m more pes­simistic about co­or­di­na­tion. I know I said four things there, but I ac­tu­ally think of this as three dis­agree­ments. I want to say that “there isn’t ac­tu­ally a prob­lem” is just a kind of “al­ign­ment is re­ally easy to solve.” So then there’s three dis­agree­ments. One is grad­ual take­off, an­other is difficulty of solv­ing com­pet­i­tive pro­saic al­ign­ment, and an­other is how good we are at co­or­di­na­tion.

I haven't actually written down these numbers since I last changed my mind about a lot of the inputs to them, so maybe I'm being really dumb. I guess it feels to me that in fast takeoff worlds, we are very sad unless we have competitive alignment techniques, and so then we're just only okay if we have these competitive alignment techniques. I guess I would say that I'm something like 30% on us having good competitive alignment techniques by the time that it's important, which incidentally is higher than Rohin's number, I think.

Rohin Shah: Yeah, 30 is totally within the 25th-to-75th-percentile interval on that probability, which is a weird thing to be reporting. 30 might be my median, I don't know.

Buck Shlegeris: To be clear, I'm not just including the outer alignment portion here, which is what we were talking about before with IDA. I'm also including inner alignment.

Ro­hin Shah: Yeah, 30% does seem a bit high. I think I’m a lit­tle more pes­simistic.

Buck Shlegeris: So I'm like 30% that we can just solve the AI alignment problem in this excellent way, such that anyone who wants to can, at very little extra cost, make AI systems that are aligned. I feel like in worlds where we did that, it's pretty likely that things are reasonably okay. I think that the gradual versus fast takeoff isn't actually enormously much of a crux for me, because I feel like in worlds without competitive alignment techniques and gradual takeoff, we still have a very high probability of doom. And I think that comes down to disagreements about coordination. So maybe the main important disagreement between Rohin and I is actually how well we'll be able to coordinate, or how strong individual incentives will be for alignment.

Rohin Shah: I think there are other things. The reason I feel a bit more pessimistic than you in the fast takeoff world is that solving problems in advance is just really quite difficult, and I really like the ability to be able to test techniques on actual AI systems. You'll have to work with less powerful things. At some point, you do have to make the jump to more powerful things. But, still, being able to test on the less powerful things, that's so good, so much safety from there.

Buck Shlegeris: It's not actually clear to me that you get to test the most important parts of your safety techniques. So I think that there are a bunch of safety problems that just do not occur on dog-level AIs, and do occur on human-level AI. Say there are three levels of AI: there's a thing which is as powerful as a dog, there's a thing which is as powerful as a human, and there's a thing which is as powerful as a thousand John von Neumanns. In a gradual takeoff world, you have a bunch of time at both of these two milestones, maybe. I guess it's not super clear to me that you can use results on less powerful systems as that much evidence about whether your safety techniques work on drastically more powerful systems. It's definitely somewhat helpful.

Ro­hin Shah: It de­pends what you con­di­tion on in your differ­ence be­tween con­tin­u­ous take­off and dis­con­tin­u­ous take­off to say which one of them hap­pens faster. I guess the delta be­tween dog and hu­man is definitely longer in grad­ual take­off for sure. Okay, if that’s what you were say­ing, yep, I agree with that.

Buck Sh­legeris: Yeah, sorry, that’s all I meant.

Ro­hin Shah: Cool. One thing I wanted to ask is when you say dog-level AI as­sis­tant, do you mean some­thing like a neu­ral net that if put in a dog’s body re­plac­ing its brain would do about as well as a dog? Be­cause such a neu­ral net could then be put in other en­vi­ron­ments and learn to be­come re­ally good at other things, prob­a­bly su­per­hu­man at many things that weren’t in the an­ces­tral en­vi­ron­ment. Do you mean that sort of thing?

Buck Sh­legeris: Yeah, that’s what I mean. Dog-level AI is prob­a­bly much bet­ter than GPT2 at an­swer­ing ques­tions. I’m go­ing to define some­thing as dog-level AI, if it’s about as good as a dog at things which I think dogs are pretty heav­ily op­ti­mized for, like vi­sual pro­cess­ing or mo­tor con­trol in novel sce­nar­ios or other things like that, that I think dogs are pretty good at.

Ro­hin Shah: Makes sense. So I think in that case, plau­si­bly, dog-level AI already poses an ex­is­ten­tial risk. I can be­lieve that too.

Buck Sh­legeris: Yeah.

Ro­hin Shah: The AI cashier ex­am­ple feels like it could to­tally hap­pen prob­a­bly be­fore a dog-level AI. You’ve got all of the mo­ti­va­tion prob­lems already at that point of the game, and I don’t know what prob­lems you ex­pect to see be­yond then.

Buck Sh­legeris: I’m more talk­ing about whether you can test your solu­tions. I’m not quite sure how to say my in­tu­itions here. I feel like there are var­i­ous strate­gies which work for cor­ral­ling dogs and which don’t work for mak­ing hu­mans do what you want. In as much as your al­ign­ment strat­egy is aiming at a fla­vor of prob­lem that only oc­curs when you have su­per­hu­man things, you don’t get to test that ei­ther way. I don’t think this is a su­per im­por­tant point un­less you think it is. I guess I feel good about mov­ing on from here.

Ro­hin Shah: Mm-hmm (af­fir­ma­tive). Sounds good to me.

Lu­cas Perry: Okay, we’ve talked about what you guys have called grad­ual and fast take­off sce­nar­ios, or con­tin­u­ous and dis­con­tin­u­ous. Could you guys put some prob­a­bil­ities down on the like­li­hood of, and sto­ries that you have in your head, for fast and slow take­off sce­nar­ios?

Rohin Shah: That is a hard question. There are two sorts of reasoning I do about probabilities. One is to use my internal simulation of whatever I'm trying to predict: internally simulate what it looks like, whether by my own models it's likely, how likely it is, at what point I would be willing to bet on it, stuff like that. And then there's a separate extra step where I'm like, "What do other people think about this? Oh, a lot of people think this thing that I assigned one percent probability to is very likely. Hmm, I should probably not be saying one percent then." I don't know how to do that second part for, well, most things, but especially in this setting. So I'm going to just report Rohin's model only, which will predictably be understating the probability for fast takeoff, in that if someone from MIRI were to talk to me for five hours, I would probably say a higher number for the probability of fast takeoff after that, and I know that that's going to happen. I'm just going to ignore that fact and report my own model anyway.

On my own model, it's something like: in worlds where AGI happens soon, like in the next couple of decades, then I'm like, "Man, 95% on gradual takeoff." If it's further away, like three to five decades, then I'm like, "Some things could have changed by then, maybe I'm 80%." And then if it's way off into the future, in centuries, then I'm like, "Ah, maybe it's 70%, 65%." The reason it goes down over time is just because it seems to me like if you want to argue for discontinuous takeoff, you need to posit that there's some paradigm change in how AI progress is happening, and that seems more likely the further in the future you go.

Buck Shlegeris: I feel kind of surprised that you get so low, like to 65% or 70%. I would have thought that those arguments are a strong default, and that maybe we're at a moment that seems particularly gradual-takeoff-y, but I would have thought that over time you'd get to 80% or something.

Ro­hin Shah: Yeah. Maybe my in­ter­nal model is like, “Holy shit, why do these MIRI peo­ple keep say­ing that dis­con­tin­u­ous take­off is so ob­vi­ous.” I agree that the ar­gu­ments in Paul’s posts feel very com­pel­ling to me and so maybe I should just be more con­fi­dent in them. I think say­ing 80%, even in cen­turies is plau­si­bly a cor­rect an­swer.

Lucas Perry: So, Rohin, is the view here that, since compute is the thing that's being leveraged to make most AI advances, you would expect that to be the mechanism by which advances continue to happen in the future, and we have some certainty over how compute continues to change into the future? Whereas the things that would lead to a discontinuous takeoff would be world-shattering, fundamental insights into algorithms that would enable powerful recursive self-improvement, which is something you wouldn't necessarily see if we just keep going down this leveraging-compute route?

Ro­hin Shah: Yeah, I think that’s a pretty good sum­mary. Again, on the back­drop of the de­fault ar­gu­ment for this is peo­ple are re­ally try­ing to build AGI. It would be pretty sur­pris­ing if there is just this re­ally im­por­tant thing that ev­ery­one had just missed.

Buck Shlegeris: It sure seems like in machine learning, when I look at the things which have happened over the last 20 years, all of them feel like the ideas are kind of obvious, or someone else had proposed them 20 years earlier. ConvNets were proposed 20 years before they were good on ImageNet, and LSTMs were around ages before they were good for natural language, and so on and so on. Other subjects are not like this; in physics sometimes they just messed around for 50 years before they knew what was happening. I don't know, I feel confused about how to feel about the fact that in some subjects, it feels like they just do suddenly get better at things for reasons other than having more compute.

Ro­hin Shah: I think physics, at least, was of­ten bot­tle­necked by mea­sure­ments, I want to say.

Buck Sh­legeris: Yes, so this is one rea­son I’ve been in­ter­ested in his­tory of sci­ence re­cently, but there are cer­tainly a bunch of things. Peo­ple were in­ter­ested in chem­istry for a long time and it turns out that chem­istry comes from quan­tum me­chan­ics and you could, the­o­ret­i­cally, have guessed quan­tum me­chan­ics 70 years ear­lier than peo­ple did if you were smart enough. It’s not that com­pli­cated a hy­poth­e­sis to think of. Or rel­a­tivity is the clas­sic ex­am­ple of some­thing which could have been in­vented 50 years ear­lier. I don’t know, I would love to learn more about this.

Lu­cas Perry: Just to tie this back to the ques­tion, could you give your prob­a­bil­ities as well?

Buck Shlegeris: Oh, geez, I don't know. Honestly, right now I feel like I'm 70% gradual takeoff or something, but I don't know. I might change my mind if I think about this for another hour. And there are also theoretical arguments for why most takeoffs are gradual, like the stuff in Paul's post. The easiest summary is: before someone does something really well, someone else does it kind of well, in cases where a lot of people are trying to do the thing.

Lu­cas Perry: Okay. One facet of this, that I haven’t heard dis­cussed, is re­cur­sive self-im­prove­ment, and I’m con­fused about where that be­comes the thing that af­fects whether it’s dis­con­tin­u­ous or con­tin­u­ous. If some­one does some­thing kind of well be­fore some­thing does some­thing re­ally well, if re­cur­sive self-im­prove­ment is a prop­erty of the thing be­ing done kind of well, is it just kind of self-im­prov­ing re­ally quickly, or?

Buck Shlegeris: Yeah. I think Paul's post does a great job of talking about this exact argument. I think his basic claim, which I find pretty plausible, is: before you have a system which is really good at self-improving, you have a system which is kind of good at self-improving, if it turns out to be really helpful to have a system be good at self-improving. And once that's true, to get a sharp jump you have to posit an additional discontinuity somewhere else.

Rohin Shah: One other thing I'd note is that humans are totally self-improving. Productivity techniques, for example, are a form of self-improvement. You could imagine that AI systems might have advantages that humans don't, like being able to read their own weights and edit them directly. How much of an advantage this gives to the AI system, unclear. Still, I think then I just go back to the argument that Buck already made, which is that at some point you get to an AI system that is somewhat good at understanding its weights and figuring out how to edit them, and that happens before you get the really powerful ones. Maybe this is like saying, "Well, you'll reach human levels of self-improvement by the time you have rat-level AI or something, instead of human-level AI," which argues that you'll hit this hyperbolic point of the curve earlier, but it still looks like a hyperbolic curve that's still continuous at every point.
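
A standard toy model of the hyperbolic-but-continuous curve Rohin is describing (purely illustrative, not anyone's specific forecast): if a system's capability feeds back into its own growth rate, say

$$\frac{dx}{dt} = c\,x^{2} \quad\Longrightarrow\quad x(t) = \frac{x_0}{1 - c\,x_0\,t},$$

then $x$ blows up in finite time (as $t$ approaches $1/(c\,x_0)$) yet is continuous at every point before that; kicking off the feedback earlier, with a smaller rat-level $x_0$, just moves the steep part of the curve earlier rather than introducing a jump.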

Buck Sh­legeris: I agree.

Lu­cas Perry: I feel just gen­er­ally sur­prised about your prob­a­bil­ities on con­tin­u­ous take­off sce­nar­ios that they’d be slow.

Rohin Shah: The reason I'm trying to avoid the words "slow" and "fast" is because they're misleading. Slow takeoff is not slow in calendar time relative to fast takeoff. The question is: is there a spike at some point? Some people, upon reading Paul's posts, are like, "Slow takeoff is faster than fast takeoff." That's a reasonably common reaction to it.

Buck Sh­legeris: I would put it as slow take­off is the claim that things are in­sane be­fore you have the hu­man-level AI.

Ro­hin Shah: Yeah.

Lu­cas Perry: This seems like a helpful per­spec­tive shift on this take­off sce­nario ques­tion. I have not read Paul’s post. What is it called so that we can in­clude it in the page for this pod­cast?

Ro­hin Shah: It’s just called Take­off Speeds. Then the cor­re­spond­ing AI Im­pacts post is called Will AI See Dis­con­tin­u­ous Progress?, I be­lieve.

Lu­cas Perry: So if each of you guys had a lot more reach and in­fluence and power and re­sources to bring to the AI al­ign­ment prob­lem right now, what would you do?

Ro­hin Shah: I get this ques­tion a lot and my re­sponse is always, “Man, I don’t know.” It seems hard to scal­ably use peo­ple right now for AI risk. I can talk about which ar­eas of re­search I’d like to see more peo­ple fo­cus on. If you gave me peo­ple where I’m like, “I trust your judg­ment on your abil­ity to do good con­cep­tual work” or some­thing, where would I put them? I think a lot of it would be on mak­ing good ro­bust ar­gu­ments for AI risk. I don’t think we re­ally have them, which seems like kind of a bad situ­a­tion to be in. I think I would also in­vest a lot more in hav­ing good in­tro­duc­tory ma­te­ri­als, like this re­view, ex­cept this re­view is a lit­tle more aimed at peo­ple who are already in the field. It is less aimed at peo­ple who are try­ing to en­ter the field. I think we just have pretty ter­rible re­sources for peo­ple com­ing into the field and that should change.

Buck Sh­legeris: I think that our re­sources are way bet­ter than they used to be.

Ro­hin Shah: That seems true.

Buck Sh­legeris: In the course of my work, I talk to a lot of peo­ple who are new to AI al­ign­ment about it and I would say that their level of in­formed­ness is dras­ti­cally bet­ter now than it was two years ago. A lot of which is due to things like 80,000 hours pod­cast, and other things like this pod­cast and the Align­ment Newslet­ter, and so on. I think we just have made it some­what eas­ier for peo­ple to get into ev­ery­thing. The Align­ment Fo­rum, hav­ing its se­quences promi­nently dis­played, and so on.

Ro­hin Shah: Yeah, you named liter­ally all of the things I would have named. Buck definitely has more in­for­ma­tion on this than I do. I do not work with peo­ple who are en­ter­ing the field as much. I do think we could be sub­stan­tially bet­ter.

Buck Shlegeris: Yes. I feel like I do have access to resources, not directly, but in the sense that I know people at, e.g., Open Philanthropy and the EA Funds, and if I thought there were obvious things they should do, I think it's pretty likely that those funders would have already made them happen. And I occasionally embark on projects myself that I think are good for AI alignment, mostly on the outreach side. On a few occasions over the last year, I've just done projects that I was optimistic about. So I don't think I can name things that are just shovel-ready opportunities for someone else to do, which is good news, because it's mostly because I think most of these things are already being done.

I am enthusiastic about workshops. I help run, with MIRI, these AI Risks for Computer Scientists workshops, and I ran my own computing workshop with some friends, with kind of a similar purpose, aimed at people who are interested in this kind of stuff and who would like to spend some time learning more about it. I feel optimistic about this kind of project as a way of doing the thing Rohin was saying, making it easier for people to start having really deep thoughts about a lot of AI alignment stuff. So that's a kind of direction of projects that I'm pretty enthusiastic about. A couple other random AI alignment things I'm optimistic about: I've already mentioned that I think there should be an Ought competitor, just because it seems like the kind of thing that more work could go into. I agree with Rohin on it being good to have more conceptual analysis of a bunch of this stuff. I'm generically enthusiastic about there being more high quality research done, and more smart people who've thought about this a lot working on it as best as they can.

Rohin Shah: I think the actual bottleneck is good research and not necessarily field building, and I'm more optimistic about good research. Specifically, I am particularly interested in universality and interpretability. I would love for there to be some way to give people who work on AI alignment the chance to step back and think about the high-level picture for a while. I don't know if people don't do this because they don't want to or because they don't feel like they have the affordance to do so, and I would like the affordance to be there. I'd be very interested in people building models of what AGI systems could look like. Expected utility maximizers are one example of a model that you could have. Maybe we just try to redo evolution: we create a very complicated, diverse environment with lots of agents going around, and in their multi-agent interaction they develop general intelligence somehow. I'd be interested for someone to take that scenario, flesh it out more, and then talk about what the alignment problem looks like in that setting.

Buck Sh­legeris: I would love to have some­one get re­ally knowl­edge­able about evolu­tion­ary biol­ogy and try and ap­ply analo­gies of that to AI al­ign­ment. I think that evolu­tion­ary biol­ogy has lots of smart things to say about what op­ti­miz­ers are and it’d be great to have those in­sights. I think Eliezer sort of did this many years ago. It would be good for more peo­ple to do this in my opinion.

Lucas Perry: All right. We're in the home stretch here. AI timelines. What do you think about the current state of predictions? There have been surveys done in which most researchers put maybe 50% probability on AGI by about 2050 or so. What are each of your AI timelines? What does your probability distribution look like? What do you think about the state of predictions on this?

Ro­hin Shah: Haven’t looked at the state of pre­dic­tions in a while. It de­pends on who was sur­veyed. I think most peo­ple haven’t thought about it very much and I don’t know if I ex­pect their pre­dic­tions to be that good, but maybe wis­dom of the crowds is a real thing. I don’t think about it very much. I mostly use my in­side view and talk to a bunch of peo­ple. Maybe, me­dian, 30 years from now, which is 2050. So I guess I agree with them, don’t I? That feels like an ac­ci­dent. The sur­veys were not an in­put into this pro­cess.

Lu­cas Perry: Okay, Buck?

Buck Sh­legeris: I don’t know what I think my over­all timelines are. I think AI in the next 10 or 20 years is pretty plau­si­ble. Maybe I want to give it some­thing around 50% which puts my me­dian at around 2040. In terms of the state of things that peo­ple have said about AI timelines, I have had some re­ally great con­ver­sa­tions with peo­ple about their re­search on AI timelines which hasn’t been pub­lished yet. But at some point in the next year, I think it’s pretty likely that much bet­ter stuff about AI timelines mod­el­ing will have been pub­lished than has cur­rently been pub­lished, so I’m ex­cited for that.

Lucas Perry: All right. Information hazards. Originally, there seemed to be a lot of worry in the community about information hazards, and even about talking about superintelligence and being afraid of talking to anyone in positions of power, whether they be in private institutions or in government, about the strategic advantage of AI, about how one day it may confer a decisive strategic advantage. The dissonance here for me is that Putin comes out and says that whoever controls AI will control the world. Nick Bostrom published Superintelligence, which basically says what I already said. Max Tegmark's Life 3.0 basically also. My initial reaction and intuition is: the cat's out of the bag. I don't think that echoing this increases risks any further than the risk is already at. But maybe you disagree.

Buck Sh­legeris: Yeah. So here are two opinions I have about info haz­ards. One is: how bad is it to say stuff like that all over the in­ter­net? My guess is it’s mildly bad be­cause I think that not ev­ery­one thinks those things. I think that even if you could get those opinions as con­se­quences from read­ing Su­per­in­tel­li­gence, I think that most peo­ple in fact have not read Su­per­in­tel­li­gence. Some­times there are ideas where I just re­ally don’t want them to be crys­tal­lized com­mon knowl­edge. I think that, to a large ex­tent, as­sum­ing grad­ual take­off wor­lds, it kind of doesn’t mat­ter be­cause AI sys­tems are go­ing to be rad­i­cally trans­form­ing the world in­evitably. I guess you can af­fect how gov­ern­ments think about it, but it’s a bit differ­ent there.

The other point I want to make about info hazards is that I think there are a bunch of trickinesses with AI safety, where thinking about AI safety makes you think about questions about how AI development might go. I think that thinking about how AI development is going to go occasionally leads to thinking about things that could be relevant to capabilities, and I think that this makes it hard to do research, because you then get scared about talking about them.

Rohin Shah: So I think my take on this is: info hazards are real, in the sense that there are, in fact, costs to saying specific kinds of things and publicizing them a bit. I'll agree in principle that some kinds of capabilities information have the cost of accelerating timelines. I usually think these are pretty strongly outweighed by the benefits, in that it just seems really hard to do any kind of shared intellectual work when you're constantly worried about what you do and don't make public. It really seems like if you want to build a shared understanding within the field of AI alignment, that benefit is worth saying things that might be bad in some other ways. This depends on a lot of background facts that I'm not going to cover here, but, for example, I probably wouldn't say the same thing about biosecurity.

Lu­cas Perry: Okay. That makes sense. Thanks for your opinions on this. So at the cur­rent state in time, do you guys think that peo­ple should be en­gag­ing with peo­ple in gov­ern­ment or in policy spheres on ques­tions of AI al­ign­ment?

Rohin Shah: Yes, but not in the sense of "we're worried about when AGI comes." Even saying things like "it might be really bad," as opposed to saying "it might kill everybody," seems not great. This is mostly on the basis that my model for what it takes to get governments to do things is that, at the very least, you need consensus in the field, so it seems kind of pointless to try right now. It might even be poisoning the well for future efforts. I think it does make sense to engage with government and policymakers about things that are in fact problems right now. To the extent that you think that recommender systems are causing a lot of problems, I think it makes sense to engage with government about how alignment-like techniques can help with that, especially if you're doing a bunch of specification learning-type stuff. That seems like the sort of stuff that should have relevance today, and I think it would be great if those of us who did specification learning were trying to use it to improve existing systems.

Buck Sh­legeris: This isn’t my field. I trust the judg­ment of a lot of other peo­ple. I think that it’s plau­si­ble that it’s worth build­ing re­la­tion­ships with gov­ern­ments now, not that I know what I’m talk­ing about. I will note that I ba­si­cally have only seen peo­ple talk about how to do AI gov­er­nance in the cases where the AI safety prob­lem is 90th per­centile eas­iest. I ba­si­cally only see peo­ple talk­ing about it in the case where the tech­ni­cal safety prob­lem is pretty doable, and this con­cerns me. I’ve just never seen any­one talk about what you do in a world where you’re as pes­simistic as I am, ex­cept to com­pletely give up.

Lu­cas Perry: All right. Wrap­ping up here, is there any­thing else that we didn’t talk about that you guys think was im­por­tant? Or some­thing that we weren’t able to spend enough time on, that you would’ve liked to spend more time on?

Ro­hin Shah: I do want to even­tu­ally con­tinue the con­ver­sa­tion with Buck about co­or­di­na­tion, but that does seem like it should hap­pen not on this pod­cast.

Buck Shlegeris: That's what I was going to say too. Something that I want someone to do is write a trajectory for how AI goes down that is really specific about what world GDP is in every one of the years from now until an insane intelligence explosion, and just write down what the world is like in each of those years, because I don't know how to write an internally consistent, plausible trajectory. I don't know how to write even one of those for anything except a ridiculously fast takeoff. And this feels like a real shame.

Ro­hin Shah: That seems good to me as well. And also the sort of thing that I could not do be­cause I don’t know eco­nomics.

Lu­cas Perry: All right, so let’s wrap up here then. So if listen­ers are in­ter­ested in fol­low­ing ei­ther of you or see­ing more of your blog posts or places where you would recom­mend they read more ma­te­ri­als on AI al­ign­ment, where can they do that? We’ll start with you, Buck.

Buck Sh­legeris: You can Google me and find my web­site. I of­ten post things on the Effec­tive Altru­ism Fo­rum. If you want to talk to me about AI al­ign­ment in per­son, per­haps you should ap­ply to the AI Risks for Com­puter Scien­tists work­shops run by MIRI.

Lu­cas Perry: And Ro­hin?

Ro­hin Shah: I write the Align­ment Newslet­ter. That’s a thing that you could sign up for. Also on my web­site, if you Google Ro­hin Shah Align­ment Newslet­ter, I’m sure I will come up. Th­ese are also cross posted to the Align­ment Fo­rum, so an­other thing you can do is go to the Align­ment Fo­rum, look up my user­name and just see things that are there. I don’t know that this is ac­tu­ally the thing that you want to be do­ing. If you’re new to AI safety and want to learn more about it, I would echo the re­sources Buck men­tioned ear­lier, which are the 80k pod­casts about AI al­ign­ment. There are prob­a­bly on the or­der of five of these. There’s the Align­ment Newslet­ter. There are the three recom­mended se­quences on the Align­ment Fo­rum. Just go to al­ign­ment­fo­rum.org and look un­der recom­mended se­quences. And this pod­cast, of course.

Lu­cas Perry: All right. Heroic job, ev­ery­one. This is go­ing to be a re­ally good re­source, I think. It’s given me a lot of per­spec­tive on how think­ing has changed over the past year or two.

Buck Sh­legeris: And we can listen to it again in a year and see how dumb we are.

Lu­cas Perry: Yeah. There were lots of pre­dic­tions and prob­a­bil­ities given to­day, so it’ll be in­ter­est­ing to see how things are in a year or two from now. That’ll be great. All right, so cool. Thank you both so much for com­ing on.

End of recorded ma­te­rial