# [AN #80]: Why AI risk might be solved without additional intervention from longtermists

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Welcome to another special edition of the newsletter! In this edition, I summarize four conversations that AI Impacts had with researchers who were optimistic that AI safety would be solved “by default”. (Note that one of the conversations was with me.)

While all four of these conversations covered very different topics, I think there were three main points of convergence. First, we were relatively unconvinced by the traditional arguments for AI risk, and found discontinuities relatively unlikely. Second, we were more optimistic about solving the problem in the future, when we know more about the problem and have more evidence about powerful AI systems. And finally, we were more optimistic that as we get more evidence of the problem in the future, the existing ML community will actually try to fix that problem.

Conversation with Paul Christiano (Paul Christiano, Asya Bergal, Ronny Fernandez, and Robert Long) (summarized by Rohin): There can’t be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left (ETA: see this comment). So, the prior that any particular thing has such an impact should be quite low. With AI in particular, obviously we’re going to try to make AI systems that do what we want them to do. So starting from this position of optimism, we can then evaluate the arguments for doom. The two main arguments: first, we can’t distinguish ahead of time between AIs that are trying to do the right thing, and AIs that are trying to kill us, because the latter will behave nicely until they can execute a treacherous turn. Second, since we don’t have a crisp concept of “doing the right thing”, we can’t select AI systems on whether they are doing the right thing.

However, there are many “saving throws”, or ways that the argument could break down, avoiding doom. Perhaps there’s no problem at all, or perhaps we can cope with it with a little bit of effort, or perhaps we can coordinate to not build AIs that destroy value. Paul assigns a decent amount of probability to each of these (and other) saving throws, and any one of them suffices to avoid doom. This leads Paul to estimate that AI risk reduces the expected value of the future by roughly 10%, a relatively optimistic number. Since the area is so neglected, concerted effort by longtermists could reduce the risk to 5%, making it still a very valuable area for impact. The main way he expects to change his mind is from evidence from more powerful AI systems: e.g. as we build more powerful AI systems, perhaps inner optimizer concerns will materialize and we’ll see examples where an AI system executes a non-catastrophic treacherous turn.

Paul also believes that clean algorithmic problems are usually solvable in 10 years, or provably impossible, and early failures to solve a problem don’t provide much evidence of the difficulty of the problem (unless they generate proofs of impossibility). So, the fact that we don’t know how to solve alignment now doesn’t provide very strong evidence that the problem is impossible. Even if the clean versions of the problem were impossible, that would suggest that the problem is much more messy, which requires more concerted effort to solve but also tends to be just a long list of relatively easy tasks to do. (In contrast, MIRI thinks that prosaic AGI alignment is probably impossible.)

Note that even finding out that the problem is impossible can help; it makes it more likely that we can all coordinate to not build dangerous AI systems, since no one wants to build an unaligned AI system. Paul thinks that right now the case for AI risk is not very compelling, and so people don’t care much about it, but if we could generate more compelling arguments, then they would take it more seriously. If instead you think that the case is already compelling (as MIRI does), then you would be correspondingly more pessimistic about others taking the arguments seriously and coordinating to avoid building unaligned AI.

One potential reason MIRI is more doomy is that they take a somewhat broader view of AI safety: in particular, in addition to building an AI that is trying to do what you want it to do, they would also like to ensure that when the AI builds successors, it does so well. In contrast, Paul simply wants to leave the next generation of AI systems in at least as good a situation as we find ourselves in now, since they will be both better informed and more intelligent than we are. MIRI has also previously defined aligned AI as one that produces good outcomes when run, which is a much broader conception of the problem than Paul’s. But probably the main disagreement between MIRI and ML researchers is that ML researchers expect that we’ll try a bunch of stuff, and something will work out, whereas MIRI expects that the problem is really hard, such that trial and error will only get you solutions that appear to work.

Rohin’s opinion: A general theme here seems to be that MIRI feels like they have very strong arguments, while Paul thinks that they’re plausible arguments, but aren’t extremely strong evidence. Simply having a lot more uncertainty leads Paul to be much more optimistic. I agree with most of this.

However, I do disagree with the point about “clean” problems. I agree that clean algorithmic problems are usually solved within 10 years or are provably impossible, but it doesn’t seem to me like AI risk counts as a clean algorithmic problem: we don’t have a nice formal statement of the problem that doesn’t rely on intuitive concepts like “optimization”, “trying to do something”, etc. This suggests to me that AI risk is more “messy”, and so may require more time to solve.

Conversation with Rohin Shah (Rohin Shah, Asya Bergal, Robert Long, and Sara Haxhia) (summarized by Rohin): The main reason I am optimistic about AI safety is that we will see problems in advance, and we will solve them, because nobody wants to build unaligned AI. A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn’t scale. I don’t know why there are different underlying intuitions here.

In addition, many of the classic arguments for AI safety involve a system that can be decomposed into an objective function and a world model, which I suspect will not be a good way to model future AI systems. In particular, current systems trained by RL look like a grab bag of heuristics that correlate well with obtaining high reward. I think that as AI systems become more powerful, the heuristics will become more and more general, but they still won’t decompose naturally into an objective function, a world model, and search. In addition, we can look at humans as an example: we don’t fully pursue convergent instrumental subgoals; for example, humans can be convinced to pursue different goals. This makes me more skeptical of traditional arguments.

I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. This should make alignment and oversight of these systems significantly easier. For significantly stronger systems, we should be delegating the problem to the AI systems that are 10x more intelligent than us. (This is very similar to the picture painted in Chris Olah’s views on AGI safety (AN #72), but that had not been published and I was not aware of Chris’s views at the time of this conversation.)

I’m also less worried about race dynamics increasing accident risk than the median researcher. The benefit of racing a little bit faster is to have a little bit more power / control over the future, while also increasing the risk of extinction a little bit. This seems like a bad trade from each agent’s perspective. (That is, the Nash equilibrium is for all agents to be cautious, because the potential upside of racing is small and the potential downside is large.) I’d be more worried if [AI risk is real AND not everyone agrees AI risk is real when we have powerful AI systems], or if the potential upside was larger (e.g. if racing a little more made it much more likely that you could achieve a decisive strategic advantage).
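
To make that Nash-equilibrium claim concrete, here is a toy expected-value calculation (the payoff numbers are mine, purely illustrative, and not from the interview):

```python
# Illustrative payoffs, not from the interview: racing buys a small gain in
# control over the future at a larger increment of extinction risk.
def expected_value(control_share: float, extinction_risk: float) -> float:
    # Expected share of a surviving future's value that this agent captures.
    return control_share * (1 - extinction_risk)

cautious = expected_value(control_share=0.50, extinction_risk=0.01)
racing = expected_value(control_share=0.55, extinction_risk=0.20)

print(f"cautious: {cautious:.3f}, racing: {racing:.3f}")
# prints roughly: cautious: 0.495, racing: 0.440
```

Under payoffs like these, racing is a bad trade no matter what the others do, so “everyone cautious” is an equilibrium; a much larger upside (e.g. a real shot at decisive strategic advantage) would flip the sign.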

Overall, it feels like there’s around 90% chance that AI would not cause x-risk without additional intervention by longtermists. The biggest disagreement between me and more pessimistic researchers is that I think gradual takeoff is much more likely than discontinuous takeoff (and in fact, the first, third and fourth paragraphs above are quite weak if there’s a discontinuous takeoff). If I condition on discontinuous takeoff, then I mostly get very confused about what the world looks like, but I also get a lot more worried about AI risk, especially because the “AI is to humans as humans are to ants” analogy starts looking more accurate. In the interview I said 70% chance of doom in this world, but with way more uncertainty than any of the other credences, because I’m really confused about what that world looks like. Two other disagreements, besides the ones above: I don’t buy Realism about rationality (AN #25), whereas I expect many pessimistic researchers do. I may also be more pessimistic about our ability to write proofs about fuzzy concepts like those that arise in alignment.

On timelines, I estimated a very rough 50% chance of AGI within 20 years, and a 30-40% chance that it would be using “essentially current techniques” (which is obnoxiously hard to define). Conditional on both of those, I estimated a 70% chance that it would be something like a mesa optimizer, mostly because optimization is a very useful instrumental strategy for solving many tasks, especially because gradient descent and other current algorithms are very weak optimization algorithms (relative to e.g. humans), and so learned optimization algorithms will be necessary to reach human levels of sample efficiency.

Rohin’s opinion: Looking over this again, I’m realizing that I didn’t emphasize enough that most of my optimism comes from the more outside-view considerations: that we’ll get warning signs that the ML community won’t ignore, and that the AI risk arguments are not watertight. The other parts are particular inside-view disagreements that make me more optimistic, but they don’t factor much into my optimism besides being examples of how the meta considerations could play out. I’d recommend this comment of mine to get more of a sense of how the meta considerations factor into my thinking.

I was also glad to see that I still broadly agree with things I said ~5 months ago (since no major new opposing evidence has come up since then), though as I mentioned above, I would now change what I place emphasis on.

Conversation with Robin Hanson (Robin Hanson, Asya Bergal, and Robert Long) (summarized by Rohin): The main theme of this conversation is that AI safety does not look particularly compelling on an outside view. Progress in most areas is relatively incremental and continuous; we should expect the same to be true for AI, suggesting that timelines should be quite long, on the order of centuries. The current AI boom also looks similar to previous AI booms, which didn’t amount to much.

Timelines could be short if progress in AI were “lumpy”, as in a FOOM scenario. This could happen if intelligence were one simple thing that just has to be discovered, but Robin expects that intelligence is actually a bunch of not-very-general tools that together let us do many things, and we simply have to find all of these tools, a process that presumably will not be lumpy. Most of the value from tools comes from more specific, narrow tools, and intelligence should be similar. In addition, the literature on human uniqueness suggests that it wasn’t “raw intelligence” or small changes to brain architecture that made humans unique, but our ability to process culture (communicating via language, learning from others, etc).

In any case, many researchers are now distancing themselves from the FOOM scenario, and are instead arguing that AI risk arises from standard principal-agent problems, in the situation where the agent (AI) is much smarter than the principal (human). Robin thinks that this doesn’t agree with the existing literature on principal-agent problems, in which losses tend to be bounded, even when the agent is smarter than the principal.

You might think that since the stakes are so high, it’s worth working on it anyway. Robin agrees that it’s worth having a few people (say a hundred) pay attention to the problem, but doesn’t think it’s worth spending a lot of effort on it right now. Effort is much more effective and useful once the problem becomes clear, or once you are working with a concrete design; we have neither of these right now, and so we should expect that most effort ends up being ineffective. It would be better if we saved our resources for the future, or if we spent time thinking about other ways that the future could go (as in his book, Age of Em).

It’s especially bad that AI safety has thousands of “fans”, because this leads to a “crying wolf” effect—even if the researchers have subtle, nuanced beliefs, they cannot control the message that the fans convey, which will not be nuanced and will instead confidently predict doom. Then when doom doesn’t happen, people will learn not to believe arguments about AI risk.

Rohin’s opinion: Interestingly, I agree with almost all of this, even though it’s (kind of) arguing that I shouldn’t be doing AI safety research at all. The main place I disagree is with the claim that losses from principal-agent problems with perfectly rational agents are bounded—this seems crazy to me, and I’d be interested in specific paper recommendations (though note I and others have searched and not found many).

On the point about lumpiness, my model is that there are only a few underlying factors (such as the ability to process culture) that allow humans to so quickly learn to do so many tasks, and almost all tasks require near-human levels of these factors to be done well. So, once AI capabilities on these factors reach approximately human level, we will “suddenly” start to see AIs beating humans on many tasks, resulting in a “lumpy” increase on the metric of “number of tasks on which AI is superhuman” (which seems to be the metric that people often use, though I don’t like it, precisely because it seems like it wouldn’t measure progress well until AI becomes near-human-level).

Conversation with Adam Gleave (Adam Gleave et al) (summarized by Rohin): Adam finds the traditional arguments for AI risk unconvincing. First, it isn’t clear that we will build an AI system that is so capable that it can fight all of humanity from its initial position, where it doesn’t have any resources, legal protections, etc. While discontinuous progress in AI could cause this, Adam doesn’t see much reason to expect such discontinuous progress: it seems like AI is progressing by using more computation rather than finding fundamental insights. Second, we don’t know how difficult AI safety will turn out to be; he gives a probability of ~10% that the problem is as hard as (a caricature of) MIRI suggests, where any design not based on mathematical principles will be unsafe. This is especially true because as we get closer to AGI we’ll have many more powerful AI techniques that we can leverage for safety. Third, Adam does expect that AI researchers will eventually solve safety problems; they don’t right now because it seems premature to work on those problems. Adam would be more worried if there were more arms race dynamics, or more empirical evidence or solid theoretical arguments in support of speculative concerns like inner optimizers. He would be less worried if AI researchers spontaneously started to work on the relevant problems (more than they already do).

Adam makes the case for AI safety work differently. At the highest level, it seems possible to build AGI, and some organizations are trying very hard to build AGI, and if they succeed it would be transformative. That alone is enough to justify some effort into making sure such a technology is used well. Then, looking at the field itself, it seems like the field is not currently focused on doing good science and engineering to build safe, reliable systems. So there is an opportunity to have an impact by pushing on safety and reliability. Finally, there are several technical problems that we do need to solve before AGI, such as how we get information about what humans actually want.

Adam also thinks that it’s 40-50% likely that when we build AGI, a PhD thesis describing it would be understandable by researchers today without too much work, but ~50% that it’s something radically different. However, it’s only 10-20% likely that AGI comes only from small variations of current techniques (i.e. by vastly increasing data and compute). He would see this as more likely if we hit additional milestones by investing more compute and data (OpenAI Five was an example of such a milestone).

Rohin’s opinion: I broadly agree with all of this, with two main differences. First, I am less worried about some of the technical problems that Adam mentions, such as how to get information about what humans want, or how to improve the robustness of AI systems, and more concerned about the more traditional problem of how to create an AI system that is trying to do what you want. Second, I am more bullish on the creation of AGI using small variations on current techniques, but vastly increasing compute and data (I’d assign ~30%, while Adam assigns 10-20%).

• There can’t be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left. So, the prior that any particular thing has such an impact should be quite low.

I don’t follow this argument; I also checked the transcript, and I still don’t see why I should buy it. Paul said:

A priori you might’ve been like, well, if you’re going to build some AI, you’re probably going to build the AI so it’s trying to do what you want it to do. Probably that’s that. Plus, most things can’t destroy the expected value of the future by 10%. You just can’t have that many things, otherwise there’s not going to be any value left in the end. In particular, if you had 100 such things, then you’d be down to like 1/1000th of your values. 1/100,000th? I don’t know, I’m not good at arithmetic.

Anyway, that’s a priori: there just aren’t that many things that are that bad, and it seems like people would try and make AI that’s trying to do what they want.

In my words, the argument is “we agree that the future has nontrivial EV, therefore big negative impacts are a priori unlikely”.

But why do we agree about this? Why are we assuming the future can’t be that bleak in expectation? I think there are good outside-view arguments to this effect, but that isn’t the reasoning here.

• E.g. if you have a broad distribution over possible worlds, some of which are “fragile” and have 100 things that cut value down by 10%, and some of which are “robust” and don’t, then you get 10,000x more value from the robust worlds. So unless you are a priori pretty confident that you are in a fragile world (or they are 10,000x more valuable, or whatever), the robust worlds will tend to dominate.
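
To make the rough “10,000x” figure concrete, here is the arithmetic (mine, not from the conversation):

```python
fragile_fraction = 0.9 ** 100  # 100 events, each cutting value by 10%
print(fragile_fraction)        # ~2.66e-05 of the original value survives
print(1 / fragile_fraction)    # ~37,600x: how much more value robust worlds carry
```

(The exact ratio is closer to 40,000x; “10,000x” makes the same point with rounder numbers.)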

Similar arguments work if we aggregate across possible paths to achieving value within a fixed, known world—if there are several ways things can go well, some of which are more robust, those will drive almost all of the EV. And similarly for moral uncertainty (if there are several plausible views, the ones that consider this world a lost cause will instead spend their influence on other worlds) and so forth. I think it’s a reasonably robust conclusion across many different frameworks: your decision shouldn’t end up being dominated by some hugely conjunctive event.

• I’m more uncertain about this one, but I believe that a separate problem with this answer is that it’s an argument about where value comes from, not an argument about what is probable. Let’s suppose 50% of all worlds are fragile and 50% are robust. If most of the things that destroy a world are due to emerging technology, then we still have similar amounts of both worlds around right now (or similar measure on both classes if there are infinitely many of them, or whatever). So it’s not a reason to suspect a non-fragile world right now.

• Yes, but the fact that the fragile worlds are much more likely to end in the future is a reason to condition your efforts on being in a robust world.

While I do buy Paul’s argument, I think it’d be very helpful if the various summaries of the interviews with him were edited to make it clear that he’s talking about value-conditioned probabilities rather than unconditional probabilities—since the claim as originally stated feels misleading. (Even if some decision theories only use the former, most people think in terms of the latter.)

• value-conditioned probabilities

Is this a thing or something you just coined? “Probability” has a meaning; I’m totally against using it for things that aren’t that.

I get why the argument is valid for deciding what we should do – and you could argue that’s the only important thing. But it doesn’t make it more likely that our world is robust, which is what the post was claiming. It’s not about probability, it’s about EV.

• This argument seems to point at some extremely important considerations in the vicinity of “we should act according to how we want civilizations similar to us to act” (rather than just focusing on causally influencing our future light cone), etc.

The details of the distribution over possible worlds that you use here seem to matter a lot. How robust are the “robust worlds”? If they are maximally robust (i.e. things turn out great with probability 1 no matter what the civilization does), then we should assign zero weight to the prospect of being in a “robust world”, and place all our chips on being in a “fragile world”.

Conversely, if the distribution over possible worlds assigns sufficient probability to worlds in which there is a single very risky thing that cuts EV down by either 10% or 90% depending on whether the civilization takes it seriously or not, then perhaps such worlds should dominate our decision making.

• E.g. if you have a broad distribution over possible worlds, some of which are “fragile” and have 100 things that cut value down by 10%, and some of which are “robust” and don’t, then you get 10,000x more value from the robust worlds. So unless you are a priori pretty confident that you are in a fragile world (or they are 10,000x more valuable, or whatever), the robust worlds will tend to dominate.

This is only true if you assume that there is an equal number of robust and fragile worlds out there, and your uncertainty is strictly random, i.e. you’re uncertain about which of those worlds you live in.

I’m not super confident that our world is fragile, but I suspect that most worlds look the same. I.e., maybe 99.99% of worlds are robust, or maybe 99.99% are fragile. If it’s the latter, then I probably live in a fragile world.

• If it’s a 50% chance that 99.99% of worlds are robust and a 50% chance that 99.99% are fragile, then the vast majority of EV comes from the first option, where the vast majority of worlds are robust.
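
Spelling out that expected-value claim (my arithmetic; it treats fragile worlds as contributing roughly no value, per the 0.9^100 calculation above):

```python
p_mostly_robust = 0.5   # hypothesis A: 99.99% of worlds are robust
p_mostly_fragile = 0.5  # hypothesis B: 99.99% of worlds are fragile

# Expected fraction of worlds that survive (and so carry value) under each hypothesis:
ev_share_a = p_mostly_robust * 0.9999
ev_share_b = p_mostly_fragile * 0.0001

print(ev_share_a / ev_share_b)  # ~9999: hypothesis A dominates the EV
```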

• You’re right, the nature of the uncertainty doesn’t actually matter for the EV. My bad.

• A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn’t scale. I don’t know why there are different underlying intuitions here.

I’d be interested to hear a bit more about your position on this.

I’m going to argue for the “applying bandaid fixes that don’t scale” position for a second. To me, it seems that there’s a strong culture in ML of “apply random fixes until something looks like it works” and then just rolling with whatever comes out of that algorithm.

I’ll draw attention to image modelling to illustrate what I’m pointing at. Up until about 2014, the main metric for evaluating image quality was the Bayesian negative log-likelihood (NLL). As far as I can tell, this goes all the way back to at least “To Recognize Shapes, First Learn to Generate Images”, where the CD algorithm acts to minimize the negative log-likelihood of the data. This can be seen in the VAE paper and also the original GAN paper. However, after GANs became popular, the log-likelihood metric seemed to have gone out the window. The GANs made really compelling images. Due to the difficulty of evaluating NLL, people invented new metrics: IS and FID were used to assess the quality of the generated images. I might be wrong, but I think it took a while after that for people to realize that SOTA GANs were getting terrible NLLs compared to SOTA VAEs, even though the VAEs generated images that were significantly blurrier/noisier. It also became obvious that GANs were dropping modes of the distribution, effectively failing to model entire classes of images.

As far as I can tell, there’s been a lot of work to get GANs to model all image modes. The most salient and recent would be DeepMind’s PresGAN, where they clearly show the issue and how PresGAN solves it in Figure 1. However, looking at Table 5, there’s still a huge gap in NLL between PresGAN and VAEs. It seems to me that most of the attempts to solve this issue are very similar to “bandaid fixes that don’t scale”, in the sense that they mostly feel like hacks. None of them really address the gap in likelihood between VAEs and GANs.
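
As a toy illustration of the mode-dropping failure (everything below is a stand-in; no real GAN or VAE is involved): a sampler that silently drops classes can look fine sample-by-sample while failing a simple class-coverage check, which is roughly the check that was missing while everyone optimized for image quality.

```python
import numpy as np

rng = np.random.default_rng(0)

def gan_like_sampler(n: int) -> np.ndarray:
    # Stand-in for a mode-dropping GAN: crisp samples, but only 7 of 10 classes.
    return rng.integers(0, 7, size=n)

def vae_like_sampler(n: int) -> np.ndarray:
    # Stand-in for a VAE: blurrier samples, but all 10 classes are covered.
    return rng.integers(0, 10, size=n)

def class_coverage(labels: np.ndarray, n_classes: int = 10) -> float:
    # Fraction of classes that appear at least once among generated samples.
    return len(np.unique(labels)) / n_classes

for name, sampler in [("GAN-like", gan_like_sampler), ("VAE-like", vae_like_sampler)]:
    print(name, class_coverage(sampler(10_000)))  # expect 0.7 vs 1.0
```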

I’m worried that a similar story could happen with AI safety. A problem arises and gets swept under the rug for a bit. Later, it’s rediscovered and becomes common knowledge. Then, instead of solving it before moving forward, we see massive increases in capabilities. Simultaneously, the problem is at most addressed with hacks that don’t really solve the problem, or solve it just enough to prevent the increase in capabilities from becoming obviously unjustified.

• To me, it seems that there’s a strong culture in ML of “apply random fixes until something looks like it works” and then just rolling with whatever comes out of that algorithm.

I agree that ML often does this, but only in situations where the results don’t immediately matter. I’d find it much more compelling to see examples where the “random fix” caused actual bad consequences in the real world.

I’ll draw attention to image modelling to illustrate what I’m pointing at. [...] It also became obvious that GANs were dropping modes of the distribution, effectively failing to model entire classes of images. [...] None of them really address the gap in likelihood between VAEs and GANs.

Perhaps people are optimizing for “making pretty pictures” instead of “negative log-likelihood”. I wouldn’t be surprised if for many applications of GANs, diversity of images is not actually that important, and what you really want is that the few images you do generate look really good. In that case, it makes complete sense to push primarily on GANs, and while you try to address mode collapse, when faced with a tradeoff you choose GANs over VAEs anyway.

I’m worried that a similar story could happen with AI safety. A problem arises and gets swept under the rug for a bit.

Suppose that we had extremely compelling evidence that any AI system run with > X amount of compute would definitely kill us all. Do you expect that problem to get swept under the rug?

Assuming your answer is no, then it seems like whether a problem gets swept under the rug depends on particular empirical considerations, such as:

• How bad it would be if the problem were real (the magnitude of the downside). This could be evaluated with respect to society and to the individual agents deciding whether or not to deploy the potentially problematic AI.

• How compelling the evidence is that the problem is real.

I tend to think that existing problems with AI are not that bad (though in most cases obviously quite real), while long-term concerns about AI would be very bad, but are not obviously real. If the long-term concerns are real, we should get more evidence about them in the future, and then we’ll have a problem that is both very bad and (more) clearly real, and that’s when I expect it will be taken seriously.

Consider e.g. fairness and bias. Nobody thinks that the problem is solved. People do continue to deploy unfair and biased AI systems, but that’s because the downside of unfair and biased AI systems is smaller in magnitude than the upside of using the AI systems in the first place—they aren’t being deployed because people think they have “solved the problem”.

• I agree that ML often does this, but only in situations where the results don’t immediately matter. I’d find it much more compelling to see examples where the “random fix” caused actual bad consequences in the real world.

[...]

Perhaps people are optimizing for “making pretty pictures” instead of “negative log-likelihood”. I wouldn’t be surprised if for many applications of GANs, diversity of images is not actually that important, and what you really want is that the few images you do generate look really good. In that case, it makes complete sense to push primarily on GANs, and while you try to address mode collapse, when faced with a tradeoff you choose GANs over VAEs anyway.

This is fair. However, the point of the example is more that mode dropping and bad NLL were not noticed when people started optimizing GANs for image quality. As far as I can tell, it took a while for individuals to notice, longer for it to become common knowledge, and even more time for anyone to do anything about it. Even now, the “solutions” are hacks that don’t completely resolve the issue.

There was a large window of time where a practitioner could implement a GAN expecting it to cover all the modes. If there was a world where failing to cover all the modes of the distribution led to large negative consequences, the failure would probably have gone unnoticed until it was too late.

Here’s a real example. This is the NTSB crash report for the Uber autonomous vehicle that killed a pedestrian. Someone should probably do an in-depth analysis of the whole thing, but for now I’ll draw your attention to section 1.6.2, Hazard Avoidance and Emergency Braking. In it they say:

When the system detects an emergency situation, it initiates action suppression. This is a one-second period during which the ADS suppresses planned braking while the (1) system verifies the nature of the detected hazard and calculates an alternative path, or (2) vehicle operator takes control of the vehicle. ATG stated that it implemented action suppression process due to the concerns of the developmental ADS identifying false alarms—detection of a hazardous situation when none exists—causing the vehicle to engage in unnecessary extreme maneuvers.

[...]

if the collision cannot be avoided with the application of the maximum allowed braking, the system is designed to provide an auditory warning to the vehicle operator while simultaneously initiating gradual vehicle slowdown. In such circumstance, ADS would not apply the maximum braking to only mitigate the collision.

This strikes me as a “random fix” where the core issue was that the system did not have sufficient discriminatory power to tell apart a safe situation from an unsafe situation. Instead of properly solving this problem, the researchers put in a hack.
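
To make the structure of that hack visible, here is a rough sketch of the logic as the report excerpt describes it (my paraphrase in code; the names and structure are invented, not taken from the actual ADS):

```python
from dataclasses import dataclass

ACTION_SUPPRESSION_SECONDS = 1.0  # the one-second period from the report excerpt

@dataclass
class Hazard:
    seconds_since_detection: float
    avoidable_with_max_braking: bool

def braking_decision(hazard: Hazard, operator_has_control: bool) -> str:
    if hazard.seconds_since_detection < ACTION_SUPPRESSION_SECONDS and not operator_has_control:
        # The "fix" for false alarms: suppress planned braking while the system
        # re-verifies the hazard or the operator takes over.
        return "suppress braking; verify hazard / compute alternate path"
    if hazard.avoidable_with_max_braking:
        return "apply maximum allowed braking"
    # If the collision cannot be avoided, only warn and slow down gradually,
    # rather than braking maximally to mitigate the impact.
    return "auditory warning + gradual slowdown"

print(braking_decision(Hazard(0.5, True), operator_has_control=False))
```

The response to false positives is a timer wrapped around the planner, rather than better discrimination between hazardous and benign situations.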

Suppose that we had extremely compelling evidence that any AI system run with > X amount of compute would definitely kill us all. Do you expect that problem to get swept under the rug?

I agree that we shouldn’t be worried about situations where there is a clear threat. But that’s not quite the class of failures that I’m worried about. Fairness, bias, and adversarial examples are all closer to what I’m getting at. The general pattern is that ML researchers hack together a system that works, but has some problems they’re unaware of. Later, the problems are discovered and the reaction is to hack together a solution. This is pretty much the opposite of the safety mindset EY was talking about. It leaves room for catastrophe in the initial window when the problem goes undetected, and indefinitely afterwards if the hack is insufficient to deal with the issue.

More specifically, I’m worried about a situation where at some point during grad student descent someone says, “That’s funny...” then goes on to publish their work. Later, someone else deploys their idea plus 3 orders of magnitude more computing power and we all die. That, or we don’t all die. Instead we resolve the issue with a hack. Then a couple bumps in computing power and capabilities later we all die.

The above comes across as both paranoid and far-fetched, and I’m not sure the AI community will take on the required level of caution to prevent it unless we get an AI equivalent of Chernobyl before we get UFAI. Nuclear reactor design is the only domain I know of where people are close to sufficiently paranoid.

• I’m not sure the AI community will take on the required level of caution to prevent it unless we get an AI equivalent of Chernobyl before we get UFAI.

An important thing to remember is that Rohin is explicitly talking about a non-FOOM scenario, so the assumption is that humanity would survive AI-Chernobyl.

• My worry is less that we wouldn’t survive AI-Chernobyl as much as it is that we won’t get an AI-Chernobyl.

I think that this is where there’s a difference in models. Even in a non-FOOM scenario I’m having a hard time envisioning a world where the gap in capabilities between AI-Chernobyl and globally catastrophic UFAI is that large. I used Chernobyl as an example because it scared the public and the industry into making things very safe. It had a lot going for it to make that happen. Radiation is invisible and hurts you by either killing you instantly, making your skin fall off, or giving you cancer and birth defects. The disaster was also extremely expensive, with total costs on the order of US$10^11. If a defective AI system manages to do something that instils the same level of fear into researchers and the public as Chernobyl did, I would expect that we were on the cusp of building systems that we couldn’t control at all. If I’m right and the gap between those two events is small, then there’s a significant risk that nothing will happen in that window. We’ll get plenty of warnings that won’t be sufficient to instil the necessary level of caution into the community, and later down the road we’ll find ourselves in a situation we can’t recover from.

• My impression is that people working on self-driving cars are incredibly safety-conscious, because the risks are very salient. I don’t think AI-Chernobyl has to be a Chernobyl-level disaster, just something that makes the risks salient. E.g. perhaps an elder care AI robot pretends that all of its patients are fine in order to preserve its existence, and this leads to a death and is then discovered. If hospitals let AI algorithms make decisions about drugs according to complicated reward functions, I would expect this to happen with current capabilities. (It’s notable to me that this doesn’t already happen, given the insane hype around AI.)

• My impression is that people working on self-driving cars are incredibly safety-conscious, because the risks are very salient.

Safety-conscious people working on self-driving cars don’t program their cars to not take evasive action after detecting that a collision is imminent.

(It’s notable to me that this doesn’t already happen, given the insane hype around AI.)

I think it already has. (It was for extra care, not drugs, but it’s a clear-cut case of a misspecified objective function leading to suboptimal decisions for a multitude of individuals.) I’ll note, perhaps unfairly, that the fact that this study was not salient enough to make it to your attention even with a culture war signal boost is evidence that it needs to be a Chernobyl-level event.

• I agree that Tesla does not seem very safety-conscious (but it’s notable that they are still safer than human drivers in terms of fatalities per mile, if I remember correctly?)

I think it already has.

Huh, what do you know. Faced with an actual example, I’m realizing that what I actually expect would cause people to take it more seriously is a) the belief that AGI is near, and b) an example where the AI algorithm “deliberately” causes a problem (i.e. “with full knowledge” that the thing it was doing was not what we wanted). I think most deep RL researchers already believe that reward hacking is a thing (which is what that study shows).
• even with a culture war signal boost

Tangential, but that makes it less likely that I read it; I try to completely ignore anything with the term “racial bias” in its title unless it’s directly pertinent to me. (Being about AI isn’t enough to make it pertinent to me.)

• Faced with an actual example, I’m realizing that what I actually expect would cause people to take it more seriously is a) the belief that AGI is near, and b) an example where the AI algorithm “deliberately” causes a problem (i.e. “with full knowledge” that the thing it was doing was not what we wanted).

What do you expect the ML community to do at that point? Coordinate to stop or slow down the race to AGI until AI safety/alignment is solved? Or do you think each company/lab will unilaterally invest more into safety/alignment without slowing down capability research much, and that will be sufficient? Or something else?

I worry about a parallel with the “energy community”, a large part of which not just ignores but actively tries to obscure or downplay warning signs about future risks associated with certain forms of energy production. Given that the run-up to AGI will likely generate huge profits for AI companies as well as provide clear benefits for many people (compared to which, the disasters that will have occurred by then may well seem tolerable), and given probable disagreements between different experts about how serious the future risks are, it seems likely to me that AI risk will become politicized/controversial in a way similar to climate change, which will prevent effective coordination around it. On the other hand… maybe AI will be more like nuclear power than fossil fuels, and a few big accidents will stall its deployment for quite a while. Is this why you’re relatively optimistic about AI risk being taken seriously, and if so can you share why you think nuclear power is a closer analogy?

• What do you expect the ML community to do at that point?

It depends a lot on the particular warning shot that we get. But on the strong versions of warning shots, where there’s common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)

This depends on other background factors, e.g. how much the various actors think they are value-aligned vs. in zero-sum competition. I currently think the ML community thinks they are mostly but not fully value-aligned, and they will influence companies and governments in that direction. (I also want more longtermists to be trying to build more common knowledge of how much humans are value-aligned, to make this more likely.)

I worry about a parallel with the “energy community”

The major disanalogy is that catastrophic outcomes of climate change do not personally affect the CEOs of energy companies very much, whereas AI x-risk affects everyone. (Also, maybe we haven’t gotten clear and obvious warning shots?)
• (compared to which, the disasters that will have occurred by then may well seem tolerable), and given probable disagreements between different experts about how serious the future risks are

I agree that my story requires common knowledge of the risk of building AGI, in the sense that you need people to predict “running this code might lead to all humans dying”, and not “running this code might lead to <warning shot effect>”. You also need relative agreement on the risks. I think this is pretty achievable. Most of the ML community already agrees that building an AGI is high-risk if not done with some argument for safety. The thing people tend to disagree on is when we will get AGI and how much we should work on safety before then.

• But on the strong versions of warning shots, where there’s common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)

To the extent that we expect strong warning shots and the ability to avoid building AGI upon receiving such warning shots, this seems like an argument for researchers/longtermists to work on / advocate for safety problems beyond the standard of “AGI is not trying to deceive us or work against us” (because that standard will likely be reached anyway). Do you agree?

The major disanalogy is that catastrophic outcomes of climate change do not personally affect the CEOs of energy companies very much, whereas AI x-risk affects everyone.

Some types of AI x-risk don’t affect everyone though (e.g., ones that reduce the long-term value of the universe or multiverse without killing everyone in the near term).

• To the extent that we expect strong warning shots and the ability to avoid building AGI upon receiving such warning shots, this seems like an argument for researchers/longtermists to work on / advocate for safety problems beyond the standard of “AGI is not trying to deceive us or work against us” (because that standard will likely be reached anyway). Do you agree?

Yes.

Some types of AI x-risk don’t affect everyone though (e.g., ones that reduce the long-term value of the universe or multiverse without killing everyone in the near term).

Agreed, all else equal those seem more likely to me.

• Ok, I wasn’t sure that you’d agree, but given that you do, it seems that when you wrote the title of this newsletter, “Why AI risk might be solved without additional intervention from longtermists”, you must have meant “Why some forms of AI risk …”, or perhaps certain forms of AI risk just didn’t come to your mind at that time. In either case it seems worth clarifying somewhere that you don’t currently endorse interpreting “AI risk” as “AI risk in its entirety” in that sentence. Similarly, on the inside you wrote:

The main reason I am optimistic about AI safety is that we will see problems in advance, and we will solve them, because nobody wants to build unaligned AI. A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn’t scale. I don’t know why there are different underlying intuitions here.
It seems worth clarifying that you’re only optimistic about certain types of AI safety problems. (I’m basically making the same complaint/suggestion that I made to Matthew Barnett not too long ago. I don’t want to be too repetitive or annoying, so let me know if I’m starting to sound that way.)

• It seems worth clarifying that you’re only optimistic about certain types of AI safety problems.

Tbc, I’m optimistic about all the types of AI safety problems that people have proposed, including the philosophical ones. When I said “all else equal those seem more likely to me”, I meant that if all the other facts about the matter are the same, but one risk affects only future people and not current people, that risk would seem more likely to me because people would care less about it. But I am optimistic about the actual risks that you and others argue for.

That said, over the last week I have become less optimistic specifically about overcoming race dynamics, mostly from talking to people at FHI / GovAI. I’m not sure how much to update though. (Still broadly optimistic.)

it seems that when you wrote the title of this newsletter “Why AI risk might be solved without additional intervention from longtermists” you must have meant “Why some forms of AI risk …”, or perhaps certain forms of AI risk just didn’t come to your mind at that time.

It’s notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines), and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that’s what the four people argued against). This seems like pretty strong evidence that when people hear “AI risk” they now think of technical accidental AI risk, regardless of what the historical definition may have been. I certainly know that is my default assumption when someone (other than you) says “AI risk”. I would certainly support having clearer definitions and terminology if we could all agree on them.

• But I am optimistic about the actual risks that you and others argue for.

Why? I actually wrote a reply that was more questioning in tone, and then changed it because I found some comments you made where you seemed to be concerned about the additional AI risks. Good thing I saved a copy of the original reply, so I’ll just paste it below:

I wonder if you would consider writing an overview of your perspective on AI risk strategy. (You do have a sequence but I’m looking for something that’s more comprehensive, that includes e.g. human safety and philosophical problems. Or let me know if there’s an existing post that I’ve missed.) I ask because you’re one of the most prolific participants here but don’t fall into one of the existing “camps” on AI risk for whom I already have good models. It’s happened several times that I see a comment from you that seems wrong or unclear, but I’m afraid to risk being annoying or repetitive with my questions/objections. (I sometimes worry that I’ve already brought up some issue with you and then forgot your answer.)
It would help a lot to have a better model of you in my head and in writing so I can refer to that to help me interpret what the most likely intended meaning of a comment is, or to predict how you would likely answer if I were to ask certain questions.

It’s notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines), and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that’s what the four people argued against).

Maybe that’s because the question was asked in a way that indicated the questioner was mostly interested in technical accidental AI risk? And some of them may be fine with defining “AI risk” as “AI-caused x-risk” but just didn’t have the other risks at the top of their minds, because their personal focus is on the technical/accidental side. In other words I don’t think this is strong evidence that all 4 people would endorse defining “AI risk” as “technical accidental AI risk”. It also seems notable that I’ve been using “AI risk” in a broad sense for a while and no one has objected to that usage until now.

I would certainly support having clearer definitions and terminology if we could all agree on them.

The current situation seems to be that we have two good (relatively clear) terms, “technical accidental AI risk” and “AI-caused x-risk”, and the dispute is over what plain “AI risk” should be shorthand for. Does that seem fair?

• I ask because you’re one of the most prolific participants here but don’t fall into one of the existing “camps” on AI risk for whom I already have good models.

Seems right. I think my opinions fall closest to Paul’s, though it’s also hard for me to tell what Paul’s opinions are. I think this older thread is a relatively good summary of the considerations I tend to think about, though I’d place different emphases now. (Sadly I don’t have the time to write a proper post about what I think about AI strategy—it’s a pretty big topic.)

The current situation seems to be that we have two good (relatively clear) terms, “technical accidental AI risk” and “AI-caused x-risk”, and the dispute is over what plain “AI risk” should be shorthand for. Does that seem fair?

Yes, though I would frame it as “the ~5 people reading these comments have two clear terms, while everyone else uses a confusing mishmash of terms”. The hard part is in getting everyone else to use the terms. I am generally skeptical of deciding on definitions and getting everyone else to use them, and usually try to use terms the way other people use terms.

In other words I don’t think this is strong evidence that all 4 people would endorse defining “AI risk” as “technical accidental AI risk”. It also seems notable that I’ve been using “AI risk” in a broad sense for a while and no one has objected to that usage until now.

Agreed with this, but see above about trying to conform with the way terms are used, rather than defining terms and trying to drag everyone else along.
This seems odd given your objection to “soft/slow” takeoff usage and your advocacy of “continuous takeoff” ;)

• I don’t think “soft/slow takeoff” has a canonical meaning—some people (e.g. Paul) interpret it as not having discontinuities, while others interpret it as capabilities increasing slowly past human intelligence over (say) centuries (e.g. Superintelligence). If I say “slow takeoff” I don’t know which one the listener is going to hear it as. (And if I had to guess, I’d expect they think of the centuries-long version, which is usually not the one I mean.)

In contrast, I think “AI risk” has a much more canonical meaning, in that if I say “AI risk” I expect most listeners to interpret it as accidental risk caused by the AI system optimizing for goals that are not our own. (Perhaps an important point is that I’m trying to communicate to a much wider audience than the people who read all the Alignment Forum posts and comments. I’d feel more okay about “slow takeoff” if I was just speaking to people who have read many of the posts already arguing about takeoff speeds.)

• AI risk is just a shorthand for “accidental technical AI risk.” To the extent that people are confused, I agree it’s probably worth clarifying the type of risk by adding “accidental” and “technical” whenever we can. However, I disagree with the idea that we should expand the word AI risk to include philosophical failures and intentional risks. If you open the term up, these outcomes might start to happen:

• It becomes unclear in conversation what people mean when they say AI risk.

• Like The Singularity, it becomes a buzzword.

• Journalists start projecting Terminator scenarios onto the words, and now have justification because even the researchers say that AI risk can mean a lot of different things.

• It puts a whole bunch of types of risk into one basket, suggesting to outsiders that all attempts to reduce “AI risk” might be equally worthwhile.

• ML researchers start to distrust AI risk researchers, because people who are worried about the Terminator are using the same words as the AI risk researchers and therefore get associated with them.

This can all be avoided by having a community norm to clarify that we mean technical accidental risk when we say AI risk, and when we’re talking about other types of risks we use more precise terminology.

• AI risk is just a shorthand for “accidental technical AI risk.”

I don’t think “AI risk” was originally meant to be a shorthand for “accidental technical AI risk”. The earliest considered (i.e., not off-hand) usage I can find is in the title of Luke Muehlhauser’s AI Risk and Opportunity: A Strategic Analysis, where he defined it as “the risk of AI-caused extinction”. (He used “extinction”, but nowadays we tend to think in terms of “existential risk”, which also includes “permanent large negative consequences”; that seems like a reasonable expansion of “AI risk”.)

However, I disagree with the idea that we should expand the word AI risk to include philosophical failures and intentional risks.
I want to include philosophical failures, as long as the consequences of the failures flow through AI, because (aside from historical usage) technical problems and philosophical problems blend into each other, and I don’t see a point in drawing an arbitrary and potentially contentious border between them. (Is UDT a technical advance or a philosophical advance? Is defining the right utility function for a Sovereign Singleton a technical problem or a philosophical problem? Why force ourselves to answer these questions?) As for “intentional risks”, it’s already common practice to include that in “AI risk”:

Dividing AI risks into misuse risks and accident risks has become a prevailing approach in the field.

Besides that, I think there’s also a large grey area between “accident risk” and “misuse” where the risk partly comes from technical/philosophical problems and partly from human nature. For example, humans might be easily persuaded by wrong but psychologically convincing moral/philosophical arguments that AIs can come up with, and then order their AIs to do terrible things. Even pure intentional risks might have technical solutions. Again I don’t really see the point of trying to figure out which of these problems should be excluded from “AI risk”.

It becomes unclear in conversation what people mean when they say AI risk.

It seems perfectly fine to me to use that as shorthand for “AI-caused x-risk” and use more specific terms when we mean more specific risks.

Like The Singularity, it becomes a buzzword.

What do you mean? Like people will use “AI risk” when their project has nothing to do with “AI-caused x-risk”? Couldn’t they do that even if we define “AI risk” to be “accidental technical AI risk”?

Journalists start projecting Terminator scenarios onto the words, and now have justification because even the researchers say that AI risk can mean a lot of different things.

Terminator scenarios seem to be scenarios of “accidental technical AI risk” (they’re just not very realistic scenarios), so I don’t see how defining “AI risk” to mean that would prevent journalists from using Terminator scenarios to illustrate “AI risk”.

It puts a whole bunch of types of risk into one basket, suggesting to outsiders that all attempts to reduce “AI risk” might be equally worthwhile.

I don’t think this is a good argument, because even within “accidental technical AI risk” there are different problems that aren’t equally worthwhile to solve, so why aren’t you already worried about outsiders thinking all those problems are equally worthwhile?

ML researchers start to distrust AI risk researchers, because people who are worried about the Terminator are using the same words as the AI risk researchers and therefore get associated with them.

See my response above regarding “Terminator scenarios”.

This can all be avoided by having a community norm to clarify that we mean technical accidental risk when we say AI risk, and when we’re talking about other types of risks we use more precise terminology.

I propose that we instead stick with historical precedent and keep “AI risk” to mean “AI-caused x-risk”, and use more precise terminology to refer to more specific types of AI-caused x-risk that we might want to talk about.
Aside from what I wrote above, it’s just more intuitive/commonsensical that “AI risk” means “AI-caused x-risk” in general instead of a specific kind of AI-caused x-risk. However I appreciate that someone who works mostly on the less philosophical / less human-related problems might find it tiresome to say or type “technical accidental AI risk” all the time to describe what they do or to discuss the importance of their work, and can find it very tempting to just use “AI risk”. It would probably be good to create a (different) shorthand or acronym for it to remove this temptation and to make their lives easier.

• I appreciate the arguments, and I think you’ve mostly convinced me, mostly because of the historical argument. I do still have some remaining apprehension about using AI risk to describe every type of risk arising from AI.

I want to include philosophical failures, as long as the consequences of the failures flow through AI, because (aside from historical usage) technical problems and philosophical problems blend into each other, and I don’t see a point in drawing an arbitrary and potentially contentious border between them.

That is true. The way I see it, UDT is definitely on the technical side, even though it incorporates a large amount of philosophical background. When I say technical, I mostly mean “specific, uses math, has clear meaning within the language of computer science” rather than a more narrow meaning of “is related to machine learning” or something similar.

My issue with arguing for philosophical failure is that, as I’m sure you’re aware, there’s a well known failure mode of worrying about vague philosophical problems rather than more concrete ones. Within academic philosophy, the majority of discussion surrounding AI is centered around consciousness, intentionality, whether it’s possible to even construct a human-like machine, whether they should have rights, etc. There’s a unique thread of philosophy that arose from Lesswrong, which includes work on decision theory, that doesn’t focus on these thorny and low priority questions. While I’m comfortable with you arguing that philosophical failure is important, my impression is that the overly philosophical approach used by many people has done more harm than good for the field in the past, and continues to do so. It is therefore sometimes nice to tell people that the problems that people work on here are concrete and specific, and don’t require doing a ton of abstract philosophy or political advocacy.

I don’t think this is a good argument, because even within “accidental technical AI risk” there are different problems that aren’t equally worthwhile to solve, so why aren’t you already worried about outsiders thinking all those problems are equally worthwhile?

This is true, but my impression is that when you tell people that a problem is “technical” it generally makes them refrain from having a strong opinion before understanding a lot about it. “Accidental” also reframes the discussion by reducing the risk of polarizing biases.
This is a common theme in many fields:

• Physicists sometimes get frustrated with people arguing about “the philosophy of the interpretation of quantum mechanics” because there’s a large subset of people who think that since it’s philosophical, then you don’t need to have any subject-level expertise to talk about it.

• Economists try to emphasize that they use models and empirical data, because a lot of people think that their field of study is more-or-less just high status opinion + math. Emphasizing that there are real, specific models that they study helps to reduce this impression. Same with political science.

• A large fraction of tech workers are frustrated about the use of Machine Learning as a buzzword right now, and part of it is that people started saying Machine Learning = AI rather than Machine Learning = Statistics, and so a lot of people thought that even if they don’t understand statistics, they can understand AI since that’s like philosophy and stuff.

Scott Aaronson has said:

But I’ve drawn much closer to the community over the last few years, because of a combination of factors: [...] The AI-risk folks started publishing some research papers that I found interesting—some with relatively approachable problems that I could see myself trying to think about if quantum computing ever got boring. This shift seems to have happened at roughly around the same time my former student, Paul Christiano, “defected” from quantum computing to AI-risk research.

My guess is that this shift in his thinking occurred because a lot of people started talking about technical risks from AI, rather than framing it as a philosophy problem, or a problem of eliminating bad actors. Eliezer has shared this viewpoint for years, writing in the CEV document,

Warning: Beware of things that are fun to argue.

reflecting the temptation to derail discussions about technical accidental risks.

• Also, isn’t defining “AI risk” as “technical accidental AI risk” analogous to defining “apple” as “red apple” (in terms of being circular/illogical)? I realize natural language doesn’t have to be perfectly logical, but this still seems a bit too egregious.

• I agree that this is troubling, though I think it’s similar to how I wouldn’t want the term biorisk to be expanded to include biodiversity loss (a risk, but not the right type), regular human terrorism (humans are biological, but it’s a totally different issue), zombie uprisings (they are biological, but it’s totally ridiculous), alien invasions, etc. Not to say that’s what you are doing with AI risk. I’m worried about what others will do with it if the term gets expanded.

• I agree that this is troubling, though I think it’s similar to how I wouldn’t want the term biorisk to be expanded …

Well, as I said, natural language doesn’t have to be perfectly logical, and I think “biorisk” is somewhat in that category, but there’s an explanation that makes it a bit more reasonable than it might first appear, which is that the “bio” refers not to “biological” but to “bioweapon”. This is actually one of the definitions that Google gives when you search for “bio”:
“relating to or involving the use of toxic biological or biochemical substances as weapons of war. ‘bioterrorism’”

I guess the analogous thing would be if we start using “AI” to mean “technical AI accidents” in a bunch of phrases, which feels worse to me than the “bio” case, maybe because “AI” is a standalone word/acronym instead of a prefix? Does this make sense to you?

Not to say that’s what you are doing with AI risk. I’m worried about what others will do with it if the term gets expanded.

But the term was expanded from the beginning. Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?

• Does this make sense to you?

Yeah, that makes sense. Your points about “bio” not being short for “biological” were valid, but the fact that as a listener I didn’t know that fact implies that it seems really easy to mess up the language usage here. I’m starting to think that the real fight should be about using terms that aren’t self-explanatory.

Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?

I’m not sure about whether it would have been prevented by using the term more narrowly, but in my experience the most common reaction people outside of EA/LW (and even sometimes within) have to hearing about AI risk is to assume that it’s not technical, and to assume that it’s not about accidents. In that sense, I have been exposed to quite a bit of this already.

• As far as I can tell, it took a while for individuals to notice, longer for it to become common knowledge, and even more time for anyone to do anything about it.

Tangential, but I wouldn’t be surprised if researchers were fairly quickly aware of the issue (e.g. within two years of the original GAN paper), but it took a while to become common knowledge because it isn’t particularly flashy. (There’s a surprising-to-me amount of know-how that is stored in researchers’ brains and never put down on paper.)

Even now, the “solutions” are hacks that don’t completely resolve the issue.

I mean, the solution is to use a VAE. If you care about covering modes but not image quality, you choose a VAE; if you care about image quality but not covering modes, you choose a GAN. (Also, while I know very little about VAEs / GANs, Implicit Maximum Likelihood Estimation sounded like a principled fix to me.)

This strikes me as a “random fix” where the core issue was that the system did not have sufficient discriminatory power to tell apart a safe situation from an unsafe situation. Instead of properly solving this problem, the researchers put in a hack.

Agreed, I would guess that the researchers / engineers knew this was risky and thought it was worth it anyway. Or perhaps the managers did. But I do agree this is evidence against my position.

I agree that we shouldn’t be worried about situations where there is a clear threat. But that’s not quite the class of failures that I’m worried about. [...] Later, the problems are discovered and the reaction is to hack together a solution.

Why isn’t the threat clear once the problems are discovered?

unless we get an AI equivalent of Chernobyl before we get UFAI.

Part of my claim is that we probably will get that (assuming AI really is risky), though perhaps not a Chernobyl-level disaster, but still something with real negative consequences that “could be worse”.
• Why isn’t the threat clear once the problems are discovered?

I think I should be more specific. When you say:

Suppose that we had extremely compelling evidence that any AI system run with > X amount of compute would definitely kill us all. Do you expect that problem to get swept under the rug?

I mean that no one sane who knows that will run that AI system with > X amount of computing power. When I wrote that comment I also thought that no one sane would not blow the whistle in that event. See my note at the end of the comment.* However, when presented with that evidence, I don’t expect the AI community to react appropriately. The correct response to that evidence is to stop what you’re doing, and revisit the entire process and culture that led to the creation of an algorithm that will kill us all if run with > X amount of compute. What I expect will happen is that the AI community will try and solve the problem the same way it’s solved every other problem it has encountered. It will try an inordinate number of unprincipled hacks to get around the issue.

Part of my claim is that we probably will get that (assuming AI really is risky), though perhaps not a Chernobyl-level disaster, but still something with real negative consequences that “could be worse”.

Conditional on no FOOM, I can definitely see plenty of events with real negative consequences that “could be worse”. However, I claim that anything short of a Chernobyl-level event won’t shock the community and the world into changing its culture or trying to coordinate. I also claim that the capabilities gap between a Chernobyl-level event and a global catastrophic event is small, such that even in a non-FOOM scenario the former might not happen before the latter. Together, I think that there is a high probability that we will not get a disaster that is scary enough to get the AI community to change its culture and coordinate before it’s too late.

*Now that I think about it more though, I’m less sure. Undergraduate engineers get entire lectures dedicated to how and when to blow the whistle when faced with unethical corporate practices and dangerous projects or designs. When working, they also have insurance and some degree of legal protection from vengeful employers. Even then, you still see cover-ups of shortcomings that lead to major industrial disasters. For instance, long before the disaster, someone had determined that the Fukushima plant was indeed vulnerable to large tsunami impacts. The pattern where someone knows that something will go wrong but nothing is done to prevent it for one reason or another is not that uncommon in engineering disasters. Regardless of whether this is due to hindsight bias or an inadequate process for addressing safety issues, these disasters still happen regularly in fields with far more conservative, cautious, and safety-oriented cultures. I find it unlikely that the field of AI will change its culture from one of moving fast and hacking to something even more conservative and cautious than the cultures of consumer aerospace and nuclear engineering.

• Idk, I don’t know what to say here. I meet lots of AI researchers, and the best ones seem to me to be quite thoughtful.
I can say what would change my mind: I take the exploration of unprincipled hacks as very weak evidence against my position, if it’s just in an academic paper. My guess is the researchers themselves would not advocate deploying their solution, or would say that it’s worth deploying but it’s an incremental improvement that doesn’t solve the full problem. And even if the researchers don’t say that, I suspect the companies actually deploying the systems would worry about it.

I would take the deployment of unprincipled hacks more seriously as evidence, but even there I would want to be convinced that shutting down the AI system was a better decision than deploying an unprincipled hack. (Because then I would have made the same decision in their shoes.) Unprincipled hacks are in fact quite useful for the vast majority of problems; as a result it seems wrong to attribute irrationality to people because they use unprincipled hacks.

• I agree that ML often does this, but only in situations where the results don’t immediately matter. I’d find it much more compelling to see examples where the “random fix” caused actual bad consequences in the real world.

Current ML culture is to test 100s of things in a lab until one works. This is fine as long as the AIs being tested are not smart enough to break out of the lab, or realize they are being tested and play nice until deployment. The default way to test a design is to run it and see, not to reason abstractly about it.

and then we’ll have a problem that is both very bad and (more) clearly real, and that’s when I expect that it will be taken seriously.

Part of the problem is that we have a really strong unilateralist’s curse. It only takes 1, or a few, people who don’t realize the problem to make something really dangerous. Banning it is also hard: law enforcement isn’t 100% effective, different countries have different laws, and the main real-world ingredient is access to a computer.

If the long-term concerns are real, we should get more evidence about them in the future, … I expect that it will be taken seriously.

The people who are ignoring or don’t understand the current evidence will carry on ignoring or not understanding it. A few more people will be convinced, but don’t expect to convince a creationist with one more transitional fossil.

• Part of the problem is that we have a really strong unilateralist’s curse. It only takes 1, or a few, people who don’t realize the problem to make something really dangerous.

This is a foom-ish assumption; remember that Rohin is explicitly talking about a non-foom scenario.

• ^ Yeah, in FOOM worlds I agree more with your (Donald’s) reasoning. (Though I still have questions, like, how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?)

The people who are ignoring or don’t understand the current evidence will carry on ignoring or not understanding it.

I don’t think we have good current evidence, so I don’t infer much about whether or not people will buy future evidence from their reactions to current evidence. (See also six heuristics that I think cut against AI risk even after knowing the arguments for AI risk.)
• Though I still have questions, like, how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?

You mentioned that, conditional on foom, you’d be confused about what the world looks like. Is this the main thing you’re confused about in foom worlds, or are there other major things too?

• Lots of other things:

• Are we imagining a small team of hackers in their basement trying to get AGI on a laptop, or a big corporation using tons of resources?

• How does the AGI learn about the world? If you say “it reads the Internet”, how does it learn to read?

• When the developers realize that they’ve built AGI, is it still possible for them to pull the plug?

• Why doesn’t the AGI try to be deceptive in ways that we can detect, the way children do? Is it just immediately as capable as a smart human and doesn’t need any training? How can that happen by just “finding the right architecture”?

• Why is this likely to happen soon when it hasn’t happened in the last sixty years?

I suspect answers to these will provoke lots of other questions. In contrast, the non-foom worlds that still involve AGI + very fast growth seem much closer to a “business-as-usual” world.

I also think that if you’re worried about foom, you should basically not care about any of the work being done at DeepMind / OpenAI right now, because that’s not the kind of work that can foom (except in the “we suddenly find the right architecture” story); yet I notice lots of doomy predictions about AGI are being driven by DM / OAI’s work. (Of course, plausibly you think OpenAI / DM are not going to succeed, even if others do.)

• I’m going to start a fresh thread on this; it sounds more interesting (at least to me) than most of the other stuff being discussed here.

• Yeah, in FOOM worlds I agree more with your (Donald’s) reasoning. (Though I still have questions, like, how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?)

If there’s an implicit assumption here that FOOM worlds require someone to stumble upon “the correct mathematical principles underlying intelligence”, I don’t understand why such an assumption is justified. For example, suppose that at some point in the future some top AI lab will throw $1B at a single massive neural architecture search—over some arbitrary slightly-novel architecture space—and that NAS will stumble upon some complicated architecture whose corresponding model, after being trained with a massive amount of computing power, will implement an AGI.

• and that NAS will stumble upon some complicated architecture whose corresponding model, after being trained with a massive amount of computing power, will implement an AGI.

In this case I’m ask­ing why the NAS stum­bled upon the cor­rect math­e­mat­i­cal ar­chi­tec­ture un­der­ly­ing in­tel­li­gence.

Or rather, let’s dis­pense with the word “math­e­mat­i­cal” (which I mainly used be­cause it seems to me that the ar­gu­ments for FOOM usu­ally in­volve some­one com­ing up with the right math­e­mat­i­cal in­sight un­der­ly­ing in­tel­li­gence).

It seems to me that to get FOOM you need the prop­erty “if you make even a slight change to the thing, then it breaks and doesn’t work”, which I’ll call frag­ility. Note that you can­not find frag­ile things us­ing lo­cal search, ex­cept if you “get lucky” and start out at the cor­rect solu­tion.

Why did the NAS stum­ble upon the cor­rect frag­ile ar­chi­tec­ture un­der­ly­ing in­tel­li­gence?
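As a toy illustration of the fragility claim (a minimal sketch: the one-dimensional “architecture space”, the landscapes, and all the numbers are invented for illustration, not a model of any real search):

```python
import random

def hill_climb(score, start, steps=1000, stddev=0.05):
    """Greedy local search: accept a random perturbation only if it
    strictly improves the score."""
    x = start
    for _ in range(steps):
        candidate = x + random.gauss(0, stddev)
        if score(candidate) > score(x):
            x = candidate
    return score(x)

# Non-fragile landscape: small changes to x cause small changes in
# score, so every step of the search gets a signal to follow.
smooth = lambda x: -abs(x - 0.7)

# Fragile landscape: the solution only "works" in a tiny region, and
# everywhere else the score is flat, so local search gets no signal.
fragile = lambda x: 1.0 if abs(x - 0.7) < 1e-6 else 0.0

random.seed(0)
starts = [random.uniform(0, 1) for _ in range(100)]
print(max(hill_climb(smooth, s) for s in starts))   # ~0.0: the peak is found
print(max(hill_climb(fragile, s) for s in starts))  # 0.0, unless a start "got lucky"
```

The smooth landscape rewards every intermediate step, so all runs converge; the fragile one gives the search nothing to climb, so it succeeds only if a run happens to start at the solution.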

• It seems to me that to get FOOM you need the prop­erty “if you make even a slight change to the thing, then it breaks and doesn’t work”

The above ‘FOOM via $1B NAS’ sce­nario doesn’t seem to me to re­quire this prop­erty. No­tice that the in­crease in ca­pa­bil­ities dur­ing that NAS may be grad­ual (i.e. be­fore eval­u­at­ing the model that im­ple­ments an AGI the NAS eval­u­ates mod­els that are “al­most AGI”). The sce­nario would still count as a FOOM as long as the NAS yields an AGI and no model be­fore that NAS ever came close to AGI. Con­di­tioned on [$1B NAS yields the first AGI], a FOOM seems to me par­tic­u­larly plau­si­ble if ei­ther:

1. no pre­vi­ous NAS at a similar scale was ever car­ried out; or

2. the “path in model space” that the NAS traverses is very different from all the paths that previous NASs traversed. This seems to me plausible even if the model space of the $1B NAS is identical to ones used in previous NASs (e.g. if different random seeds yield very different paths); and it seems to me even more plausible if the model space of the $1B NAS is slightly novel.

• The above ‘FOOM via $1B NAS’ scenario doesn’t seem to me to require this property. Notice that the increase in capabilities during that NAS may be gradual (i.e. before evaluating the model that implements an AGI the NAS evaluates models that are “almost AGI”). The scenario would still count as a FOOM as long as the NAS yields an AGI and no model before that NAS ever came close to AGI.

In this case I’d apply the fragility argument to the research process, which was my original point (though it wasn’t phrased as well then). In the NAS setting, my question is: how exactly did someone stumble upon the correct NAS to run that would lead to intelligence by trial and error?

Basically, if you’re arguing that most ML researchers just do a bunch of trial-and-error, then you should be modeling ML research as a local search in idea-space, and then you can apply the same fragility argument to it.

• Conditioned on [$1B NAS yields the first AGI], that NAS itself may essentially be “a local search in idea-space”. My argument is that such a local search in idea-space need not start in a world where “almost-AGI” models already exist (I listed in the grandparent two disjunctive reasons in support of this).

Re­lat­edly, “mod­el­ing ML re­search as a lo­cal search in idea-space” is not nec­es­sar­ily con­tra­dic­tory to FOOM, if an im­por­tant part of that lo­cal search can be car­ried out with­out hu­man in­volve­ment (which is a sup­po­si­tion that seems to be sup­ported by the rise of NAS and meta-learn­ing ap­proaches in re­cent years).

I don’t see how my rea­son­ing here re­lies on it be­ing pos­si­ble to “find frag­ile things us­ing lo­cal search”.

• (I listed in the grand­par­ent two dis­junc­tive rea­sons in sup­port of this).

Okay, re­spond­ing to those di­rectly:

no pre­vi­ous NAS at a similar scale was ever car­ried out; or

• What caused the researchers to go from “$1M run of NAS” to “$1B run of NAS”, without first trying “$10M run of NAS”? I especially have this question if you’re modeling ML research as “trial and error”; I can imagine justifying a $1B experiment before a $10M experiment if you have some compelling reason that the result you want will happen with the $1B experiment but not the $10M experiment; but if you’re doing trial and error then you don’t have a compelling reason.

• Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don’t we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?

• Suppose that such a NAS did lead to human-level AGI. Shouldn’t that mean that the AGI makes progress in AI at the same rate that we did? How does that cause a FOOM? (Yes, the improvements the AI makes compound, whereas the improvements we make to AI don’t compound, but to me that’s the canonical case of continuous takeoff, e.g. as described in Takeoff speeds.)

the “path in model space” that the NAS traverses is very different from all the paths that previous NASs traversed. This seems to me plausible even if the model space of the $1B NAS is identical to ones used in previous NASs (e.g. if different random seeds yield very different paths); and it seems to me even more plausible if the model space of the $1B NAS is slightly novel.

In all the previous NASs, why did the paths taken produce AI systems that were so much worse than the one taken by the $1B NAS? Did the $1B NAS just get lucky? (Again, this really sounds like a claim that “the path taken by NAS” is fragile.)

Relatedly, “modeling ML research as a local search in idea-space” is not necessarily contradictory to FOOM, if an important part of that local search can be carried out without human involvement

If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:

• The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn’t it automated earlier?)

• The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)

The second point definitely doesn’t apply to NAS and meta-learning, and I would argue that the first point doesn’t apply either, though that’s not obvious.

• What caused the researchers to go from “$1M run of NAS” to “$1B run of NAS”, without first trying “$10M run of NAS”? I especially have this question if you’re modeling ML research as “trial and error”;

I indeed model a big part of contemporary ML research as “trial and error”. I agree that it seems unlikely that before the first $1B NAS there won’t be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don’t we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?

If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don’t fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution). More generally, when using trend extrapolations in AI, consider the following from this Open Phil blog post (2016) by Holden Karnofsky (footnote 7):

The most exhaustive retrospective analysis of historical technology forecasts we have yet found, Mullins (2012), categorized thousands of published technology forecasts by methodology, using eight categories including “multiple methods” as one category. [...] However, when comparing success rates for methodologies solely within the computer technology area tag, quantitative trend analysis performs slightly below average,

(The link in the quote appears to be broken; here is one that works.)

NAS seems to me like a good example of an expensive computation that could plausibly constitute a “search in idea-space” that finds an AGI model (without human involvement). But my argument here applies to any such computation. I think it may even apply to a ‘$1B SGD’ (on a single huge network), if we consider a gradient update (or a sequence thereof) to be an “exploration step in idea-space”.

Sup­pose that such a NAS did lead to hu­man-level AGI. Shouldn’t that mean that the AGI makes progress in AI at the same rate that we did?

I first need to un­der­stand what “hu­man-level AGI” means. Can mod­els in this cat­e­gory pass strong ver­sions of the Tur­ing test? Does this cat­e­gory ex­clude sys­tems that out­perform hu­mans on one or more im­por­tant di­men­sions? (It seems to me that the first SGD-trained model that passes strong ver­sions of the Tur­ing test may be a su­per­in­tel­li­gence.)

In all the pre­vi­ous NASs, why did the paths taken pro­duce AI sys­tems that were so much worse than the one taken by the $1B NAS? Did the$1B NAS just get lucky?

Yes, the $1B NAS may indeed just get lucky. A local search sometimes gets lucky (in the sense of finding a local optimum that is a lot better than the ones found in most runs; not in the sense of miraculously starting the search at a great fragile solution). [EDIT: also, something about this NAS might be slightly novel—like the neural architecture space.]

If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:

• The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn’t it automated earlier?)

• The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)

In some past cases where humans did not serve any role in performance gains that were achieved with more compute/data (e.g. training GPT-2 by scaling up GPT), there were no humans to replace. So I don’t understand the question “why wasn’t it automated earlier?”

In the second point, I need to first understand how you define that moment in which “humans are replaced”. (In the $1B NAS scenario, would that moment be the one in which the NAS is invoked?)
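As a toy picture of what “a local search sometimes gets lucky” could look like (again, a made-up one-dimensional landscape and made-up numbers, purely illustrative):

```python
import math
import random

def local_search(score, steps=2000, stddev=0.01, seed=None):
    """Hill climbing from a random start; different seeds follow
    different paths and end in different local optima."""
    rng = random.Random(seed)
    x = rng.uniform(0, 1)
    for _ in range(steps):
        candidate = x + rng.gauss(0, stddev)
        if score(candidate) > score(x):
            x = candidate
    return score(x)

# Many ordinary local optima of height ~1, plus one much better,
# narrow peak that only a rare run ever stumbles into.
def score(x):
    return math.sin(40 * x) + (5.0 if 0.710 < x < 0.711 else 0.0)

results = sorted(local_search(score, seed=s) for s in range(1000))
print(results[500])  # the typical run: an ordinary local optimum, ~1.0
print(results[-1])   # the best run may be far better, if any run "got lucky"
```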

• Meta: I feel like I am ar­gu­ing for “there will not be a dis­con­ti­nu­ity”, and you are in­ter­pret­ing me as ar­gu­ing for “we will not get AGI soon /​ AGI will not be trans­for­ma­tive”, nei­ther of which I be­lieve. (I have wide un­cer­tainty on timelines, and I cer­tainly think AGI will be trans­for­ma­tive.) I’d like you to state what po­si­tion you think I’m ar­gu­ing for, taboo­ing “dis­con­ti­nu­ity” (not the ar­gu­ments for it, just the po­si­tion).

I indeed model a big part of contemporary ML research as “trial and error”. I agree that it seems unlikely that before the first $1B NAS there won’t be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me. I’m more uncertain about the fire alarm question.

quantitative trend analysis performs slightly below average [...] NAS seems to me like a good example of an expensive computation that could plausibly constitute a “search in idea-space” that finds an AGI model [...] it may even apply to a ‘$1B SGD’ (on a single huge network) [...] the $1B NAS may indeed just get lucky

This sounds to me like saying “well, we can’t trust predictions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”. I am not compelled by arguments that tell me to worry about scenario X without giving me a reason to believe that scenario X is likely. (Compare: “we can’t rule out the possibility that the simulators want us to build a tower to the moon or else they’ll shut off the simulation, so we better get started on that moon tower.”) This is not to say that such scenario Xs must be false—reality could be that way—but that given my limited amount of time, I must prioritize which scenarios to pay attention to, and one really good heuristic for that is to focus on scenarios that have some inside-view reason that makes me think they are likely. If I had infinite time, I’d eventually consider these scenarios (even the simulators-wanting-us-to-build-a-moon-tower hypothesis).

Some other more tangential things:

If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don’t fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution).

The trend that changed in 2012 was that of the amount of compute applied to deep learning. I suspect trend extrapolation with compute as the x-axis would do okay; trend extrapolation with calendar year as the x-axis would do poorly. But as I mentioned above, this is not a crux for me, since it doesn’t give me an inside-view reason to expect FOOM; I wouldn’t even consider it weak evidence for FOOM if I changed my mind on this. (If the data showed a big discontinuity, that would be evidence, but I’m fairly confident that while there was a discontinuity it was relatively small.)

• I’d like you to state what position you think I’m arguing for

I think you’re arguing for something like:

Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition “we will create an AGI sometime in the next 30 days”.

(Tbc, I did not interpret you as arguing about timelines or AGI transformativeness; and neither did I argue about those things here.)
I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me.

Using the “fire alarm” concept here was a mistake, sorry for that. Instead of writing:

I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

I should have writ­ten:

I’m pretty agnostic about whether the result of that $100M NAS would be “almost AGI”.

This sounds to me like saying “well, we can’t trust predictions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”.

I generally have a vague impression that many AIS/x-risk people tend to place too much weight on trend extrapolation arguments in AI (or tend to not give enough attention to important details of such arguments), which may have triggered me to write the related stuff (in response to you seemingly applying a trend extrapolation argument with respect to NAS). I was not listing the reasons for my beliefs specifically about NAS.

If I had infinite time, I’d eventually consider these scenarios (even the simulators-wanting-us-to-build-a-moon-tower hypothesis).

(I’m mindful of your time and so I don’t want to branch out this discussion into unrelated topics, but since this seems to me like a potentially important point...) Even if we did have infinite time and the ability to somehow determine the correctness of any given hypothesis with super-high-confidence, we may not want to evaluate all hypotheses—that involve other agents—in arbitrary order. Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time). For example, after considering some game-theoretical meta considerations we might decide to make certain binding commitments before evaluating such and such hypotheses; or we might decide about what additional things we should consider or do before evaluating some other hypotheses, etcetera. Conditioned on the first AGI being aligned, it may be important to figure out how we make sure that that AGI “behaves wisely” with respect to this topic (because the AGI might be able to evaluate a lot of weird hypotheses that we can’t).

• Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time).

Can you give me an example? I don’t see how this would work. (Tbc, I’m imagining that the universe stops, and only I continue thinking; there are no other agents thinking while I’m thinking, and so afaict I should just implement UDT.)

• Creating some sort of commitment device that would bind us to follow UDT—before we evaluate some set of hypotheses—is an example of one potentially consequential intervention. As an aside, my understanding is that in environments that involve multiple UDT agents, UDT doesn’t necessarily work well (or is not even well-defined?). Also, if we would use SGD to train a model that ends up being an aligned AGI, maybe we should figure out how to make sure that that model “follows” a good decision theory. (Or does this happen by default? Does it depend on whether “following a good decision theory” is helpful for minimizing expected loss on the training set?)

• Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition “we will create an AGI sometime in the next 30 days”.

It wasn’t exactly that (in particular, I didn’t have the researchers’ beliefs in mind), but I also believe that statement for basically the same reasons, so that should be fine.
There’s a lot of ambiguity in that statement (specifically, what is AGI), but I probably believe it for most operationalizations of AGI. (For reference, I was considering “will there be a 1-year doubling of economic output that started before the first 4-year doubling of economic output ended”; for that it’s not sufficient to just argue that we will get AGI suddenly, you also have to argue that the AGI will very quickly become superintelligent enough to double economic output in a very short amount of time.)

I’m pretty agnostic about whether the result of that $100M NAS would be “almost AGI”.

I mean, the differ­ence be­tween a $100M NAS and a$1B NAS is:

• Up to 10x the num­ber of mod­els evaluated

• Up to 10x the size of mod­els evaluated

If you increase the number of models by 10x and leave the size the same, that somewhat increases your optimization power. If you model the NAS as picking architectures randomly, the $1B NAS can have at most 10x the chance of finding AGI, regardless of fragility, and so can only have at most 10x the expected “value” (whatever your notion of “value”). If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn’t make much of a difference, e.g. the max of n draws from Uniform([0, 1]) has expected value n/(n+1), so once n is already large (e.g. 100), increasing it makes ~no difference. Of course, our actual distributions will probably be more bottom-heavy, but as distributions get more bottom-heavy we use gradient descent / evolutionary search to deal with that.

For the size, it’s possible that increases in size lead to huge increases in intelligence, but that doesn’t seem to agree with ML practice so far. Even if you ignore trend extrapolation, I don’t see a reason to expect that increasing model sizes should mean the difference between not-even-close-to-AGI and AGI.

• If you model the NAS as picking architectures randomly

I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)

If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn’t make much of a difference,

Earlier in this discussion you defined fragility as the property “if you make even a slight change to the thing, then it breaks and doesn’t work”. While finding fragile solutions is hard, finding non-fragile solutions is not necessarily easy, so I don’t follow the logic of that paragraph. Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be 10^-10 (and then running our evolutionary search 10x as many times means roughly 10x the probability of finding an AGI architecture, if [number of runs] << 10^10).

• I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)

I do think that similar conclusions apply there as well, though I’m not going to make a mathematical model for it.

finding non-fragile solutions is not necessarily easy

I’m not saying it is; I’m saying that however hard it is to find a non-fragile good solution, it is easier to find a solution that is almost as good. When I say

adding more optimization power doesn’t make much of a difference

I mean to imply that the existing optimization power will do most of the work, for whatever quality of solution you are getting.

Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be 10^-10 (and then running our evolutionary search 10x as many times means roughly 10x the probability of finding an AGI architecture, if [number of runs] << 10^10).

(Aside: it would be way smaller than 10^-10.) In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (e.g. by several orders of magnitude), and so you’re more likely to find one of those first. In practice, if you have a thousand parameters that determine an architecture, and 10 settings for each of them, the size ratio for the (assumed unique) globally best architecture is 10^-1000. In this setting, I expect several orders of magnitude of difference between the size ratio of almost-AGI and the size ratio of AGI, making it essentially guaranteed that you find an almost-AGI architecture before an AGI architecture.

• In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (e.g. by several orders of magnitude), and so you’re more likely to find one of those first.

For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”]. The “$1B NAS discontinuity scenario” allows for the $1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.

• For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”]. The “$1B NAS discontinuity scenario” allows for the $1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.

Agreed. My point is that the $100M NAS would find the almost-AGI architectures. (My point with the size ratios is that whatever criterion you use to say “and that’s why the $1B NAS finds AGI while the $100M NAS doesn’t”, my response would be that “well, almost-AGI architectures require a slightly easier-to-achieve value of <criterion>, that the $100M NAS would have achieved”.)
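For concreteness, here is a minimal sketch of the two calculations in this exchange (exact arithmetic via `fractions`, since a ratio like 10^-1000 underflows floating point; the 1000-parameters-with-10-settings space is the hypothetical one described above):

```python
from fractions import Fraction

# Diminishing returns of optimization power on a non-fragile landscape:
# E[max of n i.i.d. Uniform([0, 1]) draws] = n / (n + 1), so 10x more
# draws barely moves the expected best result once n is large.
for n in (10, 100, 1000):
    print(n, float(Fraction(n, n + 1)))  # 0.909..., 0.990..., 0.999...

# The size-ratio arithmetic: 1000 architecture parameters with 10
# settings each give 10**1000 possible architectures, so the (assumed
# unique) globally best one is a 10**-1000 fraction of the space.
ratio_best = Fraction(1, 10 ** 1000)
print(ratio_best < Fraction(1, 10 ** 10))  # True: far below the 10^-10 above
```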

• I’ve seen the “ML gets de­ployed care­lessly” nar­ra­tive pop up on LW a bunch, and while it does seem ac­cu­rate in many cases, I wanted to note that there are counter-ex­am­ples. The most promi­nent counter-ex­am­ple I’m aware of is the in­cred­ibly cau­tious ap­proach Deep­Mind/​Google took when de­sign­ing the ML sys­tem that cools Google’s dat­a­cen­ters.

• This seems to be care­ful de­ploy­ment. The con­cept of de­ploy­ment is go­ing from an AI in the lab, to the same AI in con­trol of a real world sys­tem. Sup­pose your de­sign pro­cess was to fid­dle around in the lab un­til you make some­thing that seems to work. Once you have that, you look at it to un­der­stand why it works. You try to prove the­o­rems about it. You sub­ject it to some ex­ten­sive bat­tery of test­ing and will only put it in a self driv­ing car/​ data cen­ter cool­ing sys­tem once you are con­fi­dent it is safe.

There are two places this could fail. Your test­ing pro­ce­dures could be in­suffi­cient, or your AI could hack out of the lab be­fore the test­ing starts. I see lit­tle to no defense against the lat­ter.

• Would it be fair to sum­ma­rize your view here as “As­sum­ing no foom, we’ll be able to iter­ate, and that’s prob­a­bly enough.”?

• Hmm, I think I’d want to ex­plic­itly in­clude two other points, that are kind of in­cluded in that but don’t get com­mu­ni­cated well by that sum­mary:

• There may not be a prob­lem at all; per­haps by de­fault pow­er­ful AI sys­tems are not goal-di­rected.

• If there is a prob­lem, we’ll get ev­i­dence of its ex­is­tence be­fore it’s too late, and co­or­di­na­tion to not build prob­le­matic AI sys­tems will buy us ad­di­tional time.

• Cool, just wanted to make sure I’m en­gag­ing with the main ar­gu­ment here. With that out of the way...

• I gen­er­ally buy the “no foom ⇒ iter­ate ⇒ prob­a­bly ok” sce­nario. There are some caveats and qual­ifi­ca­tions, but broadly-defined “no foom” is a crux for me—I ex­pect at least some kind of de­ci­sive strate­gic ad­van­tage for early AGI, and would find the “al­igned by de­fault” sce­nario plau­si­ble in a no-foom world.

• I do not think that a lack of goal-di­rect­ed­ness is par­tic­u­larly rele­vant here. If an AI has ex­treme ca­pa­bil­ities, then a lack of goals doesn’t re­ally make it any safer. At some point I’ll prob­a­bly write a post about Don Nor­man’s fridge which talks about this in more depth, but the short ver­sion is: if we have an AI with ex­treme ca­pa­bil­ities but a con­fus­ing in­ter­face, then there’s a high chance that we all die, goal-di­rec­tion or not. In the “no foom” sce­nario, we’re as­sum­ing the AI won’t have those ex­treme ca­pa­bil­ities, but it’s foom vs no foom which mat­ters there, not goals vs no goals.

• I also dis­agree with co­or­di­na­tion hav­ing any hope what­so­ever if there is a prob­lem. There’s a huge unilat­er­al­ist prob­lem there, with mil­lions of peo­ple each eas­ily able to push the shiny red but­ton. I think straight-up solv­ing all of the tech­ni­cal al­ign­ment prob­lems would be much eas­ier than that co­or­di­na­tion prob­lem.

Look­ing at both the first and third point, I sus­pect that a sub-crux might be ex­pec­ta­tions about the re­source re­quire­ments (i.e. com­pute & data) needed for AGI. I ex­pect that, once we have the key con­cepts, hu­man-level AGI will be able to run in re­al­time on an or­di­nary lap­top. (Train­ing might re­quire more re­sources, at least early on. That would re­duce the unilat­er­al­ist prob­lem, but in­crease the chance of de­ci­sive strate­gic ad­van­tage due to the higher bar­rier to en­try.)

EDIT: to clar­ify, those sec­ond two points are both con­di­tioned on foom. Point be­ing, the only thing which ac­tu­ally mat­ters here is foom vs no foom:

• if there’s no foom, then we can prob­a­bly iter­ate, and then we’re prob­a­bly fine any­way (re­gard­less of goal-di­rec­tion, co­or­di­na­tion, etc).

• if there’s foom, then a lack of goal-di­rec­tion won’t help much, and co­or­di­na­tion is un­likely to work.

• the only thing which ac­tu­ally mat­ters here is foom vs no foom

Yeah, I think I mostly agree with this.

if we have an AI with ex­treme ca­pa­bil­ities but a con­fus­ing in­ter­face, then there’s a high chance that we all die

Yeah, I agree with that (as­sum­ing “ex­treme ca­pa­bil­ities” = re­ar­rang­ing atoms how­ever it sees fit, or some­thing of that na­ture), but why must it have a con­fus­ing in­ter­face? Couldn’t you just talk to it, and it would know what you mean? So I do think the goal-di­rected point does mat­ter.

I sus­pect that a sub-crux might be ex­pec­ta­tions about the re­source re­quire­ments (i.e. com­pute & data) needed for AGI. I ex­pect that, once we have the key con­cepts, hu­man-level AGI will be able to run in re­al­time on an or­di­nary lap­top.

I agree that this is a sub-crux. Note that I be­lieve that even­tu­ally hu­man-level AGI will be able to run on a lap­top, just that it will be pre­ceded by hu­man-level AGIs that take more com­pute.

Train­ing might re­quire more re­sources, at least early on. That would re­duce the unilat­er­al­ist prob­lem, but in­crease the chance of de­ci­sive strate­gic ad­van­tage due to the higher bar­rier to en­try.

I tend to think that if prob­lems arise, you’ve mostly lost already, so I’m ac­tu­ally hap­pier about de­ci­sive strate­gic ad­van­tage be­cause it re­duces com­pet­i­tive pres­sure.

But tbc, I broadly agree with all of your points, and do think that in FOOM wor­lds most of my ar­gu­ments don’t work. (Though I con­tinue to be con­fused what ex­actly a FOOM world looks like.)

• but why must it have a con­fus­ing in­ter­face? Couldn’t you just talk to it, and it would know what you mean?

That’s where the Don Nor­man part comes in. In­ter­faces to com­pli­cated sys­tems are con­fus­ing by de­fault. The gen­eral prob­lem of sys­tem­at­i­cally build­ing non-con­fus­ing in­ter­faces is, in my mind at least, roughly equiv­a­lent to the full tech­ni­cal prob­lem of AI al­ign­ment. (Writ­ing a pro­gram which knows what you mean is also, in my mind, roughly equiv­a­lent to the full tech­ni­cal prob­lem of AI al­ign­ment.) A word­ing which makes it more ob­vi­ous:

• The main prob­lem of AI al­ign­ment is to trans­late what a hu­man wants into a for­mat us­able by a machine

• The main prob­lem of user in­ter­face de­sign is to help/​al­low a hu­man to trans­late what they want into a for­mat us­able by a machine

Some­thing like e.g. tool AI puts more of the trans­la­tion bur­den on the hu­man, rather than on the AI, but that doesn’t make the trans­la­tion it­self any less difficult.

In a non-foomy world, the trans­la­tion doesn’t have to be perfect—hu­man­ity won’t be wiped out if the AI doesn’t quite perfectly un­der­stand what we mean. Ex­treme ca­pa­bil­ities make high-qual­ity trans­la­tion more im­por­tant, not just be­cause of Good­hart, but be­cause the trans­la­tion it­self will break down in sce­nar­ios very differ­ent from what hu­mans are used to. So if the AI has the ca­pa­bil­ities to achieve sce­nar­ios very differ­ent from what hu­mans are used to, then that trans­la­tion needs to be quite good.

• Do you agree that an AI with ex­treme ca­pa­bil­ities should know what you mean, even if it doesn’t act in ac­cor­dance with it? (This seems like an im­pli­ca­tion of “ex­treme ca­pa­bil­ities”.)

• No. The whole no­tion of a hu­man “mean­ing things” pre­sumes a cer­tain level of ab­strac­tion. One could imag­ine an AI sim­ply rea­son­ing about molecules or fields (or at least in­di­vi­d­ual neu­rons), with­out hav­ing any need for view­ing cer­tain chunks of mat­ter as hu­mans who mean things. In prin­ci­ple, no pre­dic­tive power what­so­ever would be lost in that view of the world.

That said, I do think that prob­lem is less cen­tral/​im­me­di­ate than the prob­lem of tak­ing an AI which does know what we mean, and point­ing at that AI’s con­cept-of-what-we-mean—i.e. in or­der to pro­gram the AI to do what we mean. Even if an AI learns a con­cept of hu­man val­ues, we still need to be able to point to that con­cept within the AI’s con­cept-space in or­der to ac­tu­ally al­ign it—and that means trans­lat­ing be­tween AI-no­tion-of-what-we-want and our-no­tion-of-what-we-want.

• That’s the crux for me; I ex­pect AI sys­tems that we build to be ca­pa­ble of “know­ing what you mean” (us­ing the ap­pro­pri­ate level of ab­strac­tion). They may also use other lev­els of ab­strac­tion, but I ex­pect them to be ca­pa­ble of us­ing that one.

Even if an AI learns a con­cept of hu­man val­ues, we still need to be able to point to that con­cept within the AI’s con­cept-space in or­der to ac­tu­ally al­ign it

Yes, I would call that the cen­tral prob­lem. (Though it would also be fine to build a poin­ter to a hu­man and have the AI “help the hu­man”, with­out nec­es­sar­ily point­ing to hu­man val­ues.)

• Yes, I would call that the cen­tral prob­lem. (Though it would also be fine to build a poin­ter to a hu­man and have the AI “help the hu­man”, with­out nec­es­sar­ily point­ing to hu­man val­ues.)

How would we do either of those things without a workable theory of embedded agency, abstraction, some idea of what kind of structure human values have, etc.?

• If you wanted a prov­able guaran­tee be­fore pow­er­ful AI sys­tems are ac­tu­ally built, you prob­a­bly can’t do it with­out the things you listed.

I’m claiming that as we get pow­er­ful AI sys­tems, we could figure out tech­niques that work with those AI sys­tems. They only ini­tially need to work for AI sys­tems that are around our level of in­tel­li­gence, and then we can im­prove our tech­niques in tan­dem with the AI sys­tems gain­ing in­tel­li­gence. In that set­ting, I’m rel­a­tively op­ti­mistic about things like “just train the AI to fol­low your in­struc­tions”; while this will break down in ex­otic cases or as the AI scales up, those cases are rare and hard to find.

• I’m not re­ally think­ing about prov­able guaran­tees per se. I’m just think­ing about how to point to the AI’s con­cept of hu­man val­ues—di­rectly point to it, not point to some proxy of it, be­cause prox­ies break down etc.

(Rough heuris­tic here: it is not pos­si­ble to point di­rectly at an ab­stract ob­ject in the ter­ri­tory. Even though a ter­ri­tory of­ten sup­ports cer­tain nat­u­ral ab­strac­tions, which are in­stru­men­tally con­ver­gent to learn/​use, we still can’t un­am­bigu­ously point to that ab­strac­tion in the ter­ri­tory—only in the map.)

A proxy is prob­a­bly good enough for a lot of ap­pli­ca­tions with lit­tle scale and few cor­ner cases. And if we’re do­ing some­thing like “train the AI to fol­low your in­struc­tions”, then a proxy is ex­actly what we’ll get. But if you want, say, an AI which “tries to help”—as op­posed to e.g. an AI which tries to look like it’s helping—then that means point­ing di­rectly to hu­man val­ues, not to a proxy.

Now, it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that’s what you have in mind, and I do think it’s plausible, even if it sounds a bit crazy. Of course, without better theoretical tools, we still wouldn’t have a way to directly check, even in hindsight, whether the AI actually wound up pointing to human values or not. (Again, I’m not talking about provable guarantees here; I just want to be able to look at the AI’s own internal data structures and figure out (a) whether it has a notion of human values, and (b) whether it’s actually trying to act in accordance with them, or just something correlated with them.)
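To make the hindsight check concrete, here is a minimal sketch of the first half of that question, assuming we can record the model’s hidden activations and label inputs for the concept of interest. Everything below (the activations, the labels, the probe) is a hypothetical stand-in, and a linear probe at best locates a candidate representation of (a); it says nothing about (b).

```python
# Minimal sketch, assuming access to hidden activations and labeled
# examples of the concept. All data here is synthetic and hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for activations recorded on inputs that do / don't involve the
# concept (e.g. "helping"). In practice these would come from the network.
acts = rng.normal(size=(1000, 512))
labels = (acts[:, :8].sum(axis=1) > 0).astype(int)  # synthetic ground truth

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)

# (a) Does some direction in activation space track the concept at all?
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# (b) -- whether the model *acts on* that concept rather than a correlate --
# is the part a probe cannot answer: it finds a representation, not the
# role that representation plays in the policy.
```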

• it is pos­si­ble that we could train an AI against a proxy, and it would end up point­ing to ac­tual hu­man val­ues in­stead, sim­ply due to im­perfect op­ti­miza­tion dur­ing train­ing. I think that’s what you have in mind

Kind of, but not ex­actly.

I think that what­ever proxy is learned will not be a perfect poin­ter. I don’t know if there is such a thing as a “perfect poin­ter”, given that I don’t think there is a “right” an­swer to the ques­tion of what hu­man val­ues are, and con­se­quently I don’t think there is a “right” an­swer to what is helpful vs. not helpful.

I think the learned proxy will be a good enough poin­ter that the agent will not be ac­tively try­ing to kill us all, will let us cor­rect it, and will gen­er­ally do use­ful things. It seems likely that if the agent was mag­i­cally scaled up a lot, then bad things could hap­pen due to the er­rors in the poin­ter. But I’d hope that as the agent scales up, we im­prove and cor­rect the poin­ter (where “we” doesn’t have to be just hu­mans; it could also in­clude other AI as­sis­tants).

• It seems that the in­ter­vie­wees here ei­ther:

1. Use “AI risk” in a nar­rower way than I do.

2. Ne­glected to con­sider some sources/​forms of AI risk (see above link).

3. Have con­sid­ered other sources/​forms of AI risk but do not find them worth ad­dress­ing.

4. Are wor­ried about other sources/​forms of AI risk but they weren’t brought up dur­ing the in­ter­views.

Can you talk about which of these is the case for your­self (Ro­hin) and for any­one else whose think­ing you’re fa­mil­iar with? (Or if any of the other in­ter­vie­wees would like to chime in for them­selves?)

• For con­text, here’s the one time in the in­ter­view I men­tion “AI risk” (quot­ing 2 ear­lier para­graphs for con­text):

Paul Chris­ti­ano: I don’t know, the fu­ture is 10% worse than it would oth­er­wise be in ex­pec­ta­tion by virtue of our failure to al­ign AI. I made up 10%, it’s kind of a ran­dom num­ber. I don’t know, it’s less than 50%. It’s more than 10% con­di­tioned on AI soon I think.
[...]
Asya Ber­gal: I think my im­pres­sion is that that 10% is lower than some large set of peo­ple. I don’t know if other peo­ple agree with that.
Paul Chris­ti­ano: Cer­tainly, 10% is lower than lots of peo­ple who care about AI risk. I mean it’s worth say­ing, that I have this slightly nar­row con­cep­tion of what is the al­ign­ment prob­lem. I’m not in­clud­ing all AI risk in the 10%. I’m not in­clud­ing in some sense most of the things peo­ple nor­mally worry about and just in­clud­ing the like ‘we tried to build an AI that was do­ing what we want but then it wasn’t even try­ing to do what we want’. I think it’s lower now or even af­ter that caveat, than pes­simistic peo­ple. It’s go­ing to be lower than all the MIRI folks, it’s go­ing to be higher than al­most ev­ery­one in the world at large, es­pe­cially af­ter spe­cial­iz­ing in this prob­lem, which is a prob­lem al­most no one cares about, which is pre­cisely how a thou­sand full time peo­ple for 20 years can re­duce the whole risk by half or some­thing.

(But it’s still the case that asked “Can you ex­plain why it’s valuable to work on AI risk?” I re­sponded by al­most en­tirely talk­ing about AI al­ign­ment, since that’s what I work on and the kind of work where I have a strong view about cost-effec­tive­ness.)

• We dis­cussed this here for my in­ter­view; my an­swer is the same as it was then (ba­si­cally a com­bi­na­tion of 3 and 4). I don’t know about the other in­ter­vie­wees.

• I would guess that AI sys­tems will be­come more in­ter­pretable in the fu­ture, as they start us­ing the fea­tures /​ con­cepts /​ ab­strac­tions that hu­mans are us­ing.

This sort of reasoning seems to assume that abstraction space is one-dimensional, so AI must use human concepts on the path from subhuman to superhuman. I disagree. Like most things we don’t have strong reason to think are 1D, and which take many bits of information to describe, abstractions seem high-dimensional. So on the path from subhuman to superhuman, the AI must use abstractions that are as predictively useful as human abstractions. These will not be anything like human abstractions unless the system was designed from a detailed neurological model of humans. Any AI that humans can reason about using our inbuilt empathetic reasoning is basically a mind upload, or a mind that differs from humans less than humans differ from each other. This is not what ML will create. Human understanding of AI systems will have to be by abstract mathematical reasoning, the way we understand formal maths. Empathetic reasoning about human-level AI is just asking for anthropomorphism. Our three options are:

1) An AI we don’t understand.

2) An AI we can reason about in terms of maths.

3) A virtual human.

• While I might agree with the three op­tions at the bot­tom, I don’t agree with the rea­son­ing to get there.

Ab­strac­tions are pretty heav­ily de­ter­mined by the ter­ri­tory. Hu­mans didn’t look at the world and pick out “tree” as an ab­stract con­cept be­cause of a bunch of hu­man-spe­cific fac­tors. “Tree” is a re­cur­ring pat­tern on earth, and even aliens would no­tice that same cluster of things, as­sum­ing they paid at­ten­tion. Even on the em­pathic front, you don’t need a hu­man-like mind in or­der to no­tice the com­mon pat­terns of hu­man be­hav­ior (in hu­mans) which we call “anger” or “sad­ness”.

• Ab­strac­tions are pretty heav­ily de­ter­mined by the ter­ri­tory.

+1, that’s my re­sponse as well.

• Some abstractions are heavily determined by the territory. The concept of trees is pretty heavily determined by the territory. Whereas the concept of betrayal is determined by the way that human minds function, which is determined by other people’s abstractions. So while it seems reasonably likely to me that an AI “naturally thinks” in terms of the same low-level abstractions as humans, its thinking in terms of human high-level abstractions seems much less likely, absent some type of safety intervention. Which is particularly important because most of the key human values are very high-level abstractions.

• My guess is that if you have to deal with hu­mans, as at least early AI sys­tems will have to do, then ab­strac­tions like “be­trayal” are heav­ily de­ter­mined.

I agree that if you don’t have to deal with hu­mans, then things like “be­trayal” may not arise; similarly if you don’t have to deal with Earth, then “trees” are not heav­ily de­ter­mined ab­strac­tions.

• Neu­ral nets have around hu­man perfor­mance on Ima­genet.

If abstraction were purely a feature of the territory, I would expect the failure cases to be similar to human failure cases. Looking at https://github.com/hendrycks/natural-adv-examples, this does not seem to be strongly the case, though some of the examples (e.g. dark shiny stone classified as a sea lion) show the failures aren’t totally inhuman, the way they are with adversarial examples.
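For anyone who wants to run the comparison themselves, a sketch like the following would do it, assuming the ImageNet-A images from the linked repo have been downloaded to a local folder (the path is an assumption; also note ImageNet-A covers a 200-class subset of ImageNet, so a careful evaluation would restrict or remap the 1000-way predictions to those classes).

```python
# Rough sketch: run a pretrained ImageNet classifier over the natural
# adversarial examples and inspect what it gets wrong. Assumes the
# ImageNet-A dataset has been downloaded to ./imagenet-a.
import torch
from torchvision import datasets, models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(pretrained=True).eval()
dataset = datasets.ImageFolder("./imagenet-a", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

with torch.no_grad():
    for images, _ in loader:
        preds = model(images).argmax(dim=1)
        # Compare preds against the folder labels to judge whether the
        # failures look "human-like" (plausible confusions) or alien.
        print(preds[:8])
        break
```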

Hu­mans didn’t look at the world and pick out “tree” as an ab­stract con­cept be­cause of a bunch of hu­man-spe­cific fac­tors.

I am not saying that trees aren’t a cluster in thingspace. What I am saying is that if there were many clusters in thingspace that were as tight and predictively useful as “tree”, but were not possible for humans to conceptualize, we wouldn’t know it. There are plenty of concepts that humans didn’t develop for most of human history, despite those concepts being predictively useful, until an odd genius came along or the concept was pinned down by massive experimental evidence, e.g. inclusive genetic fitness, entropy, etc.

Consider that evolution optimized us in an environment that contained trees, and in which predicting them was useful, so it would be more surprising for there to be a concept that is useful in the ancestral environment that we can’t understand than a concept that we can’t understand in a non-ancestral domain.

A map can look heavily determined by the territory, yet human maps contain rivers and not geological rock formations: there could be features of the territory that could be mapped but that humans don’t map.

If you be­lieve the post that

Even­tu­ally, suffi­ciently in­tel­li­gent AI sys­tems will prob­a­bly find even bet­ter con­cepts that are alien to us,

Then you can form an equally good, nonhuman concept by taking the better alien concept and adding random noise. Of course, an AI trained on text might share our concepts just because our concepts are the most predictively useful way to predict our writing. I would also like to assign some probability to AI systems that don’t use anything recognizable as a concept. You might be able to say 90% of blue objects are egg-shaped, 95% of cubes are red … 80% of furred objects that glow in the dark are flexible … without ever splitting objects into bleggs and rubes. Seen from this perspective, you have a density function over thingspace, and a sum of clusters might not be the best way to describe it. AIXI never talks about trees; it just simulates everything at the quantum level. Maybe there are fast algorithms that don’t even ascribe discrete concepts.
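A toy sketch of that last framing, with synthetic data and hypothetical names: a kernel density estimate describes the same “thingspace” as a Gaussian mixture without ever committing to discrete clusters.

```python
# Toy sketch: the same data modeled as a sum of clusters (GMM) versus
# as a smooth density with no discrete concepts (KDE). Purely illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Two loose clumps in a 2-D "thingspace".
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(200, 2)),
])

# Cluster-based description: a sum of two Gaussians ("bleggs" and "rubes").
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

# Cluster-free description: a density function, no discrete concepts.
kde = KernelDensity(bandwidth=0.5).fit(data)

query = np.array([[1.5, 1.5]])
print("GMM log-density:", gmm.score_samples(query))
print("KDE log-density:", kde.score_samples(query))
```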

• Neu­ral nets have around hu­man perfor­mance on Ima­genet.

But those trained neu­ral nets are very sub­hu­man on other image un­der­stand­ing tasks.

Then you can form an equally good, non­hu­man con­cept by tak­ing the bet­ter alien con­cept and adding ran­dom noise.

I would ex­pect that the alien con­cepts are some­thing we haven’t figured out be­cause we don’t have enough data or com­pute or logic or some other re­source, and that con­straint will also ap­ply to the AI. If you take that con­cept and “add ran­dom noise” (which I don’t re­ally un­der­stand), it would pre­sum­ably still re­quire the same amount of re­sources, and so the AI still won’t find it.

For the rest of your comment, I agree that we can’t theoretically rule those scenarios out, but there’s no theoretical reason to rule them in either. So far the empirical evidence seems to me to be in favor of “abstractions are determined by the territory”: e.g. ImageNet neural nets seem to have human-interpretable low-level abstractions (edge detectors, curve detectors, color detectors), while having strange high-level abstractions; I claim that the strange high-level abstractions are bad, and only work on ImageNet because they were specifically designed to do so and ImageNet is sufficiently narrow that you can get to good performance with bad abstractions.
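For what it’s worth, the low-level half of that claim is easy to check directly: the standard exercise is to visualize the first-layer filters of a pretrained ImageNet model, which tend to look like edge, curve, and color detectors. A minimal sketch:

```python
# Visualize the first convolutional layer of a pretrained ResNet-50;
# the 64 filters typically resemble edge, curve, and color detectors.
import torch
from torchvision import models
from torchvision.utils import save_image

model = models.resnet50(pretrained=True)
filters = model.conv1.weight.detach().clone()  # shape: (64, 3, 7, 7)

# Normalize each filter to [0, 1] so it renders as an RGB image.
fmin = filters.amin(dim=(1, 2, 3), keepdim=True)
fmax = filters.amax(dim=(1, 2, 3), keepdim=True)
filters = (filters - fmin) / (fmax - fmin)

save_image(filters, "conv1_filters.png", nrow=8)
```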

• By adding random noise, I meant adding wiggles to the edge of the set in thingspace; for example, adding noise to “bird” might exclude “ostrich” and include “duck-billed platypus”.

I agree that the high-level ImageNet concepts are bad in this sense; however, are they just bad? If they were just bad, and the limit to finding good concepts was data or some other resource, then we should expect small children and mentally impaired people to have similarly bad concepts. That would suggest a single gradient from better to worse. If, however, current neural networks used concepts substantially different from small children’s, and not just uniformly worse or uniformly better, that would show different sets of concepts at the same low level. This would be fairly strong evidence of multiple sets of concepts at the smart-human level.
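The children-vs-networks comparison isn’t directly runnable, but representational-similarity methods such as linear CKA (Kornblith et al. 2019) are a standard way to ask “do two systems use similar concepts?”. A sketch with random stand-in features (all data below is synthetic):

```python
# Linear centered kernel alignment (CKA) between two systems' feature
# matrices, where rows index the same stimuli shown to both systems.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA: 1.0 means the representations are related by a linear
    map; values near 0 mean they carry unrelated structure."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
reps_a = rng.normal(size=(500, 128))                  # system A's features
reps_b = reps_a @ rng.normal(size=(128, 64))          # linearly related system
reps_c = rng.normal(size=(500, 64))                   # unrelated system

print("related:  ", linear_cka(reps_a, reps_b))
print("unrelated:", linear_cka(reps_a, reps_c))
```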

I would also want to point out that a small fraction of the concepts being different would be enough to make alignment much harder. Even if there were a perfect scale, if ⅓ of the concepts were subhuman, ⅓ human-level, and ⅓ superhuman, it would be hard to understand the system. To get any safety, you need to get your system very close to human concepts. And you need to be confident that you have hit this target.

• From the tran­script with Paul Chris­ti­ano.

Plus, most things can’t de­stroy the ex­pected value of the fu­ture by 10%. You just can’t have that many things, oth­er­wise there’s not go­ing to be any value left in the end.

I don’t understand. Maybe it is just the case that there’s no value left after a large number of things that each reduce the expected value by 10%?
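For concreteness, the arithmetic behind this exchange: if each of n independent things removes 10% of the remaining expected value, the surviving fraction is 0.9^n, which falls off quickly.

```python
# Compounding effect of n independent 10% reductions in expected value.
for n in [1, 5, 10, 20, 50]:
    print(f"{n:2d} independent 10% risks -> {0.9 ** n:.3f} of EV left")
```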

• Paul is im­plic­itly con­di­tion­ing his ac­tions on be­ing in a world where there’s a de­cent amount of ex­pected value left for his ac­tions to af­fect. This is tech­ni­cally part of a de­ci­sion pro­ce­dure, rather than a state­ment about epistemic cre­dences, but it’s con­fus­ing be­cause he frames it as an epistemic cre­dence.

• The biggest dis­agree­ment be­tween me and more pes­simistic re­searchers is that I think grad­ual take­off is much more likely than dis­con­tin­u­ous take­off (and in fact, the first, third and fourth para­graphs above are quite weak if there’s a dis­con­tin­u­ous take­off).

It’s been ar­gued be­fore that Con­tin­u­ous is not the same as Slow by any nor­mal stan­dard, so the strat­egy of ‘deal­ing with things as they come up’, while more vi­able un­der a con­tin­u­ous sce­nario, will prob­a­bly not be suffi­cient.

It seems to me like you’re as­sum­ing longter­mists are very likely not re­quired at all in a case where progress is con­tin­u­ous. I take con­tin­u­ous to just mean that we’re in a world where there won’t be sud­den jumps in ca­pa­bil­ity, or ap­par­ently use­less sys­tems sud­denly cross­ing some thresh­old and be­com­ing su­per­in­tel­li­gent, not where progress is slow or easy to re­verse. We could still pick a com­pletely wrong ap­proach that makes al­ign­ment much more difficult and set our­selves on a likely path to­wards dis­aster, even if the fol­low­ing is true:

So far as I can tell, the best one-line sum­mary for why we should ex­pect a con­tin­u­ous and not a fast take­off comes from the in­ter­view Paul Chris­ti­ano gave on the 80k pod­cast: ‘I think if you op­ti­mize AI sys­tems for rea­son­ing, it ap­pears much, much ear­lier.’
So far as I can tell, Paul’s point is that, absent specific reasons to think otherwise, the prima facie case is that any time we are trying hard to optimize for some criterion, we should expect the ‘many small changes that add up to one big effect’ situation.
Then he goes on to ar­gue that the spe­cific ar­gu­ments that AGI is a rare case where this isn’t true (like nu­clear weapons) are ei­ther wrong or aren’t strong enough to make dis­con­tin­u­ous progress plau­si­ble.

In a world where continuous but moderately fast takeoff is likely, I can easily imagine doom scenarios that would require long-term strategy or conceptual research early on to avoid, even if none of them involve FOOM. Imagine that the accepted standard for aligned AI follows some particular research agenda, like Cooperative Inverse Reinforcement Learning, but it turns out that CIRL starts to behave pathologically and tries to wirehead itself as it gets more and more capable, and that it’s a fairly deep flaw that we can only patch and not avoid.

Let’s say that over the course of a couple of years, failures of CIRL systems start to appear and compound very rapidly until they constitute an existential disaster. Maybe people realize what’s going on, but by then it would be too late, because the right approach would have been to try some other approach to AI alignment, but the research to do that doesn’t exist and can’t be done anywhere near fast enough. (Like Paul Christiano’s “What failure looks like”.)

• In the situ­a­tions you de­scribe, I would still be some­what op­ti­mistic about co­or­di­na­tion. But yeah, such situ­a­tions lead­ing to doom seem plau­si­ble, and this is why the es­ti­mate is 90% in­stead of 95% or 99%. (Though note that the num­bers are very rough.)

• Nice to see that there are not just radical positions in the AI safety crowd, and that there is a drift away from alarmism and towards “let’s try various approaches, iterate, and see what we can learn” instead of “we must figure out AI safety first, or else!” Also, Christiano’s approach of “let’s at least ensure we can build something reasonably safe for the near term” has at least a chance of success, since one way or another, something will get built.

My personal guess, as someone who knows nothing about ML and very little about AI safety, but a non-zero amount about research and development in general, is that the embedded agency problems are way too deep to be satisfactorily resolved before ML gets AI to the level of an average programmer. But maybe MIRI, like the NSA, has a few tricks up its sleeve that are not visible to the general public. This does not seem likely, though; otherwise a lot of the recent discussions of embedded agency would be smoke and mirrors, which is not something MIRI is likely to engage in.

• “There can’t be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left.”

This is the argument-from-consequences fallacy. There may be many things that could destroy the future with high probability, and we are simply doomed. But the more interesting scenario, and a much better working assumption, is that there are potentially dangerous things that are likely to destroy the future if we don’t seek to understand them and correct them through concerted effort, as opposed to continuing on with the level of effort and concern we have now.