# [AN #80]: Why AI risk might be solved without additional intervention from longtermists

In this edition, I summarize four conversations that AI Impacts had with researchers who were optimistic that AI safety would be solved "by default". (Note that one of the conversations was with me.)

While all four of these conversations covered very different topics, I think there were three main points of convergence. First, we were relatively unconvinced by the traditional arguments for AI risk, and found discontinuities relatively unlikely. Second, we were more optimistic about solving the problem in the future, when we know more about the problem and have more evidence about powerful AI systems. And finally, we were more optimistic that as we get more evidence of the problem in the future, the existing ML community will actually try to fix that problem.

Con­ver­sa­tion with Paul Chris­ti­ano (Paul Chris­ti­ano, Asya Ber­gal, Ronny Fer­nan­dez, and Robert Long) (sum­ma­rized by Ro­hin): There can’t be too many things that re­duce the ex­pected value of the fu­ture by 10%; if there were, there would be no ex­pected value left (ETA: see this com­ment). So, the prior that any par­tic­u­lar thing has such an im­pact should be quite low. With AI in par­tic­u­lar, ob­vi­ously we’re go­ing to try to make AI sys­tems that do what we want them to do. So start­ing from this po­si­tion of op­ti­mism, we can then eval­u­ate the ar­gu­ments for doom. The two main ar­gu­ments: first, we can’t dis­t­in­guish ahead of time be­tween AIs that are try­ing to do the right thing, and AIs that are try­ing to kill us, be­cause the lat­ter will be­have nicely un­til they can ex­e­cute a treach­er­ous turn. Se­cond, since we don’t have a crisp con­cept of “do­ing the right thing”, we can’t se­lect AI sys­tems on whether they are do­ing the right thing.

How­ever, there are many “sav­ing throws”, or ways that the ar­gu­ment could break down, avoid­ing doom. Per­haps there’s no prob­lem at all, or per­haps we can cope with it with a lit­tle bit of effort, or per­haps we can co­or­di­nate to not build AIs that de­stroy value. Paul as­signs a de­cent amount of prob­a­bil­ity to each of these (and other) sav­ing throws, and any one of them suffices to avoid doom. This leads Paul to es­ti­mate that AI risk re­duces the ex­pected value of the fu­ture by roughly 10%, a rel­a­tively op­ti­mistic num­ber. Since it is so ne­glected, con­certed effort by longter­mists could re­duce it to 5%, mak­ing it still a very valuable area for im­pact. The main way he ex­pects to change his mind is from ev­i­dence from more pow­er­ful AI sys­tems, e.g. as we build more pow­er­ful AI sys­tems, per­haps in­ner op­ti­mizer con­cerns will ma­te­ri­al­ize and we’ll see ex­am­ples where an AI sys­tem ex­e­cutes a non-catas­trophic treach­er­ous turn.

Paul also be­lieves that clean al­gorith­mic prob­lems are usu­ally solv­able in 10 years, or prov­ably im­pos­si­ble, and early failures to solve a prob­lem don’t provide much ev­i­dence of the difficulty of the prob­lem (un­less they gen­er­ate proofs of im­pos­si­bil­ity). So, the fact that we don’t know how to solve al­ign­ment now doesn’t provide very strong ev­i­dence that the prob­lem is im­pos­si­ble. Even if the clean ver­sions of the prob­lem were im­pos­si­ble, that would sug­gest that the prob­lem is much more messy, which re­quires more con­certed effort to solve but also tends to be just a long list of rel­a­tively easy tasks to do. (In con­trast, MIRI thinks that pro­saic AGI al­ign­ment is prob­a­bly im­pos­si­ble.)

Note that even find­ing out that the prob­lem is im­pos­si­ble can help; it makes it more likely that we can all co­or­di­nate to not build dan­ger­ous AI sys­tems, since no one wants to build an un­al­igned AI sys­tem. Paul thinks that right now the case for AI risk is not very com­pel­ling, and so peo­ple don’t care much about it, but if we could gen­er­ate more com­pel­ling ar­gu­ments, then they would take it more se­ri­ously. If in­stead you think that the case is already com­pel­ling (as MIRI does), then you would be cor­re­spond­ingly more pes­simistic about oth­ers tak­ing the ar­gu­ments se­ri­ously and co­or­di­nat­ing to avoid build­ing un­al­igned AI.

One potential reason MIRI is more doomy is that they take a somewhat broader view of AI safety: in particular, in addition to building an AI that is trying to do what you want it to do, they would also like to ensure that when the AI builds successors, it does so well. In contrast, Paul simply wants to leave the next generation of AI systems in at least as good a situation as we find ourselves in now, since they will be both better informed and more intelligent than we are. MIRI has also previously defined aligned AI as one that produces good outcomes when run, which is a much broader conception of the problem than Paul has. But probably the main disagreement between MIRI and ML researchers is that ML researchers expect that we’ll try a bunch of stuff, and something will work out, whereas MIRI expects that the problem is really hard, such that trial and error will only get you solutions that appear to work.

Ro­hin’s opinion: A gen­eral theme here seems to be that MIRI feels like they have very strong ar­gu­ments, while Paul thinks that they’re plau­si­ble ar­gu­ments, but aren’t ex­tremely strong ev­i­dence. Sim­ply hav­ing a lot more un­cer­tainty leads Paul to be much more op­ti­mistic. I agree with most of this.

How­ever, I do dis­agree with the point about “clean” prob­lems. I agree that clean al­gorith­mic prob­lems are usu­ally solved within 10 years or are prov­ably im­pos­si­ble, but it doesn’t seem to me like AI risk counts as a clean al­gorith­mic prob­lem: we don’t have a nice for­mal state­ment of the prob­lem that doesn’t rely on in­tu­itive con­cepts like “op­ti­miza­tion”, “try­ing to do some­thing”, etc. This sug­gests to me that AI risk is more “messy”, and so may re­quire more time to solve.

Con­ver­sa­tion with Ro­hin Shah (Ro­hin Shah, Asya Ber­gal, Robert Long, and Sara Hax­hia) (sum­ma­rized by Ro­hin): The main rea­son I am op­ti­mistic about AI safety is that we will see prob­lems in ad­vance, and we will solve them, be­cause no­body wants to build un­al­igned AI. A likely crux is that I think that the ML com­mu­nity will ac­tu­ally solve the prob­lems, as op­posed to ap­ply­ing a bandaid fix that doesn’t scale. I don’t know why there are differ­ent un­der­ly­ing in­tu­itions here.

In ad­di­tion, many of the clas­sic ar­gu­ments for AI safety in­volve a sys­tem that can be de­com­posed into an ob­jec­tive func­tion and a world model, which I sus­pect will not be a good way to model fu­ture AI sys­tems. In par­tic­u­lar, cur­rent sys­tems trained by RL look like a grab bag of heuris­tics that cor­re­late well with ob­tain­ing high re­ward. I think that as AI sys­tems be­come more pow­er­ful, the heuris­tics will be­come more and more gen­eral, but they still won’t de­com­pose nat­u­rally into an ob­jec­tive func­tion, a world model, and search. In ad­di­tion, we can look at hu­mans as an ex­am­ple: we don’t fully pur­sue con­ver­gent in­stru­men­tal sub­goals; for ex­am­ple, hu­mans can be con­vinced to pur­sue differ­ent goals. This makes me more skep­ti­cal of tra­di­tional ar­gu­ments.

I would guess that AI sys­tems will be­come more in­ter­pretable in the fu­ture, as they start us­ing the fea­tures /​ con­cepts /​ ab­strac­tions that hu­mans are us­ing. Even­tu­ally, suffi­ciently in­tel­li­gent AI sys­tems will prob­a­bly find even bet­ter con­cepts that are alien to us, but if we only con­sider AI sys­tems that are (say) 10x more in­tel­li­gent than us, they will prob­a­bly still be us­ing hu­man-un­der­stand­able con­cepts. This should make al­ign­ment and over­sight of these sys­tems sig­nifi­cantly eas­ier. For sig­nifi­cantly stronger sys­tems, we should be del­e­gat­ing the prob­lem to the AI sys­tems that are 10x more in­tel­li­gent than us. (This is very similar to the pic­ture painted in Chris Olah’s views on AGI safety (AN #72), but that had not been pub­lished and I was not aware of Chris’s views at the time of this con­ver­sa­tion.)

I’m also less wor­ried about race dy­nam­ics in­creas­ing ac­ci­dent risk than the me­dian re­searcher. The benefit of rac­ing a lit­tle bit faster is to have a lit­tle bit more power /​ con­trol over the fu­ture, while also in­creas­ing the risk of ex­tinc­tion a lit­tle bit. This seems like a bad trade from each agent’s per­spec­tive. (That is, the Nash equil­ibrium is for all agents to be cau­tious, be­cause the po­ten­tial up­side of rac­ing is small and the po­ten­tial down­side is large.) I’d be more wor­ried if [AI risk is real AND not ev­ery­one agrees AI risk is real when we have pow­er­ful AI sys­tems], or if the po­ten­tial up­side was larger (e.g. if rac­ing a lit­tle more made it much more likely that you could achieve a de­ci­sive strate­gic ad­van­tage).

Over­all, it feels like there’s around 90% chance that AI would not cause x-risk with­out ad­di­tional in­ter­ven­tion by longter­mists. The biggest dis­agree­ment be­tween me and more pes­simistic re­searchers is that I think grad­ual take­off is much more likely than dis­con­tin­u­ous take­off (and in fact, the first, third and fourth para­graphs above are quite weak if there’s a dis­con­tin­u­ous take­off). If I con­di­tion on dis­con­tin­u­ous take­off, then I mostly get very con­fused about what the world looks like, but I also get a lot more wor­ried about AI risk, es­pe­cially be­cause the “AI is to hu­mans as hu­mans are to ants” anal­ogy starts look­ing more ac­cu­rate. In the in­ter­view I said 70% chance of doom in this world, but with way more un­cer­tainty than any of the other cre­dences, be­cause I’m re­ally con­fused about what that world looks like. Two other dis­agree­ments, be­sides the ones above: I don’t buy Real­ism about ra­tio­nal­ity (AN #25), whereas I ex­pect many pes­simistic re­searchers do. I may also be more pes­simistic about our abil­ity to write proofs about fuzzy con­cepts like those that arise in al­ign­ment.

On timelines, I es­ti­mated a very rough 50% chance of AGI within 20 years, and 30-40% chance that it would be us­ing “es­sen­tially cur­rent tech­niques” (which is ob­nox­iously hard to define). Con­di­tional on both of those, I es­ti­mated 70% chance that it would be some­thing like a mesa op­ti­mizer; mostly be­cause op­ti­miza­tion is a very use­ful in­stru­men­tal strat­egy for solv­ing many tasks, es­pe­cially be­cause gra­di­ent de­scent and other cur­rent al­gorithms are very weak op­ti­miza­tion al­gorithms (rel­a­tive to e.g. hu­mans), and so learned op­ti­miza­tion al­gorithms will be nec­es­sary to reach hu­man lev­els of sam­ple effi­ciency.

Ro­hin’s opinion: Look­ing over this again, I’m re­al­iz­ing that I didn’t em­pha­size enough that most of my op­ti­mism comes from the more out­side view type con­sid­er­a­tions: that we’ll get warn­ing signs that the ML com­mu­nity won’t ig­nore, and that the AI risk ar­gu­ments are not wa­ter­tight. The other parts are par­tic­u­lar in­side view dis­agree­ments that make me more op­ti­mistic, but they don’t fac­tor in much into my op­ti­mism be­sides be­ing ex­am­ples of how the meta con­sid­er­a­tions could play out. I’d recom­mend this com­ment of mine to get more of a sense of how the meta con­sid­er­a­tions fac­tor into my think­ing.

I was also glad to see that I still broadly agree with things I said ~5 months ago (since no ma­jor new op­pos­ing ev­i­dence has come up since then), though as I men­tioned above, I would now change what I place em­pha­sis on.

Con­ver­sa­tion with Robin Han­son (Robin Han­son, Asya Ber­gal, and Robert Long) (sum­ma­rized by Ro­hin): The main theme of this con­ver­sa­tion is that AI safety does not look par­tic­u­larly com­pel­ling on an out­side view. Progress in most ar­eas is rel­a­tively in­cre­men­tal and con­tin­u­ous; we should ex­pect the same to be true for AI, sug­gest­ing that timelines should be quite long, on the or­der of cen­turies. The cur­rent AI boom looks similar to pre­vi­ous AI booms, which didn’t amount to much in the past.

Timelines could be short if progress in AI were “lumpy”, as in a FOOM scenario. This could happen if intelligence was one simple thing that just has to be discovered, but Robin expects that intelligence is actually a bunch of not-very-general tools that together let us do many things, and we simply have to find all of these tools, which will presumably not be lumpy. Most of the value from tools comes from more specific, narrow tools, and intelligence should be similar. In addition, the literature on human uniqueness suggests that it wasn’t “raw intelligence” or small changes to brain architecture that made humans unique, but our ability to process culture (communicating via language, learning from others, etc).

In any case, many researchers are now distancing themselves from the FOOM scenario, and are instead arguing that AI risk occurs due to standard principal-agent problems, in the situation where the agent (AI) is much smarter than the principal (human). Robin thinks that this doesn’t agree with the existing literature on principal-agent problems, in which losses from principal-agent problems tend to be bounded, even when the agent is smarter than the principal.

You might think that since the stakes are so high, it’s worth work­ing on it any­way. Robin agrees that it’s worth hav­ing a few peo­ple (say a hun­dred) pay at­ten­tion to the prob­lem, but doesn’t think it’s worth spend­ing a lot of effort on it right now. Effort is much more effec­tive and use­ful once the prob­lem be­comes clear, or once you are work­ing with a con­crete de­sign; we have nei­ther of these right now and so we should ex­pect that most effort ends up be­ing in­effec­tive. It would be bet­ter if we saved our re­sources for the fu­ture, or if we spent time think­ing about other ways that the fu­ture could go (as in his book, Age of Em).

It’s es­pe­cially bad that AI safety has thou­sands of “fans”, be­cause this leads to a “cry­ing wolf” effect—even if the re­searchers have sub­tle, nu­anced be­liefs, they can­not con­trol the mes­sage that the fans con­vey, which will not be nu­anced and will in­stead con­fi­dently pre­dict doom. Then when doom doesn’t hap­pen, peo­ple will learn not to be­lieve ar­gu­ments about AI risk.

Ro­hin’s opinion: In­ter­est­ingly, I agree with al­most all of this, even though it’s (kind of) ar­gu­ing that I shouldn’t be do­ing AI safety re­search at all. The main place I dis­agree is that losses from prin­ci­pal-agent prob­lems with perfectly ra­tio­nal agents are bounded—this seems crazy to me, and I’d be in­ter­ested in spe­cific pa­per recom­men­da­tions (though note I and oth­ers have searched and not found many).

On the point about lump­iness, my model is that there are only a few un­der­ly­ing fac­tors (such as the abil­ity to pro­cess cul­ture) that al­low hu­mans to so quickly learn to do so many tasks, and al­most all tasks re­quire near-hu­man lev­els of these fac­tors to be done well. So, once AI ca­pa­bil­ities on these fac­tors reach ap­prox­i­mately hu­man level, we will “sud­denly” start to see AIs beat­ing hu­mans on many tasks, re­sult­ing in a “lumpy” in­crease on the met­ric of “num­ber of tasks on which AI is su­per­hu­man” (which seems to be the met­ric that peo­ple of­ten use, though I don’t like it, pre­cisely be­cause it seems like it wouldn’t mea­sure progress well un­til AI be­comes near-hu­man-level).

Conversation with Adam Gleave (Adam Gleave et al) (summarized by Rohin): Adam finds the traditional arguments for AI risk unconvincing. First, it isn’t clear that we will build an AI system that is so capable that it can fight all of humanity from its initial position where it doesn’t have any resources, legal protections, etc. While discontinuous progress in AI could cause this, Adam doesn’t see much reason to expect such discontinuous progress: it seems like AI is progressing by using more computation rather than finding fundamental insights. Second, we don’t know how difficult AI safety will turn out to be; he gives a probability of ~10% that the problem is as hard as (a caricature of) MIRI suggests, where any design not based on mathematical principles will be unsafe. This is especially true because as we get closer to AGI we’ll have many more powerful AI techniques that we can leverage for safety. Third, Adam does expect that AI researchers will eventually solve safety problems; they don’t right now because it seems premature to work on those problems. Adam would be more worried if there were more arms race dynamics, or more empirical evidence or solid theoretical arguments in support of speculative concerns like inner optimizers. He would be less worried if AI researchers spontaneously started to work on relevant problems (more than they already do).

Adam makes the case for AI safety work differ­ently. At the high­est level, it seems pos­si­ble to build AGI, and some or­ga­ni­za­tions are try­ing very hard to build AGI, and if they suc­ceed it would be trans­for­ma­tive. That alone is enough to jus­tify some effort into mak­ing sure such a tech­nol­ogy is used well. Then, look­ing at the field it­self, it seems like the field is not cur­rently fo­cused on do­ing good sci­ence and en­g­ineer­ing to build safe, re­li­able sys­tems. So there is an op­por­tu­nity to have an im­pact by push­ing on safety and re­li­a­bil­ity. Fi­nally, there are sev­eral tech­ni­cal prob­lems that we do need to solve be­fore AGI, such as how we get in­for­ma­tion about what hu­mans ac­tu­ally want.

Adam also thinks that it’s 40-50% likely that when we build AGI, a PhD the­sis de­scribing it would be un­der­stand­able by re­searchers to­day with­out too much work, but ~50% that it’s some­thing rad­i­cally differ­ent. How­ever, it’s only 10-20% likely that AGI comes only from small vari­a­tions of cur­rent tech­niques (i.e. by vastly in­creas­ing data and com­pute). He would see this as more likely if we hit ad­di­tional mile­stones by in­vest­ing more com­pute and data (OpenAI Five was an ex­am­ple of such a mile­stone).

Ro­hin’s opinion: I broadly agree with all of this, with two main differ­ences. First, I am less wor­ried about some of the tech­ni­cal prob­lems that Adam men­tions, such as how to get in­for­ma­tion about what hu­mans want, or how to im­prove the ro­bust­ness of AI sys­tems, and more con­cerned about the more tra­di­tional prob­lem of how to cre­ate an AI sys­tem that is try­ing to do what you want. Se­cond, I am more bullish on the cre­ation of AGI us­ing small vari­a­tions on cur­rent tech­niques, but vastly in­creas­ing com­pute and data (I’d as­sign ~30%, while Adam as­signs 10-20%).

• There can’t be too many things that re­duce the ex­pected value of the fu­ture by 10%; if there were, there would be no ex­pected value left. So, the prior that any par­tic­u­lar thing has such an im­pact should be quite low.

I don’t fol­low this ar­gu­ment; I also checked the tran­script, and I still don’t see why I should buy it. Paul said:

A priori you might’ve been like, well, if you’re going to build some AI, you’re probably going to build the AI so it’s trying to do what you want it to do. Probably that’s that. Plus, most things can’t destroy the expected value of the future by 10%. You just can’t have that many things, otherwise there’s not going to be any value left in the end. In particular, if you had 100 such things, then you’d be down to like 1/1000th of your values. 1/10 hundred thousandth? I don’t know, I’m not good at arithmetic.

Any­way, that’s a pri­ori, just aren’t that many things are that bad and it seems like peo­ple would try and make AI that’s try­ing to do what they want.
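As a quick check of the arithmetic Paul gestures at, here is a minimal sketch that compounds a number of 10% losses. It assumes the losses are multiplicative (each one removes 10% of whatever value remains); that is one reading of the quote, not something the transcript states. The function name `remaining_value` is made up for illustration.

```python
def remaining_value(num_risks: int, loss_per_risk: float = 0.10) -> float:
    """Fraction of expected value left after `num_risks` multiplicative losses.

    Assumption (not from the transcript): each risk independently removes
    `loss_per_risk` of whatever value is still left.
    """
    return (1.0 - loss_per_risk) ** num_risks


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        frac = remaining_value(n)
        print(f"{n:>3} risks -> {frac:.3e} of value left (about 1 part in {1 / frac:,.0f})")
```

Under this assumption, 100 such risks leave roughly 2.7e-5 of the original value, i.e. about one part in 38,000.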

In my words, the ar­gu­ment is “we agree that the fu­ture has non­triv­ial EV, there­fore big nega­tive im­pacts are a pri­ori un­likely”.

But why do we agree about this? Why are we as­sum­ing the fu­ture can’t be that bleak in ex­pec­ta­tion? I think there are good out­side-view ar­gu­ments to this effect, but that isn’t the rea­son­ing here.

• E.g. if you have a broad dis­tri­bu­tion over pos­si­ble wor­lds, some of which are “frag­ile” and have 100 things that cut value down by 10%, and some of which are “ro­bust” and don’t, then you get 10,000x more value from the ro­bust wor­lds. So un­less you are a pri­ori pretty con­fi­dent that you are in a frag­ile world (or they are 10,000x more valuable, or what­ever), the ro­bust wor­lds will tend to dom­i­nate.

Similar ar­gu­ments work if we ag­gre­gate across pos­si­ble paths to achiev­ing value within a fixed, known world—if there are sev­eral ways things can go well, some of which are more ro­bust, those will drive al­most all of the EV. And similarly for moral un­cer­tainty (if there are sev­eral plau­si­ble views, the ones that con­sider this world a lost cause will in­stead spend their in­fluence on other wor­lds) and so forth. I think it’s a rea­son­ably ro­bust con­clu­sion across many differ­ent frame­works: your de­ci­sion shouldn’t end up be­ing dom­i­nated by some hugely con­junc­tive event.
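To make the "robust worlds dominate" claim concrete, here is a small sketch under stated assumptions: a robust world is normalized to value 1, and a fragile world loses value multiplicatively to 100 risks of 10% each. The function name and the prior values tried are illustrative, not taken from the discussion. Under these assumptions the value ratio comes out nearer 38,000x than 10,000x, which I read as the same order-of-magnitude point.

```python
def ev_share_of_robust(p_robust: float, num_risks: int = 100, loss: float = 0.10) -> float:
    """Share of total expected value that comes from robust worlds.

    Assumptions (illustrative only): robust worlds have value 1.0; fragile
    worlds keep (1 - loss) ** num_risks of that value.
    """
    fragile_value = (1.0 - loss) ** num_risks
    robust_ev = p_robust * 1.0
    fragile_ev = (1.0 - p_robust) * fragile_value
    return robust_ev / (robust_ev + fragile_ev)


if __name__ == "__main__":
    for prior in (0.5, 0.01, 0.001):
        share = ev_share_of_robust(prior)
        print(f"P(robust) = {prior:<6} -> {share:.4%} of EV comes from robust worlds")
```

Even with a prior of 0.1% on being in a robust world, the robust worlds still account for most of the expected value in this toy setup, which is the sense in which you would need to be "a priori pretty confident" of fragility for it to matter.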

• I’m more uncertain about this one, but I believe that a separate problem with this answer is that it’s an argument about where value comes from, not an argument about what is probable. Let’s suppose 50% of all worlds are fragile and 50% are robust. If most of the things that destroy a world are due to emerging technology, then we still have similar amounts of both worlds around right now (or similar measure on both classes if they’re infinitely many, or whatever). So it’s not a reason to suspect a non-fragile world right now.

• Yes, but the fact that the frag­ile wor­lds are much more likely to end in the fu­ture is a rea­son to con­di­tion your efforts on be­ing in a ro­bust world.

While I do buy Paul’s ar­gu­ment, I think it’d be very helpful if the var­i­ous sum­maries of the in­ter­views with him were ed­ited to make it clear that he’s talk­ing about value-con­di­tioned prob­a­bil­ities rather than un­con­di­tional prob­a­bil­ities—since the claim as origi­nally stated feels mis­lead­ing. (Even if some de­ci­sion the­o­ries only use the former, most peo­ple think in terms of the lat­ter).

• value-con­di­tioned probabilities

Is this a thing or some­thing you just coined? “Prob­a­bil­ity” has a mean­ing, I’m to­tally against us­ing it for things that aren’t that.

I get why the ar­gu­ment is valid for de­cid­ing what we should do – and you could ar­gue that’s the only im­por­tant thing. But it doesn’t make it more likely that our world is ro­bust, which is what the post was claiming. It’s not about prob­a­bil­ity, it’s about EV.

• This ar­gu­ment seems to point at some ex­tremely im­por­tant con­sid­er­a­tions in the vicinity of “we should act ac­cord­ing to how we want civ­i­liza­tions similar to us to act” (rather than just fo­cus­ing on causally in­fluenc­ing our fu­ture light cone), etc.

The de­tails of the dis­tri­bu­tion over pos­si­ble wor­lds that you use here seem to mat­ter a lot. How ro­bust are the “ro­bust wor­lds”? If they are max­i­mally ro­bust (i.e. things turn out great with prob­a­bil­ity 1 no mat­ter what the civ­i­liza­tion does) then we should as­sign zero weight to the prospect of be­ing in a “ro­bust world”, and place all our chips on be­ing in a “frag­ile world”.

Con­trar­ily, if the dis­tri­bu­tion over pos­si­ble wor­lds as­signs suffi­cient prob­a­bil­ity to wor­lds in which there is a sin­gle very risky thing that cuts EV down by ei­ther 10% or 90% de­pend­ing on whether the civ­i­liza­tion takes it se­ri­ously or not, then per­haps such wor­lds should dom­i­nate our de­ci­sion mak­ing.

• E.g. if you have a broad dis­tri­bu­tion over pos­si­ble wor­lds, some of which are “frag­ile” and have 100 things that cut value down by 10%, and some of which are “ro­bust” and don’t, then you get 10,000x more value from the ro­bust wor­lds. So un­less you are a pri­ori pretty con­fi­dent that you are in a frag­ile world (or they are 10,000x more valuable, or what­ever), the ro­bust wor­lds will tend to dom­i­nate.

This is only true if you as­sume that there is an equal num­ber of ro­bust and frag­ile wor­lds out there, and your un­cer­tainty is strictly ran­dom, i.e. you’re un­cer­tain about which of those wor­lds you live in.

I’m not su­per con­fi­dent that our world is frag­ile, but I sus­pect that most wor­lds look the same. I.e., maybe 99.99% of wor­lds are ro­bust, maybe 99.99% are frag­ile. If it’s the lat­ter, then I prob­a­bly live in a frag­ile world.

• If it’s a 50% chance that 99.99% of wor­lds are ro­bust and 50% chance that 99.99% are frag­ile, then the vast ma­jor­ity of EV comes from the first op­tion where the vast ma­jor­ity of wor­lds are ro­bust.

• You’re right, the na­ture of un­cer­tainty doesn’t ac­tu­ally mat­ter for the EV. My bad.
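The exchange above can be checked numerically. The sketch below compares "50% credence that 99.99% of worlds are robust, 50% that 99.99% are fragile" against a flat 50/50 split over robust and fragile worlds. The value model (robust = 1, fragile = 0.9^100) and all names are assumptions made for illustration.

```python
# Assumed value of a fragile world relative to a robust one (100 losses of 10%).
FRAGILE_VALUE = 0.9 ** 100


def world_ev(frac_robust: float) -> float:
    """Expected value per world when a fraction `frac_robust` of worlds is robust."""
    return frac_robust * 1.0 + (1.0 - frac_robust) * FRAGILE_VALUE


def mixture_ev(hypotheses) -> float:
    """Expected value over (credence, fraction_robust) hypotheses."""
    return sum(credence * world_ev(frac) for credence, frac in hypotheses)


# Correlated uncertainty: almost all worlds look the same, but we don't know which way.
hierarchical = [(0.5, 0.9999), (0.5, 0.0001)]
# Uncorrelated uncertainty: half of all worlds are robust.
flat = [(1.0, 0.5)]

if __name__ == "__main__":
    print("EV, hierarchical uncertainty:", mixture_ev(hierarchical))
    print("EV, flat uncertainty:        ", mixture_ev(flat))
    share = 0.5 * world_ev(0.9999) / mixture_ev(hierarchical)
    print(f"Share of EV from the 'mostly robust' hypothesis: {share:.4%}")
```

In this toy model the two kinds of uncertainty give the same total expected value, and nearly all of it comes from the hypothesis under which most worlds are robust.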

• A likely crux is that I think that the ML com­mu­nity will ac­tu­ally solve the prob­lems, as op­posed to ap­ply­ing a bandaid fix that doesn’t scale. I don’t know why there are differ­ent un­der­ly­ing in­tu­itions here.

I’d be in­ter­ested to hear a bit more about your po­si­tion on this.

I’m go­ing to ar­gue for the “ap­ply­ing bandaid fixes that don’t scale” po­si­tion for a sec­ond. To me, it seems that there’s a strong cul­ture in ML of “ap­ply ran­dom fixes un­til some­thing looks like it works” and then just rol­ling with what­ever comes out of that al­gorithm.

I’ll draw attention to image modelling to illustrate what I’m pointing at. Up until about 2014, the main metric for evaluating image quality was the Bayesian negative log likelihood (NLL). As far as I can tell, this goes all the way back to at least “To Recognize Shapes, First Learn to Generate Images”, where the CD algorithm acts to minimize the negative log likelihood of the data. This can be seen in the VAE paper and also the original GAN paper. However, after GANs became popular, the log likelihood metric seemed to have gone out the window. The GANs made really compelling images. Due to the difficulty of evaluating NLL, people invented new metrics. IS and FID were used to assess the quality of the generated images. I might be wrong, but I think it took a while after that for people to realize that SOTA GANs were getting terrible NLLs compared to SOTA VAEs, even though the VAEs generated images that were significantly blurrier/noisier. It also became obvious that GANs were dropping modes of the distribution, effectively failing to model entire classes of images.

As far as I can tell, there’s been a lot of work to get GANs to model all image modes. The most salient and recent would be DeepMind’s PresGAN, where they clearly show the issue and how PresGAN solves it in Figure 1. However, looking at Table 5, there’s still a huge gap in NLL between PresGAN and VAEs. It seems to me that most of the attempts to solve this issue are very similar to “bandaid fixes that don’t scale” in the sense that they mostly feel like hacks. None of them really address the gap in likelihood between VAEs and GANs.
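To illustrate why dropping a mode is so costly under NLL, here is a toy sketch using categorical distributions as stand-ins. These are not real GANs or VAEs (a real GAN does not even define an explicit likelihood, which is part of why NLL is hard to evaluate for it), and every number and name below is made up for illustration.

```python
import math


def average_nll(data, model_probs):
    """Mean negative log likelihood (in nats) of `data` under a categorical model."""
    return -sum(math.log(model_probs[x]) for x in data) / len(data)


# Toy "dataset" with two equally common modes.
data = ["cat"] * 50 + ["dog"] * 50

# Stand-in for a blurry model that covers both modes but wastes some mass on noise.
covering_model = {"cat": 0.45, "dog": 0.45, "noise": 0.10}

# Stand-in for a sharp model that has almost entirely dropped the "dog" mode.
mode_dropping_model = {"cat": 0.998, "dog": 0.001, "noise": 0.001}

print("covering model NLL:     ", round(average_nll(data, covering_model), 3))
print("mode-dropping model NLL:", round(average_nll(data, mode_dropping_model), 3))
```

The mode-dropping model almost only ever produces clean "cat" samples, yet its average NLL is several times worse (about 3.45 nats against about 0.80 nats), because every "dog" in the data is assigned almost no probability. A metric based only on sample quality would not see this.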

I’m wor­ried that a similar story could hap­pen with AI safety. A prob­lem arises and gets swept un­der the rug for a bit. Later, it’s re­dis­cov­ered and be­comes com­mon knowl­edge. Then, in­stead of solv­ing it be­fore mov­ing for­ward, we see mas­sive in­creases in ca­pa­bil­ities. Si­mul­ta­neously, the prob­lem is at most ad­dressed with hacks that don’t re­ally solve the prob­lem, or solve it just enough to pre­vent the in­crease in ca­pa­bil­ities from be­com­ing ob­vi­ously un­jus­tified.

• To me, it seems that there’s a strong cul­ture in ML of “ap­ply ran­dom fixes un­til some­thing looks like it works” and then just rol­ling with what­ever comes out of that al­gorithm.

I agree that ML of­ten does this, but only in situ­a­tions where the re­sults don’t im­me­di­ately mat­ter. I’d find it much more com­pel­ling to see ex­am­ples where the “ran­dom fix” caused ac­tual bad con­se­quences in the real world.

I’ll draw attention to image modelling to illustrate what I’m pointing at. [...] It also became obvious that GANs were dropping modes of the distribution, effectively failing to model entire classes of images. [...] None of them really address the gap in likelihood between VAEs and GANs.

Per­haps peo­ple are op­ti­miz­ing for “mak­ing pretty pic­tures” in­stead of “nega­tive log like­li­hood”. I wouldn’t be sur­prised if for many ap­pli­ca­tions of GANs, di­ver­sity of images is not ac­tu­ally that im­por­tant, and what you re­ally want is that the few images you do gen­er­ate look re­ally good. In that case, it makes com­plete sense to push pri­mar­ily on GANs, and while you try to ad­dress mode col­lapse, when faced with a trade­off you choose GANs over VAEs any­way.

I’m wor­ried that a similar story could hap­pen with AI safety. A prob­lem arises and gets swept un­der the rug for a bit.

Sup­pose that we had ex­tremely com­pel­ling ev­i­dence that any AI sys­tem run with > X amount of com­pute would definitely kill us all. Do you ex­pect that prob­lem to get swept un­der the rug?

As­sum­ing your an­swer is no, then it seems like whether a prob­lem gets swept un­der the rug de­pends on par­tic­u­lar em­piri­cal con­sid­er­a­tions, such as:

• How bad it would be if the prob­lem was real (the mag­ni­tude of the down­side). This could be eval­u­ated with re­spect to so­ciety and to the in­di­vi­d­ual agents de­cid­ing whether or not to de­ploy the po­ten­tially prob­le­matic AI.

• How com­pel­ling the ev­i­dence is that the prob­lem is real.

I tend to think that ex­ist­ing prob­lems with AI are not that bad (though in most cases ob­vi­ously quite real), while long-term con­cerns about AI would be very bad, but are not ob­vi­ously real. If the long-term con­cerns are real, we should get more ev­i­dence about them in the fu­ture, and then we’ll have a prob­lem that is both very bad and (more) clearly real, and that’s when I ex­pect that it will be taken se­ri­ously.

Con­sider e.g. fair­ness and bias. No­body thinks that the prob­lem is solved. Peo­ple do con­tinue to de­ploy un­fair and bi­ased AI sys­tems, but that’s be­cause the down­side of un­fair and bi­ased AI sys­tems is smaller in mag­ni­tude than the up­side of us­ing the AI sys­tems in the first place—they aren’t be­ing de­ployed be­cause peo­ple think they have “solved the prob­lem”.

• I agree that ML of­ten does this, but only in situ­a­tions where the re­sults don’t im­me­di­ately mat­ter. I’d find it much more com­pel­ling to see ex­am­ples where the “ran­dom fix” caused ac­tual bad con­se­quences in the real world.

[...]

Per­haps peo­ple are op­ti­miz­ing for “mak­ing pretty pic­tures” in­stead of “nega­tive log like­li­hood”. I wouldn’t be sur­prised if for many ap­pli­ca­tions of GANs, di­ver­sity of images is not ac­tu­ally that im­por­tant, and what you re­ally want is that the few images you do gen­er­ate look re­ally good. In that case, it makes com­plete sense to push pri­mar­ily on GANs, and while you try to ad­dress mode col­lapse, when faced with a trade­off you choose GANs over VAEs any­way.

This is fair. How­ever, the point of the ex­am­ple is more that mode drop­ping and bad NLL were not no­ticed when peo­ple started op­ti­miz­ing GANs for image qual­ity. As far as I can tell, it took a while for in­di­vi­d­u­als to no­tice, longer for it to be­come com­mon knowl­edge, and even more time for any­one to do any­thing about it. Even now, the “solu­tions” are hacks that don’t com­pletely re­solve the is­sue.

There was a large window of time where a practitioner could implement a GAN expecting it to cover all the modes. If there were a world where failing to cover all the modes of the distribution led to large negative consequences, the failure would probably have gone unnoticed until it was too late.

Here’s a real example. This is the NTSB crash report for the Uber autonomous vehicle that killed a pedestrian. Someone should probably do an in-depth analysis of the whole thing, but for now I’ll draw your attention to section 1.6.2, Hazard Avoidance and Emergency Braking. In it they say:

When the sys­tem de­tects an emer­gency situ­a­tion, it ini­ti­ates ac­tion sup­pres­sion. This is a one-sec­ond pe­riod dur­ing which the ADS sup­presses planned brak­ing while the (1) sys­tem ver­ifies the na­ture of the de­tected haz­ard and calcu­lates an al­ter­na­tive path, or (2) ve­hi­cle op­er­a­tor takes con­trol of the ve­hi­cle. ATG stated that it im­ple­mented ac­tion sup­pres­sion pro­cess due to the con­cerns of the de­vel­op­men­tal ADS iden­ti­fy­ing false alarms—de­tec­tion of a haz­ardous situ­a­tion when none ex­ists—caus­ing the ve­hi­cle to en­gage in un­nec­es­sary ex­treme ma­neu­vers.

[...]

if the col­li­sion can­not be avoided with the ap­pli­ca­tion of the max­i­mum al­lowed brak­ing, the sys­tem is de­signed to provide an au­di­tory warn­ing to the ve­hi­cle op­er­a­tor while si­mul­ta­neously ini­ti­at­ing grad­ual ve­hi­cle slow­down. In such cir­cum­stance, ADS would not ap­ply the max­i­mum brak­ing to only miti­gate the col­li­sion.

This strikes me as a “ran­dom fix” where the core is­sue was that the sys­tem did not have suffi­cient dis­crim­i­na­tory power to tell apart a safe situ­a­tion from an un­safe situ­a­tion. In­stead of prop­erly solv­ing this prob­lem, the re­searchers put in a hack.

Sup­pose that we had ex­tremely com­pel­ling ev­i­dence that any AI sys­tem run with > X amount of com­pute would definitely kill us all. Do you ex­pect that prob­lem to get swept un­der the rug?

I agree that we shouldn’t be wor­ried about situ­a­tions where there is a clear threat. But that’s not quite the class of failures that I’m wor­ried about. Fair­ness, bias, and ad­ver­sar­ial ex­am­ples are all closer to what I’m get­ting at. The gen­eral pat­tern is that ML re­searchers hack to­gether a sys­tem that works, but has some prob­lems they’re un­aware of. Later, the prob­lems are dis­cov­ered and the re­ac­tion is to hack to­gether a solu­tion. This is pretty much the op­po­site of the safety mind­set EY was talk­ing about. It leaves room for catas­tro­phe in the ini­tial win­dow when the prob­lem goes un­de­tected, and in­definitely af­ter­wards if the hack is in­suffi­cient to deal with the is­sue.

More specifically, I’m worried about a situation where at some point during grad student descent someone says, “That’s funny...” then goes on to publish their work. Later, someone else deploys their idea plus 3 orders of magnitude more computing power and we all die. That, or we don’t all die. Instead we resolve the issue with a hack. Then a couple bumps in computing power and capabilities later we all die.

The above comes across as both paranoid and farfetched, and I’m not sure the AI community will take on the required level of caution to prevent it unless we get an AI equivalent of Chernobyl before we get UFAI. Nuclear reactor design is the only domain I know of where people are close to sufficiently paranoid.

• I’m not sure the AI com­mu­nity will take on the re­quired level of cau­tion to pre­vent it un­less we get an AI equiv­a­lent of Ch­er­nobyl be­fore we get UFAI.

Im­por­tant thing to re­mem­ber is that Ro­hin is ex­plic­itly talk­ing about a non-foom sce­nario, so the as­sump­tion is that hu­man­ity would sur­vive AI-Ch­er­nobyl.

• My worry is less that we wouldn’t sur­vive AI-Ch­er­nobyl as much as it is that we won’t get an AI-Ch­er­nobyl.

I think that this is where there’s a difference in models. Even in a non-FOOM scenario I’m having a hard time envisioning a world where the gap in capabilities between AI-Chernobyl and global catastrophic UFAI is that large. I used Chernobyl as an example because it scared the public and the industry into making things very safe. It had a lot going for it to make that happen. Radiation is invisible and hurts you by either killing you instantly, making your skin fall off, or giving you cancer and birth defects. The disaster was also extremely expensive, with the total costs on the order of 10^11 USD.

If a defective AI system manages to do something that instils the same level of fear into researchers and the public as Chernobyl did, I would expect that we were on the cusp of building systems that we couldn’t control at all. If I’m right and the gap between those two events is small, then there’s a significant risk that nothing will happen in that window. We’ll get plenty of warnings that won’t be sufficient to instil the necessary level of caution into the community, and later down the road we’ll find ourselves in a situation we can’t recover from.

• My impression is that people working on self-driving cars are incredibly safety-conscious, because the risks are very salient.

I don’t think AI-Chernobyl has to be a Chernobyl level disaster, just something that makes the risks salient. E.g. perhaps an elder care AI robot pretends that all of its patients are fine in order to preserve its existence, and this leads to a death and is then discovered. If hospitals let AI algorithms make decisions about drugs according to complicated reward functions, I would expect this to happen with current capabilities. (It’s notable to me that this doesn’t already happen, given the insane hype around AI.)

• My impression is that people working on self-driving cars are incredibly safety-conscious, because the risks are very salient.

Safety-conscious people working on self-driving cars don’t program their cars to not take evasive action after detecting that a collision is imminent.

(It’s notable to me that this doesn’t already happen, given the insane hype around AI.)

I think it already has. (It was for extra care, not drugs, but it’s a clear-cut case of a misspecified objective function leading to suboptimal decisions for a multitude of individuals.) I’ll note, perhaps unfairly, that the fact that this study was not salient enough to make it to your attention even with a culture war signal boost is evidence that it needs to be a Chernobyl level event.

• I agree that Tesla does not seem very safety conscious (but it’s notable that they are still safer than human drivers in terms of fatalities per mile, if I remember correctly?)

I think it already has.

Huh, what do you know.

Faced with an actual example, I’m realizing that what I actually expect would cause people to take it more seriously is a) the belief that AGI is near and b) an example where the AI algorithm “deliberately” causes a problem (i.e. “with full knowledge” that the thing it was doing was not what we wanted). I think most deep RL researchers already believe that reward hacking is a thing (which is what that study shows).

even with a culture war signal boost

Tangential, but that makes it less likely that I read it; I try to completely ignore anything with the term “racial bias” in its title unless it’s directly pertinent to me. (Being about AI isn’t enough to make it pertinent to me.)

• Faced with an actual example, I’m realizing that what I actually expect would cause people to take it more seriously is a) the belief that AGI is near and b) an example where the AI algorithm “deliberately” causes a problem (i.e. “with full knowledge” that the thing it was doing was not what we wanted).

What do you expect the ML community to do at that point? Coordinate to stop or slow down the race to AGI until AI safety/alignment is solved? Or do you think each company/lab will unilaterally invest more into safety/alignment without slowing down capability research much, and that will be sufficient? Or something else?

I worry about a parallel with the “energy community”, a large part of which not just ignores but actively tries to obscure or downplay warning signs about future risks associated with certain forms of energy production. Given that the run-up to AGI will likely generate huge profits for AI companies as well as provide clear benefits for many people (compared to which, the disasters that will have occurred by then may well seem tolerable by comparison), and given probable disagreements between different experts about how serious the future risks are, it seems likely to me that AI risk will become politicized/controversial in a way similar to climate change, which will prevent effective coordination around it.

On the other hand… maybe AI will be more like nuclear power than fossil fuels, and a few big accidents will stall its deployment for quite a while. Is this why you’re relatively optimistic about AI risk being taken seriously, and if so can you share why you think nuclear power is a closer analogy?

• What do you expect the ML community to do at that point?

It depends a lot on the particular warning shot that we get. But on the strong versions of warning shots, where there’s common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)

This depends on other background factors, e.g. how much the various actors think they are value-aligned vs. in zero-sum competition. I currently think the ML community thinks they are mostly but not fully value-aligned, and they will influence companies and governments in that direction. (I also want more longtermists to be trying to build more common knowledge of how much humans are value aligned, to make this more likely.)

I worry about a parallel with the “energy community”

The major disanalogy is that catastrophic outcomes of climate change do not personally affect the CEOs of energy companies very much, whereas AI x-risk affects everyone. (Also, maybe we haven’t gotten clear and obvious warning shots?)

(compared to which, the disasters that will have occurred by then may well seem tolerable by comparison), and given probable disagreements between different experts about how serious the future risks are

I agree that my story requires common knowledge of the risk of building AGI, in the sense that you need people to predict “running this code might lead to all humans dying”, and not “running this code might lead to <warning shot effect>”. You also need relative agreement on the risks.

I think this is pretty achievable. Most of the ML community already agrees that building an AGI is high-risk if not done with some argument for safety. The thing people tend to disagree on is when we will get AGI and how much we should work on safety before then.

• But on the strong versions of warning shots, where there’s common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)

To the extent that we expect strong warning shots and ability to avoid building AGI upon receiving such warning shots, this seems like an argument for researchers/longtermists to work on / advocate for safety problems beyond the standard of “AGI is not trying to deceive us or work against us” (because that standard will likely be reached anyway). Do you agree?

The major disanalogy is that catastrophic outcomes of climate change do not personally affect the CEOs of energy companies very much, whereas AI x-risk affects everyone.

Some types of AI x-risk don’t affect everyone though (e.g., ones that reduce the long term value of the universe or multiverse without killing everyone in the near term).

• To the extent that we expect strong warning shots and ability to avoid building AGI upon receiving such warning shots, this seems like an argument for researchers/longtermists to work on / advocate for safety problems beyond the standard of “AGI is not trying to deceive us or work against us” (because that standard will likely be reached anyway). Do you agree?

Yes.

Some types of AI x-risk don’t affect everyone though (e.g., ones that reduce the long term value of the universe or multiverse without killing everyone in the near term).

Agreed, all else equal those seem more likely to me.

• Ok, I wasn’t sure that you’d agree, but given that you do, it seems that when you wrote the title of this newsletter “Why AI risk might be solved without additional intervention from longtermists” you must have meant “Why some forms of AI risk …”, or perhaps certain forms of AI risk just didn’t come to your mind at that time. In either case it seems worth clarifying somewhere that you don’t currently endorse interpreting “AI risk” as “AI risk in its entirety” in that sentence.

Similarly, on the inside you wrote:

The main reason I am optimistic about AI safety is that we will see problems in advance, and we will solve them, because nobody wants to build unaligned AI. A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn’t scale. I don’t know why there are different underlying intuitions here.

It seems worth clarifying that you’re only optimistic about certain types of AI safety problems.

(I’m basically making the same complaint/suggestion that I made to Matthew Barnett not too long ago. I don’t want to be too repetitive or annoying, so let me know if I’m starting to sound that way.)

• It seems worth clarifying that you’re only optimistic about certain types of AI safety problems.

Tbc, I’m optimistic about all the types of AI safety problems that people have proposed, including the philosophical ones. When I said “all else equal those seem more likely to me”, I meant that if all the other facts about the matter are the same, but one risk affects only future people and not current people, that risk would seem more likely to me because people would care less about it. But I am optimistic about the actual risks that you and others argue for.

That said, over the last week I have become less optimistic specifically about overcoming race dynamics, mostly from talking to people at FHI / GovAI. I’m not sure how much to update though. (Still broadly optimistic.)

it seems that when you wrote the title of this newsletter “Why AI risk might be solved without additional intervention from longtermists” you must have meant “Why some forms of AI risk …”, or perhaps certain forms of AI risk just didn’t come to your mind at that time.

It’s notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines) and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that’s what the four people argued against). This seems like pretty strong evidence that when people hear “AI risk” they now think of technical accidental AI risk, regardless of what the historical definition may have been. I know certainly that is my default assumption when someone (other than you) says “AI risk”.

I would certainly support having clearer definitions and terminology if we could all agree on them.

• But I am optimistic about the actual risks that you and others argue for.

Why? I actually wrote a reply that was more questioning in tone, and then changed it because I found some comments you made where you seemed to be concerned about the additional AI risks. Good thing I saved a copy of the original reply, so I’ll just paste it below:

I wonder if you would consider writing an overview of your perspective on AI risk strategy. (You do have a sequence but I’m looking for something that’s more comprehensive, that includes e.g. human safety and philosophical problems. Or let me know if there’s an existing post that I’ve missed.)

I ask because you’re one of the most prolific participants here but don’t fall into one of the existing “camps” on AI risk for whom I already have good models. It’s happened several times that I see a comment from you that seems wrong or unclear, but I’m afraid to risk being annoying or repetitive with my questions/objections. (I sometimes worry that I’ve already brought up some issue with you and then forgot your answer.) It would help a lot to have a better model of you in my head and in writing so I can refer to that to help me interpret what the most likely intended meaning of a comment is, or to predict how you would likely answer if I were to ask certain questions.

It’s notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines) and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that’s what the four people argued against).

Maybe that’s because the question was asked in a way that indicated the questioner was mostly interested in technical accidental AI risk? And some of them may be fine with defining “AI risk” as “AI-caused x-risk” but just didn’t have the other risks on the top of their minds, because their personal focus is on the technical/accidental side. In other words I don’t think this is strong evidence that all 4 people would endorse defining “AI risk” as “technical accidental AI risk”. It also seems notable that I’ve been using “AI risk” in a broad sense for a while and no one has objected to that usage until now.

I would certainly support having clearer definitions and terminology if we could all agree on them.

The current situation seems to be that we have two good (relatively clear) terms “technical accidental AI risk” and “AI-caused x-risk” and the dispute is over what plain “AI risk” should be shorthand for. Does that seem fair?

• I ask because you’re one of the most prolific participants here but don’t fall into one of the existing “camps” on AI risk for whom I already have good models.

Seems right, I think my opinions fall closest to Paul’s, though it’s also hard for me to tell what Paul’s opinions are. I think this older thread is a relatively good summary of the considerations I tend to think about, though I’d place different emphases now. (Sadly I don’t have the time to write a proper post about what I think about AI strategy; it’s a pretty big topic.)

The current situation seems to be that we have two good (relatively clear) terms “technical accidental AI risk” and “AI-caused x-risk” and the dispute is over what plain “AI risk” should be shorthand for. Does that seem fair?

Yes, though I would frame it as “the ~5 people reading these comments have two clear terms, while everyone else uses a confusing mishmash of terms”. The hard part is in getting everyone else to use the terms. I am generally skeptical of deciding on definitions and getting everyone else to use them, and usually try to use terms the way other people use terms.

In other words I don’t think this is strong evidence that all 4 people would endorse defining “AI risk” as “technical accidental AI risk”. It also seems notable that I’ve been using “AI risk” in a broad sense for a while and no one has objected to that usage until now.

Agreed with this, but see above about trying to conform with the way terms are used, rather than defining terms and trying to drag everyone else along.

• see above about trying to conform with the way terms are used, rather than defining terms and trying to drag everyone else along.

This seems odd given your objection to “soft/slow” takeoff usage and your advocacy of “continuous takeoff” ;)

• I don’t think “soft/slow takeoff” has a canonical meaning: some people (e.g. Paul) interpret it as not having discontinuities, while others interpret it as capabilities increasing slowly past human intelligence over (say) centuries (e.g. Superintelligence). If I say “slow takeoff” I don’t know which one the listener is going to hear it as. (And if I had to guess, I’d expect they think about the centuries-long version, which is usually not the one I mean.)

In contrast, I think “AI risk” has a much more canonical meaning, in that if I say “AI risk” I expect most listeners to interpret it as accidental risk caused by the AI system optimizing for goals that are not our own.

(Perhaps an important point is that I’m trying to communicate to a much wider audience than the people who read all the Alignment Forum posts and comments. I’d feel more okay about “slow takeoff” if I was just speaking to people who have read many of the posts already arguing about takeoff speeds.)

• AI risk is just a shorthand for “accidental technical AI risk.” To the extent that people are confused, I agree it’s probably worth clarifying the type of risk by adding “accidental” and “technical” whenever we can.

However, I disagree with the idea that we should expand the word AI risk to include philosophical failures and intentional risks. If you open the term up, these outcomes might start to happen:

• It becomes unclear in conversation what people mean when they say AI risk
• Like The Singularity, it becomes a buzzword.
• Journalists start projecting Terminator scenarios onto the words, and now have justification because even the researchers say that AI risk can mean a lot of different things.
• It puts a whole bunch of types of risk into one basket, suggesting to outsiders that all attempts to reduce “AI risk” might be equally worthwhile.
• ML researchers start to distrust AI risk researchers, because people who are worried about the Terminator are using the same words as the AI risk researchers and therefore get associated with them.

This can all be avoided by having a community norm to clarify that we mean technical accidental risk when we say AI risk, and when we’re talking about other types of risks we use more precise terminology.

• AI risk is just a shorthand for “accidental technical AI risk.”

I don’t think “AI risk” was originally meant to be a shorthand for “accidental technical AI risk”. The earliest considered (i.e., not off-hand) usage I can find is in the title of Luke Muehlhauser’s AI Risk and Opportunity: A Strategic Analysis where he defined it as “the risk of AI-caused extinction”. (He used “extinction” but nowadays we tend to think in terms of “existential risk” which also includes “permanent large negative consequences”, which seems like a reasonable expansion of “AI risk”.)

However, I disagree with the idea that we should expand the word AI risk to include philosophical failures and intentional risks.

I want to include philosophical failures, as long as the consequences of the failures flow through AI, because (aside from historical usage) technical problems and philosophical problems blend into each other, and I don’t see a point in drawing an arbitrary and potentially contentious border between them. (Is UDT a technical advance or a philosophical advance? Is defining the right utility function for a Sovereign Singleton a technical problem or a philosophical problem? Why force ourselves to answer these questions?) As for “intentional risks” it’s already common practice to include that in “AI risk”:

Dividing AI risks into misuse risks and accident risks has become a prevailing approach in the field.

Besides that, I think there’s also a large grey area between “accident risk” and “misuse” where the risk partly comes from technical/philosophical problems and partly from human nature. For example humans might be easily persuaded by wrong but psychologically convincing moral/philosophical arguments that AIs can come up with and then order their AIs to do terrible things. Even pure intentional risks might have technical solutions. Again I don’t really see the point of trying to figure out which of these problems should be excluded from “AI risk”.

It becomes unclear in conversation what people mean when they say AI risk

It seems perfectly fine to me to use that as shorthand for “AI-caused x-risk” and use more specific terms when we mean more specific risks.

Like The Singularity, it becomes a buzzword

What do you mean? Like people will use “AI risk” when their project has nothing to do with “AI-caused x-risk”? Couldn’t they do that even if we define “AI risk” to be “accidental technical AI risk”?

Journalists start projecting Terminator scenarios onto the words, and now have justification because even the researchers say that AI risk can mean a lot of different things.

Terminator scenarios seem to be scenarios of “accidental technical AI risk” (they’re just not very realistic scenarios) so I don’t see how defining “AI risk” to mean that would prevent journalists from using Terminator scenarios to illustrate “AI risk”.

It puts a whole bunch of types of risk into one basket, suggesting to outsiders that all attempts to reduce “AI risk” might be equally worthwhile.

I don’t think this is a good argument, because even within “accidental technical AI risk” there are different problems that aren’t equally worthwhile to solve, so why aren’t you already worried about outsiders thinking all those problems are equally worthwhile?

ML researchers start to distrust AI risk researchers, because people who are worried about the Terminator are using the same words as the AI risk researchers and therefore get associated with them.

See my response above regarding “Terminator scenarios”.

This can all be avoided by having a community norm to clarify that we mean technical accidental risk when we say AI risk, and when we’re talking about other types of risks we use more precise terminology.

I propose that we instead stick with historical precedent and keep “AI risk” to mean “AI-caused x-risk” and use more precise terminology to refer to more specific types of AI-caused x-risk that we might want to talk about. Aside from what I wrote above, it’s just more intuitive/commonsensical that “AI risk” means “AI-caused x-risk” in general instead of a specific kind of AI-caused x-risk.

However I appreciate that someone who works mostly on the less philosophical / less human-related problems might find it tiresome to say or type “technical accidental AI risk” all the time to describe what they do or to discuss the importance of their work, and can find it very tempting to just use “AI risk”. It would probably be good to create a (different) shorthand or acronym for it to remove this temptation and to make their lives easier.

• I appreciate the arguments, and I think you’ve mostly convinced me, mostly because of the historical argument. I do still have some remaining apprehension about using AI risk to describe every type of risk arising from AI.

I want to include philosophical failures, as long as the consequences of the failures flow through AI, because (aside from historical usage) technical problems and philosophical problems blend into each other, and I don’t see a point in drawing an arbitrary and potentially contentious border between them.

That is true. The way I see it, UDT is definitely on the technical side, even though it incorporates a large amount of philosophical background. When I say technical, I mostly mean “specific, uses math, has clear meaning within the language of computer science” rather than a more narrow meaning of “is related to machine learning” or something similar.

My issue with arguing for philosophical failure is that, as I’m sure you’re aware, there’s a well known failure mode of worrying about vague philosophical problems rather than more concrete ones. Within academic philosophy, the majority of discussion surrounding AI is centered around consciousness, intentionality, whether it’s possible to even construct a human-like machine, whether they should have rights etc. There’s a unique thread of philosophy that arose from LessWrong, which includes work on decision theory, that doesn’t focus on these thorny and low priority questions. While I’m comfortable with you arguing that philosophical failure is important, my impression is that the overly philosophical approach used by many people has done more harm than good for the field in the past, and continues to do so.

It is therefore sometimes nice to tell people that the problems that people work on here are concrete and specific, and don’t require doing a ton of abstract philosophy or political advocacy.

I don’t think this is a good ar­gu­ment, be­cause even within “ac­ci­den­tal tech­ni­cal AI risk” there are differ­ent prob­lems that aren’t equally worth­while to solve, so why aren’t you already wor­ried about out­siders think­ing all those prob­lems are equally worth­while? This is true, but my im­pres­sion is that when you tell peo­ple that a prob­lem is “tech­ni­cal” it gen­er­ally makes them re­frain from hav­ing a strong opinion be­fore un­der­stand­ing a lot about it. “Ac­ci­den­tal” also re­frames the dis­cus­sion by re­duc­ing the risk of po­lariz­ing bi­ases. This is a com­mon theme in many fields: • Physi­cists some­times get frus­trated with peo­ple ar­gu­ing about “the philos­o­phy of the in­ter­pre­ta­tion of quan­tum me­chan­ics” be­cause there’s a large sub­set of peo­ple who think that since it’s philo­soph­i­cal, then you don’t need to have any sub­ject-level ex­per­tise to talk about it. • Economists try to em­pha­size that they use mod­els and em­piri­cal data, be­cause a lot of peo­ple think that their field of study is more-or-less just high sta­tus opinion + math. Em­pha­siz­ing that there are real, spe­cific mod­els that they study helps to re­duce this im­pres­sion. Same with poli­ti­cal sci­ence. • A large frac­tion of tech work­ers are frus­trated about the use of Ma­chine Learn­ing as a buz­zword right now, and part of it is that peo­ple started say­ing Ma­chine Learn­ing = AI rather than Ma­chine Learn­ing = Statis­tics, and so a lot of peo­ple thought that even if they don’t un­der­stand statis­tics, they can un­der­stand AI since that’s like philos­o­phy and stuff. Scott Aaron­son has said But I’ve drawn much closer to the com­mu­nity over the last few years, be­cause of a com­bi­na­tion of fac­tors: [...] The AI-risk folks started pub­lish­ing some re­search pa­pers that I found in­ter­est­ing—some with rel­a­tively ap­proach­able prob­lems that I could see my­self try­ing to think about if quan­tum com­put­ing ever got bor­ing. 
This shift seems to have happened at roughly around the same time my former student, Paul Christiano, “defected” from quantum computing to AI-risk research.

My guess is that this shift in his thinking occurred because a lot of people started talking about technical risks from AI, rather than framing it as a philosophy problem, or a problem of eliminating bad actors. Eliezer has shared this viewpoint for years, writing in the CEV document,

Warning: Beware of things that are fun to argue.

reflecting the temptation to derail discussions about technical accidental risks.

• Also, isn’t defining “AI risk” as “technical accidental AI risk” analogous to defining “apple” as “red apple” (in terms of being circular/illogical)? I realize natural language doesn’t have to be perfectly logical, but this still seems a bit too egregious.

• I agree that this is troubling, though I think it’s similar to how I wouldn’t want the term biorisk to be expanded to include biodiversity loss (a risk, but not the right type), regular human terrorism (humans are biological, but it’s a totally different issue), zombie uprisings (they are biological, but it’s totally ridiculous), alien invasions etc.

Not to say that’s what you are doing with AI risk. I’m worried about what others will do with it if the term gets expanded.

• I agree that this is troubling, though I think it’s similar to how I wouldn’t want the term biorisk to be expanded …

Well as I said, natural language doesn’t have to be perfectly logical, and I think “biorisk” is somewhat in that category, but there’s an explanation that makes it a bit more reasonable than it might first appear, which is that the “bio” refers not to “biological” but to “bioweapon”. This is actually one of the definitions that Google gives when you search for “bio”: “relating to or involving the use of toxic biological or biochemical substances as weapons of war. ‘bioterrorism’”

I guess the analogous thing would be if we start using “AI” to mean “technical AI accidents” in a bunch of phrases, which feels worse to me than the “bio” case, maybe because “AI” is a standalone word/acronym instead of a prefix? Does this make sense to you?

Not to say that’s what you are doing with AI risk. I’m worried about what others will do with it if the term gets expanded.

But the term was expanded from the beginning. Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?

• Does this make sense to you?

Yeah that makes sense. Your points about “bio” not being short for “biological” were valid, but the fact that as a listener I didn’t know that fact implies that it seems really easy to mess up the language usage here. I’m starting to think that the real fight should be about using terms that aren’t self explanatory.

Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?

I’m not sure about whether it would have been prevented by using the term more narrowly, but in my experience the most common reaction people outside of EA/LW (and even sometimes within) have to hearing about AI risk is to assume that it’s not technical, and to assume that it’s not about accidents. In that sense, I have been exposed to quite a bit of this already.

• As far as I can tell, it took a while for individuals to notice, longer for it to become common knowledge, and even more time for anyone to do anything about it.

Tangential, but I wouldn’t be surprised if researchers were fairly quickly aware of the issue (e.g.
within two years of the origi­nal GAN pa­per), but it took a while to be­come com­mon knowl­edge be­cause it isn’t par­tic­u­larly flashy. (There’s a sur­pris­ing-to-me amount of know-how that is stored in re­searcher’s brains and never put down on pa­per.) Even now, the “solu­tions” are hacks that don’t com­pletely re­solve the is­sue. I mean, the solu­tion is to use a VAE. If you care about cov­er­ing modes but not image qual­ity, you choose a VAE; if you care about image qual­ity but not cov­er­ing modes, you choose a GAN. (Also, while I know very lit­tle about VAEs /​ GANs, Im­plicit Max­i­mum Like­li­hood Es­ti­ma­tion sounded like a prin­ci­pled fix to me.) This strikes me as a “ran­dom fix” where the core is­sue was that the sys­tem did not have suffi­cient dis­crim­i­na­tory power to tell apart a safe situ­a­tion from an un­safe situ­a­tion. In­stead of prop­erly solv­ing this prob­lem, the re­searchers put in a hack. Agreed, I would guess that the re­searchers /​ en­g­ineers knew this was risky and thought it was worth it any­way. Or per­haps the man­agers did. But I do agree this is ev­i­dence against my po­si­tion. I agree that we shouldn’t be wor­ried about situ­a­tions where there is a clear threat. But that’s not quite the class of failures that I’m wor­ried about. [...] Later, the prob­lems are dis­cov­ered and the re­ac­tion is to hack to­gether a solu­tion. Why isn’t the threat clear once the prob­lems are dis­cov­ered? un­less we get an AI equiv­a­lent of Ch­er­nobyl be­fore we get UFAI. Part of my claim is that we prob­a­bly will get that (as­sum­ing AI re­ally is risky), though per­haps not Ch­er­nobyl-level dis­aster, but still some­thing with real nega­tive con­se­quences that “could be worse”. • Why isn’t the threat clear once the prob­lems are dis­cov­ered? I think I should be more spe­cific, when you say: Sup­pose that we had ex­tremely com­pel­ling ev­i­dence that any AI sys­tem run with > X amount of com­pute would definitely kill us all. 
Do you expect that problem to get swept under the rug?

I mean that no one sane who knows that will run that AI system with > X amount of computing power. When I wrote that comment I also thought that no one sane would not blow the whistle in that event. See my note at the end of the comment.*

However, when presented with that evidence, I don’t expect the AI community to react appropriately. The correct response to that evidence is to stop what you’re doing, and revisit the entire process and culture that led to the creation of an algorithm that will kill us all if run with > X amount of compute. What I expect will happen is that the AI community will try and solve the problem the same way it’s solved every other problem it has encountered. It will try an inordinate amount of unprincipled hacks to get around the issue.

Part of my claim is that we probably will get that (assuming AI really is risky), though perhaps not Chernobyl-level disaster, but still something with real negative consequences that “could be worse”.

Conditional on no FOOM, I can definitely see plenty of events with real negative consequences that “could be worse”. However, I claim that anything short of a Chernobyl level event won’t shock the community and the world into changing its culture or trying to coordinate. I also claim that the capabilities gap between a Chernobyl level event and a global catastrophic event is small, such that even in a non-FOOM scenario the former might not happen before the latter. Together, I think that there is a high probability that we will not get a disaster that is scary enough to get the AI community to change its culture and coordinate before it’s too late.

*Now that I think about it more though, I’m less sure. Undergraduate engineers get entire lectures dedicated to how and when to blow the whistle when faced with unethical corporate practices and dangerous projects or designs. When working, they also have insurance and some degree of legal protection from vengeful employers. Even then, you still see cover ups of shortcomings that lead to major industrial disasters. For instance, long before the disaster, someone had determined that the Fukushima plant was indeed vulnerable to large tsunami impacts. The pattern where someone knows that something will go wrong but nothing is done to prevent it for one reason or another is not that uncommon in engineering disasters. Regardless of whether this is due to hindsight bias or an inadequate process for addressing safety issues, these disasters still happen regularly in fields with far more conservative, cautious, and safety oriented cultures.

I find it unlikely that the field of AI will change its culture from one of moving fast and hacking to something even more conservative and cautious than the cultures of consumer aerospace and nuclear engineering.

• Idk, I don’t know what to say here. I meet lots of AI researchers, and the best ones seem to me to be quite thoughtful. I can say what would change my mind:

I take the exploration of unprincipled hacks as very weak evidence against my position, if it’s just in an academic paper. My guess is the researchers themselves would not advocate deploying their solution, or would say that it’s worth deploying but it’s an incremental improvement that doesn’t solve the full problem. And even if the researchers don’t say that, I suspect the companies actually deploying the systems would worry about it.
I would take the de­ploy­ment of un­prin­ci­pled hacks more se­ri­ously as ev­i­dence, but even there I would want to be con­vinced that shut­ting down the AI sys­tem was a bet­ter de­ci­sion than de­ploy­ing an un­prin­ci­pled hack. (Be­cause then I would have made the same de­ci­sion in their shoes.) Un­prin­ci­pled hacks are in fact quite use­ful for the vast ma­jor­ity of prob­lems; as a re­sult it seems wrong to at­tribute ir­ra­tional­ity to peo­ple be­cause they use un­prin­ci­pled hacks. • I agree that ML of­ten does this, but only in situ­a­tions where the re­sults don’t im­me­di­ately mat­ter. I’d find it much more com­pel­ling to see ex­am­ples where the “ran­dom fix” caused ac­tual bad con­se­quences in the real world. Cur­rent ML cul­ture is to test 100′s of things in a lab un­til one works. This is fine as long as the AI’s be­ing tested are not smart enough to break out of the lab, or re­al­ize they are be­ing tested and play nice un­til de­ploy­ment. The de­fault way to test a de­sign is to run it and see, not to rea­son ab­stractly about it. and then we’ll have a prob­lem that is both very bad and (more) clearly real, and that’s when I ex­pect that it will be taken se­ri­ously. Part of the prob­lem is that we have a re­ally strong unilat­er­al­ist’s curse. It only takes 1, or a few peo­ple who don’t re­al­ize the prob­lem to make some­thing re­ally dan­ger­ous. Ban­ning it is also hard, law en­force­ment isn’t 100% effec­tive, differ­ent coun­tries have differ­ent laws and the main real world in­gre­di­ent is ac­cess to a com­puter. If the long-term con­cerns are real, we should get more ev­i­dence about them in the fu­ture, …I ex­pect that it will be taken se­ri­ously. The peo­ple who are ig­nor­ing or don’t un­der­stand the cur­rent ev­i­dence will carry on ig­nor­ing or not un­der­stand­ing it. A few more peo­ple will be con­vinced, but don’t ex­pect to con­vince a cre­ation­ist with one more tran­si­tional fos­sil. 
• Part of the prob­lem is that we have a re­ally strong unilat­er­al­ist’s curse. It only takes 1, or a few peo­ple who don’t re­al­ize the prob­lem to make some­thing re­ally dan­ger­ous. This is a foom-ish as­sump­tion; re­mem­ber that Ro­hin is ex­plic­itly talk­ing about a non-foom sce­nario. • ^ Yeah, in FOOM wor­lds I agree more with your (Don­ald’s) rea­son­ing. (Though I still have ques­tions, like, how ex­actly did some­one stum­ble upon the cor­rect math­e­mat­i­cal prin­ci­ples un­der­ly­ing in­tel­li­gence by trial and er­ror?) The peo­ple who are ig­nor­ing or don’t un­der­stand the cur­rent ev­i­dence will carry on ig­nor­ing or not un­der­stand­ing it. I don’t think we have good cur­rent ev­i­dence, so I don’t in­fer much about whether or not peo­ple will buy fu­ture ev­i­dence from their re­ac­tions to cur­rent ev­i­dence. (See also six heuris­tics that I think cut against AI risk even af­ter know­ing the ar­gu­ments for AI risk.) • Though I still have ques­tions, like, how ex­actly did some­one stum­ble upon the cor­rect math­e­mat­i­cal prin­ci­ples un­der­ly­ing in­tel­li­gence by trial and er­ror? You men­tioned that, con­di­tional on foom, you’d be con­fused about what the world looks like. Is this the main thing you’re con­fused about in foom wor­lds, or are there other ma­jor things too? • Lots of other things: • Are we imag­in­ing a small team of hack­ers in their base­ment try­ing to get AGI on a lap­top, or a big cor­po­ra­tion us­ing tons of re­sources? • How does the AGI learn about the world? If you say “it reads the In­ter­net”, how does it learn to read? • When the de­vel­op­ers re­al­ize that they’ve built AGI, is it still pos­si­ble for them to pull the plug? • Why doesn’t the AGI try to be de­cep­tive in ways that we can de­tect, the way chil­dren do? Is it just im­me­di­ately as ca­pa­ble as a smart hu­man and doesn’t need any train­ing? How can that hap­pen by just “find­ing the right ar­chi­tec­ture”? 
• Why is this likely to happen soon when it hasn’t happened in the last sixty years?

I suspect answers to these will provoke lots of other questions. In contrast, the non-foom worlds that still involve AGI + very fast growth seem much closer to a “business-as-usual” world.

I also think that if you’re worried about foom, you should basically not care about any of the work being done at DeepMind / OpenAI right now, because that’s not the kind of work that can foom (except in the “we suddenly find the right architecture” story); yet I notice lots of doomy predictions about AGI are being driven by DM / OAI’s work. (Of course, plausibly you think OpenAI / DM are not going to succeed, even if others do.)

• I’m going to start a fresh thread on this, it sounds more interesting (at least to me) than most of the other stuff being discussed here.

• Yeah, in FOOM worlds I agree more with your (Donald’s) reasoning. (Though I still have questions, like, how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?)

If there’s an implicit assumption here that FOOM worlds require someone to stumble upon “the correct mathematical principles underlying intelligence”, I don’t understand why such an assumption is justified. For example, suppose that at some point in the future some top AI lab will throw $1B at a single massive neural architecture search (over some arbitrary slightly-novel architecture space) and that NAS will stumble upon some complicated architecture that its corresponding model, after being trained with a massive amount of computing power, will implement an AGI.

• and that NAS will stum­ble upon some com­pli­cated ar­chi­tec­ture that its cor­re­spond­ing model, af­ter be­ing trained with a mas­sive amount of com­put­ing power, will im­ple­ment an AGI.

In this case I’m ask­ing why the NAS stum­bled upon the cor­rect math­e­mat­i­cal ar­chi­tec­ture un­der­ly­ing in­tel­li­gence.

Or rather, let’s dis­pense with the word “math­e­mat­i­cal” (which I mainly used be­cause it seems to me that the ar­gu­ments for FOOM usu­ally in­volve some­one com­ing up with the right math­e­mat­i­cal in­sight un­der­ly­ing in­tel­li­gence).

It seems to me that to get FOOM you need the prop­erty “if you make even a slight change to the thing, then it breaks and doesn’t work”, which I’ll call frag­ility. Note that you can­not find frag­ile things us­ing lo­cal search, ex­cept if you “get lucky” and start out at the cor­rect solu­tion.

Why did the NAS stum­ble upon the cor­rect frag­ile ar­chi­tec­ture un­der­ly­ing in­tel­li­gence?
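The fragility claim above can be illustrated with a toy sketch. This is not a model of NAS or of ML research; the one-dimensional landscapes, the `TARGET` value, and the function names are all hypothetical and chosen only for illustration. It shows a greedy local search succeeding on a landscape where partial progress is rewarded, and failing on a "fragile" landscape where anything other than the exact solution scores zero, unless the search happens to start at the solution.

```python
import random

TARGET = 500  # hypothetical location of the one working solution


def smooth(x):
    # Robust landscape: every step toward TARGET scores a little better.
    return -abs(x - TARGET)


def fragile(x):
    # Fragile landscape: a slight change away from TARGET breaks everything.
    return 1 if x == TARGET else 0


def hill_climb(score, start, steps=5000, seed=0):
    """Greedy local search: propose a neighbor, keep it only if it scores higher."""
    rng = random.Random(seed)
    x = start
    for _ in range(steps):
        candidate = x + rng.choice([-1, 1])
        if score(candidate) > score(x):
            x = candidate
    return x


smooth_result = hill_climb(smooth, start=0)         # climbs the gradient to TARGET
fragile_result = hill_climb(fragile, start=0)       # no signal to follow, never moves
lucky_result = hill_climb(fragile, start=TARGET)    # only works by starting at the answer

print(smooth_result, fragile_result, lucky_result)
```

On the fragile landscape the search gets no feedback that it is getting closer, which is the sense in which local search cannot find fragile things except by luck in the starting point.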

• It seems to me that to get FOOM you need the prop­erty “if you make even a slight change to the thing, then it breaks and doesn’t work”

The above ‘FOOM via $1B NAS’ sce­nario doesn’t seem to me to re­quire this prop­erty. No­tice that the in­crease in ca­pa­bil­ities dur­ing that NAS may be grad­ual (i.e. be­fore eval­u­at­ing the model that im­ple­ments an AGI the NAS eval­u­ates mod­els that are “al­most AGI”). The sce­nario would still count as a FOOM as long as the NAS yields an AGI and no model be­fore that NAS ever came close to AGI. Con­di­tioned on [$1B NAS yields the first AGI], a FOOM seems to me par­tic­u­larly plau­si­ble if ei­ther:

1. no pre­vi­ous NAS at a similar scale was ever car­ried out; or

2. the “path in model space” that the NAS traverses is very different from all the paths that previous NASs traversed. This seems to me plausible even if the model space of the $1B NAS is identical to ones used in previous NASs (e.g. if different random seeds yield very different paths); and it seems to me even more plausible if the model space of the $1B NAS is slightly novel.

• The above ‘FOOM via $1B NAS’ sce­nario doesn’t seem to me to re­quire this prop­erty. No­tice that the in­crease in ca­pa­bil­ities dur­ing that NAS may be grad­ual (i.e. be­fore eval­u­at­ing the model that im­ple­ments an AGI the NAS eval­u­ates mod­els that are “al­most AGI”). The sce­nario would still count as a FOOM as long as the NAS yields an AGI and no model be­fore that NAS ever came close to AGI. In this case I’d ap­ply the frag­ility ar­gu­ment to the re­search pro­cess, which was my origi­nal point (though it wasn’t phrased as well then). In the NAS set­ting, my ques­tion is: how ex­actly did some­one stum­ble upon the cor­rect NAS to run that would lead to in­tel­li­gence by trial and er­ror? Ba­si­cally, if you’re ar­gu­ing that most ML re­searchers just do a bunch of trial-and-er­ror, then you should be mod­el­ing ML re­search as a lo­cal search in idea-space, and then you can ap­ply the same frag­ility ar­gu­ment to it. • Con­di­tioned on [$1B NAS yields the first AGI], that NAS it­self may es­sen­tially be “a lo­cal search in idea-space”. My ar­gu­ment is that such a lo­cal search in idea-space need not start in a world where “al­most-AGI” mod­els already ex­ist (I listed in the grand­par­ent two dis­junc­tive rea­sons in sup­port of this).

Re­lat­edly, “mod­el­ing ML re­search as a lo­cal search in idea-space” is not nec­es­sar­ily con­tra­dic­tory to FOOM, if an im­por­tant part of that lo­cal search can be car­ried out with­out hu­man in­volve­ment (which is a sup­po­si­tion that seems to be sup­ported by the rise of NAS and meta-learn­ing ap­proaches in re­cent years).

I don’t see how my rea­son­ing here re­lies on it be­ing pos­si­ble to “find frag­ile things us­ing lo­cal search”.

• (I listed in the grand­par­ent two dis­junc­tive rea­sons in sup­port of this).

Okay, re­spond­ing to those di­rectly:

no pre­vi­ous NAS at a similar scale was ever car­ried out; or

• What caused the researchers to go from “$1M run of NAS” to “$1B run of NAS”, without first trying “$10M run of NAS”? I especially have this question if you’re modeling ML research as “trial and error”; I can imagine justifying a $1B experiment before a $10M experiment if you have some compelling reason that the result you want will happen with the $1B experiment but not the $10M experiment; but if you’re doing trial and error then you don’t have a compelling reason.

• Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don’t we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?

• Suppose that such a NAS did lead to human-level AGI. Shouldn’t that mean that the AGI makes progress in AI at the same rate that we did? How does that cause a FOOM? (Yes, the improvements the AI makes compound, whereas the improvements we make to AI don’t compound, but to me that’s the canonical case of continuous takeoff, e.g. as described in Takeoff speeds.)

the “path in model space” that the NAS traverses is very different from all the paths that previous NASs traversed. This seems to me plausible even if the model space of the $1B NAS is identical to ones used in previous NASs (e.g. if different random seeds yield very different paths); and it seems to me even more plausible if the model space of the $1B NAS is slightly novel.

In all the previous NASs, why did the paths taken produce AI systems that were so much worse than the one taken by the $1B NAS? Did the $1B NAS just get lucky? (Again, this really sounds like a claim that “the path taken by NAS” is fragile.)

Relatedly, “modeling ML research as a local search in idea-space” is not necessarily contradictory to FOOM, if an important part of that local search can be carried out without human involvement

If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:

• The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn’t it automated earlier?)

• The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)

The second point definitely doesn’t apply to NAS and meta-learning, and I would argue that the first point doesn’t apply either, though that’s not obvious.

• What caused the researchers to go from “$1M run of NAS” to “$1B run of NAS”, without first trying “$10M run of NAS”? I especially have this question if you’re modeling ML research as “trial and error”;

I indeed model a big part of contemporary ML research as “trial and error”. I agree that it seems unlikely that before the first $1B NAS there won’t be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

Current AI systems are very subhuman, and throwing more money at NAS has led to relatively small improvements. Why don’t we expect similar incremental improvements from the next 3-4 orders of magnitude of compute?

If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don’t fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution). More generally, when using trend extrapolations in AI, consider the following from this Open Phil blog post (2016) by Holden Karnofsky (footnote 7):

The most exhaustive retrospective analysis of historical technology forecasts we have yet found, Mullins (2012), categorized thousands of published technology forecasts by methodology, using eight categories including “multiple methods” as one category. [...] However, when comparing success rates for methodologies solely within the computer technology area tag, quantitative trend analysis performs slight below average,

(The link in the quote appears to be broken, here is one that works.)

NAS seems to me like a good example for an expensive computation that could plausibly constitute a “search in idea-space” that finds an AGI model (without human involvement).
But my ar­gu­ment here ap­plies to any such com­pu­ta­tion. I think it may even ap­ply to a ‘$1B SGD’ (on a sin­gle huge net­work), if we con­sider a gra­di­ent up­date (or a se­quence thereof) to be an “ex­plo­ra­tion step in idea-space”.

Sup­pose that such a NAS did lead to hu­man-level AGI. Shouldn’t that mean that the AGI makes progress in AI at the same rate that we did?

I first need to un­der­stand what “hu­man-level AGI” means. Can mod­els in this cat­e­gory pass strong ver­sions of the Tur­ing test? Does this cat­e­gory ex­clude sys­tems that out­perform hu­mans on one or more im­por­tant di­men­sions? (It seems to me that the first SGD-trained model that passes strong ver­sions of the Tur­ing test may be a su­per­in­tel­li­gence.)

In all the previous NASs, why did the paths taken produce AI systems that were so much worse than the one taken by the $1B NAS? Did the $1B NAS just get lucky?

Yes, the $1B NAS may indeed just get lucky. A local search sometimes gets lucky (in the sense of finding a local optimum that is a lot better than the ones found in most runs; not in the sense of miraculously starting the search at a great fragile solution). [EDIT: also, something about this NAS might be slightly novel, like the neural architecture space.]

If you want to make the case for a discontinuity because of the lack of human involvement, you would need to argue:

• The replacement for humans is way cheaper / faster / more effective than humans (in that case why wasn’t it automated earlier?)

• The discontinuity happens as soon as humans are replaced (otherwise, the system-without-human-involvement becomes the new baseline, and all future systems will look like relatively continuous improvements of this system)

In some past cases where humans did not serve any role in performance gains that were achieved with more compute/data (e.g. training GPT-2 by scaling up GPT), there were no humans to replace. So I don’t understand the question “why wasn’t it automated earlier?”

In the second point, I need to first understand how you define that moment in which “humans are replaced”. (In the $1B NAS scenario, would that moment be the one in which the NAS is invoked?)
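The “gets lucky” sense of local search described in this comment (a run landing on a much better local optimum than most runs, without the optimum being fragile) can be sketched with a toy example. Everything here is a hypothetical illustration, not a model of any real NAS: the landscape has ten broad peaks, one of which (`PEAK_HEIGHTS[4]`) is arbitrarily set much taller than the rest, and the same greedy search is run from many starting points.

```python
import random
from statistics import median

# Hypothetical toy landscape: ten broad peaks, one much taller than the others.
PEAK_HEIGHTS = [60, 60, 60, 60, 200, 60, 60, 60, 60, 60]
WIDTH = 100  # each peak spans 100 consecutive integers


def score(x):
    """Height at x: the local peak's height minus the distance to its center."""
    if not 0 <= x < WIDTH * len(PEAK_HEIGHTS):
        return float("-inf")
    bucket = x // WIDTH
    center = bucket * WIDTH + WIDTH // 2
    return PEAK_HEIGHTS[bucket] - abs(x - center)


def hill_climb(start, seed, steps=2000):
    """Greedy local search: accept a random neighbor only if it scores higher."""
    rng = random.Random(seed)
    x = start
    for _ in range(steps):
        candidate = x + rng.choice([-1, 1])
        if score(candidate) > score(x):
            x = candidate
    return x


# One run per starting point, spread evenly across the landscape.
starts = range(5, 1000, 10)
final_scores = [score(hill_climb(s, seed=s)) for s in starts]

typical = median(final_scores)
best = max(final_scores)
lucky_runs = sum(1 for v in final_scores if v == max(PEAK_HEIGHTS))

print(f"typical final score: {typical}, best: {best}, lucky runs: {lucky_runs}/{len(final_scores)}")
```

In this setup the tall peak is robust (small changes near it still score well), yet only the runs that start on its slope end there. Whether real searches look more like this or more like a smooth single-peak landscape is exactly what the thread is disputing; the sketch takes no side on that.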

• Meta: I feel like I am ar­gu­ing for “there will not be a dis­con­ti­nu­ity”, and you are in­ter­pret­ing me as ar­gu­ing for “we will not get AGI soon /​ AGI will not be trans­for­ma­tive”, nei­ther of which I be­lieve. (I have wide un­cer­tainty on timelines, and I cer­tainly think AGI will be trans­for­ma­tive.) I’d like you to state what po­si­tion you think I’m ar­gu­ing for, taboo­ing “dis­con­ti­nu­ity” (not the ar­gu­ments for it, just the po­si­tion).

I indeed model a big part of contemporary ML research as “trial and error”. I agree that it seems unlikely that before the first $1B NAS there won’t be any $10M NAS. Suppose there will even be a $100M NAS just before the $1B NAS that (by assumption) results in AGI. I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me. I’m more uncertain about the fire alarm question.

quantitative trend analysis performs slight below average [...] NAS seems to me like a good example for an expensive computation that could plausibly constitute a “search in idea-space” that finds an AGI model [...] it may even apply to a ‘$1B SGD’ (on a single huge network) [...] the $1B NAS may indeed just get lucky

This sounds to me like saying “well, we can’t trust predictions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”. I am not compelled by arguments that tell me to worry about scenario X without giving me a reason to believe that scenario X is likely. (Compare: “we can’t rule out the possibility that the simulators want us to build a tower to the moon or else they’ll shut off the simulation, so we better get started on that moon tower.”)

This is not to say that such scenario X’s must be false (reality could be that way), but that given my limited amount of time, I must prioritize which scenarios to pay attention to, and one really good heuristic for that is to focus on scenarios that have some inside-view reason that makes me think they are likely. If I had infinite time, I’d eventually consider these scenarios (even the simulators wanting us to build a moon tower hypothesis).

Some other more tangential things:

If we look at the history of deep learning from ~1965 to 2019, how well do trend extrapolation methods fare in terms of predicting performance gains for the next 3-4 orders of magnitude of compute? My best guess is that they don’t fare all that well. For example, based on data prior to 2011, I assume such methods predict mostly business-as-usual for deep learning during 2011-2019 (i.e. completely missing the deep learning revolution).

The trend that changed in 2012 was that of the amount of compute applied to deep learning. I suspect trend extrapolation with compute as the x-axis would do okay; trend extrapolation with calendar year as the x-axis would do poorly. But as I mentioned above, this is not a crux for me, since it doesn’t give me an inside-view reason to expect FOOM; I wouldn’t even consider it weak evidence for FOOM if I changed my mind on this. (If the data showed a big discontinuity, that would be evidence, but I’m fairly confident that while there was a discontinuity it was relatively small.)

• I’d like you to state what position you think I’m arguing for

I think you’re arguing for something like: Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition “we will create an AGI sometime in the next 30 days”.

(Tbc, I did not interpret you as arguing about timelines or AGI transformativeness; and neither did I argue about those things here.)

I’m arguing against FOOM, not about whether there will be a fire alarm. The fire alarm question seems orthogonal to me.

Using the “fire alarm” concept here was a mistake, sorry for that. Instead of writing:

I’m pretty agnostic about whether the result of that $100M NAS would serve as a fire alarm for AGI.

I should have writ­ten:

I’m pretty ag­nos­tic about whether the re­sult of that $100M NAS would be “al­most AGI”. This sounds to me like say­ing “well, we can’t trust pre­dic­tions based on past data, and we don’t know that we won’t find an AGI, so we should worry about that”. I gen­er­ally have a vague im­pres­sion that many AIS/​x-risk peo­ple tend to place too much weight on trend ex­trap­o­la­tion ar­gu­ments in AI (or tend to not give enough at­ten­tion to im­por­tant de­tails of such ar­gu­ments), which may have trig­gered me to write the re­lated stuff (in re­sponse to you seem­ingly ap­ply­ing a trend ex­trap­o­la­tion ar­gu­ment with re­spect to NAS). I was not list­ing the rea­sons for my be­liefs speci­fi­cally about NAS. If I had in­finite time, I’d even­tu­ally con­sider these sce­nar­ios (even the simu­la­tors want­ing us to build a moon tower hy­poth­e­sis). (I’m mind­ful of your time and so I don’t want to branch out this dis­cus­sion into un­re­lated top­ics, but since this seems to me like a po­ten­tially im­por­tant point...) Even if we did have in­finite time and the abil­ity to some­how de­ter­mine the cor­rect­ness of any given hy­poth­e­sis with su­per-high-con­fi­dence, we may not want to eval­u­ate all hy­pothe­ses—that in­volve other agents—in ar­bi­trary or­der. Due to game the­o­ret­i­cal stuff, the or­der in which we do things may mat­ter (e.g. due to com­mit­ment races in log­i­cal time). For ex­am­ple, af­ter con­sid­er­ing some game-the­o­ret­i­cal meta con­sid­er­a­tions we might de­cide to make cer­tain bind­ing com­mit­ments be­fore eval­u­at­ing such and such hy­pothe­ses; or we might de­cide about what ad­di­tional things we should con­sider or do be­fore eval­u­at­ing some other hy­pothe­ses, etcetera. Con­di­tioned on the first AGI be­ing al­igned, it may be im­por­tant to figure out how do we make sure that that AGI “be­haves wisely” with re­spect to this topic (be­cause the AGI might be able to eval­u­ate a lot of weird hy­pothe­ses that we can’t). 
• Due to game theoretical stuff, the order in which we do things may matter (e.g. due to commitment races in logical time).

Can you give me an example? I don’t see how this would work. (Tbc, I’m imagining that the universe stops, and only I continue thinking; there are no other agents thinking while I’m thinking, and so afaict I should just implement UDT.)

• Creating some sort of commitment device that would bind us to follow UDT, before we evaluate some set of hypotheses, is an example for one potentially consequential intervention.

As an aside, my understanding is that in environments that involve multiple UDT agents, UDT doesn’t necessarily work well (or is not even well-defined?).

Also, if we would use SGD to train a model that ends up being an aligned AGI, maybe we should figure out how to make sure that that model “follows” a good decision theory. (Or does this happen by default? Does it depend on whether “following a good decision theory” is helpful for minimizing expected loss on the training set?)

• Conditioned on [the first AGI is created at time t by AI lab X], it is very unlikely that immediately before t the researchers at X have a very low credence in the proposition “we will create an AGI sometime in the next 30 days”.

It wasn’t exactly that (in particular, I didn’t have the researcher’s beliefs in mind), but I also believe that statement for basically the same reasons so that should be fine. There’s a lot of ambiguity in that statement (specifically, what is AGI), but I probably believe it for most operationalizations of AGI.

(For reference, I was considering “will there be a 1 year doubling of economic output that started before the first 4 year doubling of economic output ended”; for that it’s not sufficient to just argue that we will get AGI suddenly, you also have to argue that the AGI will very quickly become superintelligent enough to double economic output in a very short amount of time.)

I’m pretty agnostic about whether the result of that \$100M NAS would be “almost AGI”.

I mean, the difference between a \$100M NAS and a \$1B NAS is:

• Up to 10x the num­ber of mod­els evaluated

• Up to 10x the size of mod­els evaluated

If you increase the number of models by 10x and leave the size the same, that somewhat increases your optimization power. If you model the NAS as picking architectures randomly, the \$1B NAS can have at most 10x the chance of finding AGI, regardless of fragility, and so can only have at most 10x the expected “value” (whatever your notion of “value”). If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn’t make much of a difference, e.g. the max of $n$ draws from Uniform([0, 1]) has expected value $\frac{n}{n+1}$, so once $n$ is already large (e.g. 100), increasing it makes ~no difference. Of course, our actual distributions will probably be more bottom-heavy, but as distributions get more bottom-heavy we use gradient descent / evolutionary search to deal with that.

For the size, it’s possible that increases in size lead to huge increases in intelligence, but that doesn’t seem to agree with ML practice so far. Even if you ignore trend extrapolation, I don’t see a reason to expect that increasing model sizes should mean the difference between not-even-close-to-AGI and AGI.

• If you model the NAS as picking architectures randomly

I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)

If you then also model architectures as non-fragile, then once you have some optimization power, adding more optimization power doesn’t make much of a difference,

Earlier in this discussion you defined fragility as the property “if you make even a slight change to the thing, then it breaks and doesn’t work”. While finding fragile solutions is hard, finding a non-fragile solution is not necessarily easy, so I don’t follow the logic of that paragraph.

Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be $10^{-10}$ (and then running our evolutionary search 10x times means roughly 10x probability of finding an AGI architecture, if [number of runs] $\ll 10^{10}$).

• I don’t. NAS can be done with RL or evolutionary computation methods. (Tbc, when I said I model a big part of contemporary ML research as “trial and error”, by trial and error I did not mean random search.)

I do think that similar conclusions apply there as well, though I’m not going to make a mathematical model for it.

finding a non-fragile solution is not necessarily easy

I’m not saying it is; I’m saying that however hard it is to find a non-fragile good solution, it is easier to find a solution that is almost as good. When I say “adding more optimization power doesn’t make much of a difference” I mean to imply that the existing optimization power will do most of the work, for whatever quality of solution you are getting.

Suppose that all model architectures are indeed non-fragile, and some of them can implement AGI (call them “AGI architectures”). It may be the case that relative to the set of model architectures that we can end up with when using our favorite method (e.g. evolutionary search), the AGI architectures are a tiny subset. E.g. the size ratio can be $10^{-10}$ (and then running our evolutionary search 10x times means roughly 10x probability of finding an AGI architecture, if [number of runs] $\ll 10^{10}$).

(Aside: it would be way smaller than $10^{-10}$.) In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (i.e. larger than $10^{-10}$), and so you’re more likely to find one of those first.

In practice, if you have a thousand parameters that determine an architecture, and 10 settings for each of them, the size ratio for the (assumed unique) globally best architecture is $10^{-1000}$. In this setting, I expect several orders of magnitude of difference between the size ratio of almost-AGI and the size ratio of AGI, making it essentially guaranteed that you find an almost-AGI architecture before an AGI architecture.

• In this scenario, my argument is that the size ratio for “almost-AGI architectures” is better (i.e. larger than $10^{-10}$), and so you’re more likely to find one of those first.

For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”].

The “\$1B NAS discontinuity scenario” allows for the \$1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.

• For a “local search NAS” (rather than “random search NAS”) it seems that we should be considering here the set of [“almost-AGI architectures” from which the local search would not find an “AGI architecture”]. The “\$1B NAS discontinuity scenario” allows for the \$1B NAS to find “almost-AGI architectures” before finding an “AGI architecture”.

Agreed. My point is that the \$100M NAS would find the almost-AGI architectures. (My point with the size ratios is that whatever criterion you use to say “and that’s why the \$1B NAS finds AGI while the \$100M NAS doesn’t”, my response would be that “well, almost-AGI architectures require a slightly easier-to-achieve value of <criterion>, that the \$100M NAS would have achieved”.)
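
The two quantitative claims in this exchange can be checked with a few lines of Python. This is only a sketch under the simplest model discussed above, in which the search draws independently and uniformly at random (a model one of the commenters explicitly rejects for real NAS, which typically uses RL or evolutionary methods). The function names `expected_max_uniform` and `prob_at_least_one_hit` are made up for this illustration. The first uses the standard result that the maximum of n independent Uniform([0, 1]) draws has expected value n/(n+1); the second computes the chance that at least one of k independent draws lands in a target subset whose size ratio is p, namely 1 - (1 - p)^k.

```python
import math
from fractions import Fraction


def expected_max_uniform(n: int) -> Fraction:
    """Exact expected value of the max of n i.i.d. Uniform([0, 1]) draws: n / (n + 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Fraction(n, n + 1)


def prob_at_least_one_hit(size_ratio: float, runs: int) -> float:
    """Chance that at least one of `runs` independent uniform draws lands in a
    target subset that makes up `size_ratio` of the search space.

    Assumption for illustration only: draws are independent and uniform,
    which is the "random search" model, not a local or evolutionary search.
    """
    if not 0.0 <= size_ratio < 1.0:
        raise ValueError("size_ratio must be in [0, 1)")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # 1 - (1 - p)^k, written with log1p/expm1 so tiny p does not lose precision.
    return -math.expm1(runs * math.log1p(-size_ratio))


if __name__ == "__main__":
    # Diminishing returns: going from 100 to 1000 draws barely moves the expected max.
    for n in (10, 100, 1000):
        print(n, float(expected_max_uniform(n)))

    # With a size ratio of 1e-10 and far fewer than 1e10 runs,
    # 10x the runs gives roughly 10x the hit probability.
    ratio = 1e-10
    print(prob_at_least_one_hit(ratio, 10_000) / prob_at_least_one_hit(ratio, 1_000))
```

Under these assumptions both statements hold at once: extra draws add little to the best value found once n is large, while the chance of hitting a very rare target still scales almost linearly with the number of draws.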

• I’ve seen the “ML gets de­ployed care­lessly” nar­ra­tive pop up on LW a bunch, and while it does seem ac­cu­rate in many cases, I wanted to note that there are counter-ex­am­ples. The most promi­nent counter-ex­am­ple I’m aware of is the in­cred­ibly cau­tious ap­proach Deep­Mind/​Google took when de­sign­ing the ML sys­tem that cools Google’s dat­a­cen­ters.

• This seems to be care­ful de­ploy­ment. The con­cept of de­ploy­ment is go­ing from an AI in the lab, to the same AI in con­trol of a real world sys­tem. Sup­pose your de­sign pro­cess was to fid­dle around in the lab un­til you make some­thing that seems to work. Once you have that, you look at it to un­der­stand why it works. You try to prove the­o­rems about it. You sub­ject it to some ex­ten­sive bat­tery of test­ing and will only put it in a self driv­ing car/​ data cen­ter cool­ing sys­tem once you are con­fi­dent it is safe.

There are two places this could fail. Your test­ing pro­ce­dures could be in­suffi­cient, or your AI could hack out of the lab be­fore the test­ing starts. I see lit­tle to no defense against the lat­ter.

• Would it be fair to sum­ma­rize your view here as “As­sum­ing no foom, we’ll be able to iter­ate, and that’s prob­a­bly enough.”?

• Hmm, I think I’d want to ex­plic­itly in­clude two other points, that are kind of in­cluded in that but don’t get com­mu­ni­cated well by that sum­mary:

• There may not be a prob­lem at all; per­haps by de­fault pow­er­ful AI sys­tems are not goal-di­rected.

• If there is a prob­lem, we’ll get ev­i­dence of its ex­is­tence be­fore it’s too late, and co­or­di­na­tion to not build prob­le­matic AI sys­tems will buy us ad­di­tional time.

• Cool, just wanted to make sure I’m en­gag­ing with the main ar­gu­ment here. With that out of the way...

• I gen­er­ally buy the “no foom ⇒ iter­ate ⇒ prob­a­bly ok” sce­nario. There are some caveats and qual­ifi­ca­tions, but broadly-defined “no foom” is a crux for me—I ex­pect at least some kind of de­ci­sive strate­gic ad­van­tage for early AGI, and would find the “al­igned by de­fault” sce­nario plau­si­ble in a no-foom world.

• I do not think that a lack of goal-di­rect­ed­ness is par­tic­u­larly rele­vant here. If an AI has ex­treme ca­pa­bil­ities, then a lack of goals doesn’t re­ally make it any safer. At some point I’ll prob­a­bly write a post about Don Nor­man’s fridge which talks about this in more depth, but the short ver­sion is: if we have an AI with ex­treme ca­pa­bil­ities but a con­fus­ing in­ter­face, then there’s a high chance that we all die, goal-di­rec­tion or not. In the “no foom” sce­nario, we’re as­sum­ing the AI won’t have those ex­treme ca­pa­bil­ities, but it’s foom vs no foom which mat­ters there, not goals vs no goals.

• I also dis­agree with co­or­di­na­tion hav­ing any hope what­so­ever if there is a prob­lem. There’s a huge unilat­er­al­ist prob­lem there, with mil­lions of peo­ple each eas­ily able to push the shiny red but­ton. I think straight-up solv­ing all of the tech­ni­cal al­ign­ment prob­lems would be much eas­ier than that co­or­di­na­tion prob­lem.

Look­ing at both the first and third point, I sus­pect that a sub-crux might be ex­pec­ta­tions about the re­source re­quire­ments (i.e. com­pute & data) needed for AGI. I ex­pect that, once we have the key con­cepts, hu­man-level AGI will be able to run in re­al­time on an or­di­nary lap­top. (Train­ing might re­quire more re­sources, at least early on. That would re­duce the unilat­er­al­ist prob­lem, but in­crease the chance of de­ci­sive strate­gic ad­van­tage due to the higher bar­rier to en­try.)

EDIT: to clar­ify, those sec­ond two points are both con­di­tioned on foom. Point be­ing, the only thing which ac­tu­ally mat­ters here is foom vs no foom:

• if there’s no foom, then we can prob­a­bly iter­ate, and then we’re prob­a­bly fine any­way (re­gard­less of goal-di­rec­tion, co­or­di­na­tion, etc).

• if there’s foom, then a lack of goal-di­rec­tion won’t help much, and co­or­di­na­tion is un­likely to work.

• the only thing which ac­tu­ally mat­ters here is foom vs no foom

Yeah, I think I mostly agree with this.

if we have an AI with ex­treme ca­pa­bil­ities but a con­fus­ing in­ter­face, then there’s a high chance that we all die

Yeah, I agree with that (as­sum­ing “ex­treme ca­pa­bil­ities” = re­ar­rang­ing atoms how­ever it sees fit, or some­thing of that na­ture), but why must it have a con­fus­ing in­ter­face? Couldn’t you just talk to it, and it would know what you mean? So I do think the goal-di­rected point does mat­ter.

I sus­pect that a sub-crux might be ex­pec­ta­tions about the re­source re­quire­ments (i.e. com­pute & data) needed for AGI. I ex­pect that, once we have the key con­cepts, hu­man-level AGI will be able to run in re­al­time on an or­di­nary lap­top.

I agree that this is a sub-crux. Note that I be­lieve that even­tu­ally hu­man-level AGI will be able to run on a lap­top, just that it will be pre­ceded by hu­man-level AGIs that take more com­pute.

Train­ing might re­quire more re­sources, at least early on. That would re­duce the unilat­er­al­ist prob­lem, but in­crease the chance of de­ci­sive strate­gic ad­van­tage due to the higher bar­rier to en­try.

I tend to think that if prob­lems arise, you’ve mostly lost already, so I’m ac­tu­ally hap­pier about de­ci­sive strate­gic ad­van­tage be­cause it re­duces com­pet­i­tive pres­sure.

But tbc, I broadly agree with all of your points, and do think that in FOOM wor­lds most of my ar­gu­ments don’t work. (Though I con­tinue to be con­fused what ex­actly a FOOM world looks like.)

• but why must it have a con­fus­ing in­ter­face? Couldn’t you just talk to it, and it would know what you mean?

That’s where the Don Nor­man part comes in. In­ter­faces to com­pli­cated sys­tems are con­fus­ing by de­fault. The gen­eral prob­lem of sys­tem­at­i­cally build­ing non-con­fus­ing in­ter­faces is, in my mind at least, roughly equiv­a­lent to the full tech­ni­cal prob­lem of AI al­ign­ment. (Writ­ing a pro­gram which knows what you mean is also, in my mind, roughly equiv­a­lent to the full tech­ni­cal prob­lem of AI al­ign­ment.) A word­ing which makes it more ob­vi­ous:

• The main prob­lem of AI al­ign­ment is to trans­late what a hu­man wants into a for­mat us­able by a machine

• The main prob­lem of user in­ter­face de­sign is to help/​al­low a hu­man to trans­late what they want into a for­mat us­able by a machine

Some­thing like e.g. tool AI puts more of the trans­la­tion bur­den on the hu­man, rather than on the AI, but that doesn’t make the trans­la­tion it­self any less difficult.

In a non-foomy world, the trans­la­tion doesn’t have to be perfect—hu­man­ity won’t be wiped out if the AI doesn’t quite perfectly un­der­stand what we mean. Ex­treme ca­pa­bil­ities make high-qual­ity trans­la­tion more im­por­tant, not just be­cause of Good­hart, but be­cause the trans­la­tion it­self will break down in sce­nar­ios very differ­ent from what hu­mans are used to. So if the AI has the ca­pa­bil­ities to achieve sce­nar­ios very differ­ent from what hu­mans are used to, then that trans­la­tion needs to be quite good.

• Do you agree that an AI with ex­treme ca­pa­bil­ities should know what you mean, even if it doesn’t act in ac­cor­dance with it? (This seems like an im­pli­ca­tion of “ex­treme ca­pa­bil­ities”.)

• No. The whole no­tion of a hu­man “mean­ing things” pre­sumes a cer­tain level of ab­strac­tion. One could imag­ine an AI sim­ply rea­son­ing about molecules or fields (or at least in­di­vi­d­ual neu­rons), with­out hav­ing any need for view­ing cer­tain chunks of mat­ter as hu­mans who mean things. In prin­ci­ple, no pre­dic­tive power what­so­ever would be lost in that view of the world.

That said, I do think that prob­lem is less cen­tral/​im­me­di­ate than the prob­lem of tak­ing an AI which does know what we mean, and point­ing at that AI’s con­cept-of-what-we-mean—i.e. in or­der to pro­gram the AI to do what we mean. Even if an AI learns a con­cept of hu­man val­ues, we still need to be able to point to that con­cept within the AI’s con­cept-space in or­der to ac­tu­ally al­ign it—and that means trans­lat­ing be­tween AI-no­tion-of-what-we-want and our-no­tion-of-what-we-want.

• That’s the crux for me; I ex­pect AI sys­tems that we build to be ca­pa­ble of “know­ing what you mean” (us­ing the ap­pro­pri­ate level of ab­strac­tion). They may also use other lev­els of ab­strac­tion, but I ex­pect them to be ca­pa­ble of us­ing that one.

Even if an AI learns a con­cept of hu­man val­ues, we still need to be able to point to that con­cept within the AI’s con­cept-space in or­der to ac­tu­ally al­ign it

Yes, I would call that the cen­tral prob­lem. (Though it would also be fine to build a poin­ter to a hu­man and have the AI “help the hu­man”, with­out nec­es­sar­ily point­ing to hu­man val­ues.)

• Yes, I would call that the cen­tral prob­lem. (Though it would also be fine to build a poin­ter to a hu­man and have the AI “help the hu­man”, with­out nec­es­sar­ily point­ing to hu­man val­ues.)

How would we do either of those things without a workable theory of embedded agency, abstraction, some idea of what kind-of-structure human values have, etc?

• If you wanted a prov­able guaran­tee be­fore pow­er­ful AI sys­tems are ac­tu­ally built, you prob­a­bly can’t do it with­out the things you listed.

I’m claiming that as we get pow­er­ful AI sys­tems, we could figure out tech­niques that work with those AI sys­tems. They only ini­tially need to work for AI sys­tems that are around our level of in­tel­li­gence, and then we can im­prove our tech­niques in tan­dem with the AI sys­tems gain­ing in­tel­li­gence. In that set­ting, I’m rel­a­tively op­ti­mistic about things like “just train the AI to fol­low your in­struc­tions”; while this will break down in ex­otic cases or as the AI scales up, those cases are rare and hard to find.

• I’m not re­ally think­ing about prov­able guaran­tees per se. I’m just think­ing about how to point to the AI’s con­cept of hu­man val­ues—di­rectly point to it, not point to some proxy of it, be­cause prox­ies break down etc.

(Rough heuris­tic here: it is not pos­si­ble to point di­rectly at an ab­stract ob­ject in the ter­ri­tory. Even though a ter­ri­tory of­ten sup­ports cer­tain nat­u­ral ab­strac­tions, which are in­stru­men­tally con­ver­gent to learn/​use, we still can’t un­am­bigu­ously point to that ab­strac­tion in the ter­ri­tory—only in the map.)

A proxy is prob­a­bly good enough for a lot of ap­pli­ca­tions with lit­tle scale and few cor­ner cases. And if we’re do­ing some­thing like “train the AI to fol­low your in­struc­tions”, then a proxy is ex­actly what we’ll get. But if you want, say, an AI which “tries to help”—as op­posed to e.g. an AI which tries to look like it’s helping—then that means point­ing di­rectly to hu­man val­ues, not to a proxy.

Now, it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that’s what you have in mind, and I do think it’s plausible, even if it sounds a bit crazy. Of course, without better theoretical tools, we still wouldn’t have a way to directly check even in hindsight whether the AI actually wound up pointing to human values or not. (Again, not talking about provable guarantees here, I just want to be able to look at the AI’s own internal data structures and figure out (a) whether it has a notion of human values, and (b) whether it’s actually trying to act in accordance with them, or just something correlated with them.)

• it is pos­si­ble that we could train an AI against a proxy, and it would end up point­ing to ac­tual hu­man val­ues in­stead, sim­ply due to im­perfect op­ti­miza­tion dur­ing train­ing. I think that’s what you have in mind

Kind of, but not ex­actly.

I think that what­ever proxy is learned will not be a perfect poin­ter. I don’t know if there is such a thing as a “perfect poin­ter”, given that I don’t think there is a “right” an­swer to the ques­tion of what hu­man val­ues are, and con­se­quently I don’t think there is a “right” an­swer to what is helpful vs. not helpful.

I think the learned proxy will be a good enough poin­ter that the agent will not be ac­tively try­ing to kill us all, will let us cor­rect it, and will gen­er­ally do use­ful things. It seems likely that if the agent was mag­i­cally scaled up a lot, then bad things could hap­pen due to the er­rors in the poin­ter. But I’d hope that as the agent scales up, we im­prove and cor­rect the poin­ter (where “we” doesn’t have to be just hu­mans; it could also in­clude other AI as­sis­tants).

• It seems that the in­ter­vie­wees here ei­ther:

1. Use “AI risk” in a nar­rower way than I do.

2. Ne­glected to con­sider some sources/​forms of AI risk (see above link).

3. Have con­sid­ered other sources/​forms of AI risk but do not find them worth ad­dress­ing.

4. Are wor­ried about other sources/​forms of AI risk but they weren’t brought up dur­ing the in­ter­views.

Can you talk about which of these is the case for your­self (Ro­hin) and for any­one else whose think­ing you’re fa­mil­iar with? (Or if any of the other in­ter­vie­wees would like to chime in for them­selves?)

• For con­text, here’s the one time in the in­ter­view I men­tion “AI risk” (quot­ing 2 ear­lier para­graphs for con­text):

Paul Chris­ti­ano: I don’t know, the fu­ture is 10% worse than it would oth­er­wise be in ex­pec­ta­tion by virtue of our failure to al­ign AI. I made up 10%, it’s kind of a ran­dom num­ber. I don’t know, it’s less than 50%. It’s more than 10% con­di­tioned on AI soon I think.
[...]
Asya Ber­gal: I think my im­pres­sion is that that 10% is lower than some large set of peo­ple. I don’t know if other peo­ple agree with that.
Paul Chris­ti­ano: Cer­tainly, 10% is lower than lots of peo­ple who care about AI risk. I mean it’s worth say­ing, that I have this slightly nar­row con­cep­tion of what is the al­ign­ment prob­lem. I’m not in­clud­ing all AI risk in the 10%. I’m not in­clud­ing in some sense most of the things peo­ple nor­mally worry about and just in­clud­ing the like ‘we tried to build an AI that was do­ing what we want but then it wasn’t even try­ing to do what we want’. I think it’s lower now or even af­ter that caveat, than pes­simistic peo­ple. It’s go­ing to be lower than all the MIRI folks, it’s go­ing to be higher than al­most ev­ery­one in the world at large, es­pe­cially af­ter spe­cial­iz­ing in this prob­lem, which is a prob­lem al­most no one cares about, which is pre­cisely how a thou­sand full time peo­ple for 20 years can re­duce the whole risk by half or some­thing.

(But it’s still the case that when asked “Can you explain why it’s valuable to work on AI risk?” I responded by almost entirely talking about AI alignment, since that’s what I work on and the kind of work where I have a strong view about cost-effectiveness.)

• We dis­cussed this here for my in­ter­view; my an­swer is the same as it was then (ba­si­cally a com­bi­na­tion of 3 and 4). I don’t know about the other in­ter­vie­wees.

• I would guess that AI sys­tems will be­come more in­ter­pretable in the fu­ture, as they start us­ing the fea­tures /​ con­cepts /​ ab­strac­tions that hu­mans are us­ing.

This sort of reasoning seems to assume that abstraction space is 1-dimensional, so AI must use human concepts on the path from subhuman to superhuman. I disagree. Like most things that we don’t have strong reason to think are 1D, and which take many bits of info to describe, abstractions seem high-dimensional. So on the path from subhuman to superhuman, the AI must use abstractions that are as predictively useful as human abstractions. These will not be anything like human abstractions unless the system was designed from a detailed neurological model of humans. Any AI that humans can reason about using our inbuilt empathetic reasoning is basically a mind upload, or a mind that differs from humans less than humans differ from each other. This is not what ML will create. Human understanding of AI systems will have to be by abstract mathematical reasoning, the way we understand formal maths. Empathetic reasoning about human-level AI is just asking for anthropomorphism. Our 3 options are:

1) An AI we don’t understand

2) An AI we can rea­son about in terms of maths.

3) A vir­tual hu­man.

• While I might agree with the three op­tions at the bot­tom, I don’t agree with the rea­son­ing to get there.

Ab­strac­tions are pretty heav­ily de­ter­mined by the ter­ri­tory. Hu­mans didn’t look at the world and pick out “tree” as an ab­stract con­cept be­cause of a bunch of hu­man-spe­cific fac­tors. “Tree” is a re­cur­ring pat­tern on earth, and even aliens would no­tice that same cluster of things, as­sum­ing they paid at­ten­tion. Even on the em­pathic front, you don’t need a hu­man-like mind in or­der to no­tice the com­mon pat­terns of hu­man be­hav­ior (in hu­mans) which we call “anger” or “sad­ness”.

• Ab­strac­tions are pretty heav­ily de­ter­mined by the ter­ri­tory.

+1, that’s my re­sponse as well.

• Some ab­strac­tions are heav­ily de­ter­mined by the ter­ri­tory. The con­cept of trees is pretty heav­ily de­ter­mined by the ter­ri­tory. Whereas the con­cept of be­trayal is de­ter­mined by the way that hu­man minds func­tion, which is de­ter­mined by other peo­ple’s ab­strac­tions. So while it seems rea­son­ably likely to me that an AI “nat­u­rally thinks” in terms of the same low-level ab­strac­tions as hu­mans, it think­ing in terms of hu­man high-level ab­strac­tions seems much less likely, ab­sent some type of safety in­ter­ven­tion. Which is par­tic­u­larly im­por­tant be­cause most of the key hu­man val­ues are very high-level ab­strac­tions.

• My guess is that if you have to deal with hu­mans, as at least early AI sys­tems will have to do, then ab­strac­tions like “be­trayal” are heav­ily de­ter­mined.

I agree that if you don’t have to deal with hu­mans, then things like “be­trayal” may not arise; similarly if you don’t have to deal with Earth, then “trees” are not heav­ily de­ter­mined ab­strac­tions.

• Neu­ral nets have around hu­man perfor­mance on Ima­genet.

If abstraction was a feature of the territory, I would expect the failure cases to be similar to human failure cases. Looking at https://github.com/hendrycks/natural-adv-examples, this does not seem to be the case very strongly, but then again, some of them contain dark shiny stone being classified as a sea lion. The failures aren’t totally inhuman, the way they are with adversarial examples.

Hu­mans didn’t look at the world and pick out “tree” as an ab­stract con­cept be­cause of a bunch of hu­man-spe­cific fac­tors.

I am not saying that trees aren’t a cluster in thing space. What I am saying is that if there were many clusters in thing space that were as tight and predictively useful as “Tree”, but were not possible for humans to conceptualize, we wouldn’t know it. There are plenty of concepts that humans didn’t develop for most of human history, despite those concepts being predictively useful, until an odd genius came along or the concept was pinned down by massive experimental evidence. E.g. inclusive genetic fitness, entropy, etc.

Con­sider that evolu­tion op­ti­mized us in an en­vi­ron­ment that con­tained trees, and in which pre­dict­ing them was use­ful, so it would be more sur­pris­ing for there to be a con­cept that is use­ful in the an­ces­tral en­vi­ron­ment that we can’t un­der­stand, than a con­cept that we can’t un­der­stand in a non an­ces­tral do­main.

This looks like a map that is heav­ily de­ter­mined by the ter­ri­tory, but hu­man maps con­tain rivers and not ge­olog­i­cal rock for­ma­tions. There could be fea­tures that could be mapped that hu­mans don’t map.

If you be­lieve the post that

Even­tu­ally, suffi­ciently in­tel­li­gent AI sys­tems will prob­a­bly find even bet­ter con­cepts that are alien to us,

Then you can form an equally good, nonhuman concept by taking the better alien concept and adding random noise. Of course, an AI trained on text might share our concepts just because our concepts are the most predictively useful ways to predict our writing. I would also like to assign some probability to AI systems that don’t use anything recognizable as a concept. You might be able to say 90% of blue objects are egg shaped, 95% of cubes are red … 80% of furred objects that glow in the dark are flexible … without ever splitting objects into bleggs and rubes. Seen from this perspective, you have a density function over thingspace, and a sum of clusters might not be the best way to describe it. AIXI never talks about trees, it just simulates every quantum. Maybe there are fast algorithms that don’t even ascribe discrete concepts.

• Neu­ral nets have around hu­man perfor­mance on Ima­genet.

But those trained neu­ral nets are very sub­hu­man on other image un­der­stand­ing tasks.

Then you can form an equally good, non­hu­man con­cept by tak­ing the bet­ter alien con­cept and adding ran­dom noise.

I would ex­pect that the alien con­cepts are some­thing we haven’t figured out be­cause we don’t have enough data or com­pute or logic or some other re­source, and that con­straint will also ap­ply to the AI. If you take that con­cept and “add ran­dom noise” (which I don’t re­ally un­der­stand), it would pre­sum­ably still re­quire the same amount of re­sources, and so the AI still won’t find it.

For the rest of your comment, I agree that we can’t theoretically rule those scenarios out, but there’s no theoretical reason to rule them in either. So far the empirical evidence seems to me to be in favor of “abstractions are determined by the territory”, e.g. ImageNet neural nets seem to have human-interpretable low-level abstractions (edge detectors, curve detectors, color detectors), while having strange high-level abstractions; I claim that the strange high-level abstractions are bad and only work on ImageNet because they were specifically designed to do so and ImageNet is sufficiently narrow that you can get to good performance with bad abstractions.

• By adding random noise, I meant adding wiggles to the edge of the set in thingspace; for example, adding noise to “bird” might exclude “ostrich” and include “duck-billed platypus”.

I agree that the high-level ImageNet concepts are bad in this sense; however, are they just bad? If they were just bad and the limit to finding good concepts was data or some other resource, then we should expect small children and mentally impaired people to have similarly bad concepts. This would suggest a single gradient from better to worse. If however current neural networks used concepts substantially different from small children, and not just uniformly worse or uniformly better, that would show different sets of concepts at the same low level. This would be fairly strong evidence of multiple concepts at the smart human level.

I would also want to point out that a small fraction of the concepts being different would be enough to make alignment much harder. Even if there was a perfect scale, if 1/3 of the concepts are subhuman, 1/3 human level and 1/3 superhuman, it would be hard to understand the system. To get any safety, you need to get your system very close to human concepts. And you need to be confident that you have hit this target.

• From the tran­script with Paul Chris­ti­ano.

Plus, most things can’t de­stroy the ex­pected value of the fu­ture by 10%. You just can’t have that many things, oth­er­wise there’s not go­ing to be any value left in the end.

I don’t understand. Maybe it is just the case that there’s no value left after a large number of things that each reduce the expected value by 10%?

• Paul is im­plic­itly con­di­tion­ing his ac­tions on be­ing in a world where there’s a de­cent amount of ex­pected value left for his ac­tions to af­fect. This is tech­ni­cally part of a de­ci­sion pro­ce­dure, rather than a state­ment about epistemic cre­dences, but it’s con­fus­ing be­cause he frames it as an epistemic cre­dence.
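The arithmetic behind the quoted claim can be made concrete with a rough sketch. It assumes, purely for illustration, that each risk independently removes the same fraction of whatever expected value remains, so the losses compound multiplicatively; neither the transcript nor the comments commit to this model, and the function name `remaining_value` is hypothetical.

```python
def remaining_value(n_risks: int, loss_per_risk: float = 0.10) -> float:
    """Fraction of expected value left after n_risks losses.

    Assumption (not from the transcript): each risk independently removes
    `loss_per_risk` of whatever value is still left, so losses compound.
    """
    if n_risks < 0:
        raise ValueError("n_risks must be non-negative")
    if not 0.0 <= loss_per_risk <= 1.0:
        raise ValueError("loss_per_risk must be between 0 and 1")
    return (1.0 - loss_per_risk) ** n_risks


if __name__ == "__main__":
    # Show how quickly the remaining fraction shrinks as risks accumulate.
    for n in (1, 7, 22, 44):
        print(f"{n:>2} risks -> {remaining_value(n):.3f} of the value remains")
```

Under this assumption, about seven such risks already leave less than half of the value, and about twenty-two leave less than a tenth. That is consistent with both readings in the thread: either few things are that damaging, or there really is little value left.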

• The biggest dis­agree­ment be­tween me and more pes­simistic re­searchers is that I think grad­ual take­off is much more likely than dis­con­tin­u­ous take­off (and in fact, the first, third and fourth para­graphs above are quite weak if there’s a dis­con­tin­u­ous take­off).

It’s been ar­gued be­fore that Con­tin­u­ous is not the same as Slow by any nor­mal stan­dard, so the strat­egy of ‘deal­ing with things as they come up’, while more vi­able un­der a con­tin­u­ous sce­nario, will prob­a­bly not be suffi­cient.

It seems to me like you’re as­sum­ing longter­mists are very likely not re­quired at all in a case where progress is con­tin­u­ous. I take con­tin­u­ous to just mean that we’re in a world where there won’t be sud­den jumps in ca­pa­bil­ity, or ap­par­ently use­less sys­tems sud­denly cross­ing some thresh­old and be­com­ing su­per­in­tel­li­gent, not where progress is slow or easy to re­verse. We could still pick a com­pletely wrong ap­proach that makes al­ign­ment much more difficult and set our­selves on a likely path to­wards dis­aster, even if the fol­low­ing is true:

So far as I can tell, the best one-line sum­mary for why we should ex­pect a con­tin­u­ous and not a fast take­off comes from the in­ter­view Paul Chris­ti­ano gave on the 80k pod­cast: ‘I think if you op­ti­mize AI sys­tems for rea­son­ing, it ap­pears much, much ear­lier.’
So far as I can tell, Paul’s point is that absent specific reasons to think otherwise, the prima facie case is that any time we are trying hard to optimize for some criteria, we should expect the ‘many small changes that add up to one big effect’ situation.
Then he goes on to ar­gue that the spe­cific ar­gu­ments that AGI is a rare case where this isn’t true (like nu­clear weapons) are ei­ther wrong or aren’t strong enough to make dis­con­tin­u­ous progress plau­si­ble.

In a world where continuous but moderately fast takeoff is likely, I can easily imagine doom scenarios that would require long-term strategy or conceptual research early on to avoid, even if none of them involve FOOM. Imagine that the accepted standard for aligned AI is to follow some particular research agenda, like Cooperative Inverse Reinforcement Learning, but it turns out that CIRL starts to behave pathologically and tries to wirehead itself as it gets more and more capable, and that it’s a fairly deep flaw that we can only patch and not avoid.

Let’s say that over the course of a couple of years failures of CIRL systems start to appear and compound very rapidly until they constitute an existential disaster. Maybe people realize what’s going on, but by then it would be too late, because the right approach would have been to try some other approach to AI alignment but the research to do that doesn’t exist and can’t be done anywhere near fast enough. This is like Paul Christiano’s “What failure looks like”.

• In the situ­a­tions you de­scribe, I would still be some­what op­ti­mistic about co­or­di­na­tion. But yeah, such situ­a­tions lead­ing to doom seem plau­si­ble, and this is why the es­ti­mate is 90% in­stead of 95% or 99%. (Though note that the num­bers are very rough.)

• Nice to see that there are not just rad­i­cal po­si­tions in the AI safety crowd, and there is a drift away from alarmism and to­wards “let’s try var­i­ous ap­proaches, iter­ate and see what we can learn” in­stead of “we must figure out AI safety first, or else!” Also, Chris­ti­ano’s ap­proach “let’s at least en­sure we can build some­thing rea­son­ably safe for the near term”, since one way or an­other, some­thing will get built, has at least a chance of suc­cess.

My per­sonal guess, as some­one who knows noth­ing about ML and very lit­tle about AI safety, but a non-zero amount about re­search and de­vel­op­ment in gen­eral, is that the em­bed­ded agency prob­lems are way too deep to be satis­fac­to­rily re­solved be­fore ML gets the AI to the level of an av­er­age pro­gram­mer. But maybe MIRI, like the NSA, has a few tricks up its sleeve that are not visi­ble to the gen­eral pub­lic. Though this does not seem likely, oth­er­wise a lot of the re­cent dis­cus­sions of em­bed­ded agency would be smoke and mir­rors, not some­thing MIRI is likely to en­gage in.

• “There can’t be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left.”

This is the argument from consequences fallacy. There may be many things that could destroy the future with high probability, in which case we are simply doomed. But the more interesting scenario, and a much better working assumption, is that there are potentially dangerous things that are likely to destroy the future if we don’t seek to understand them and correct them by concerted effort, as opposed to continuing with the level of effort and concern we have now.