[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Welcome to another special edition of the newsletter! In this edition, I summarize four conversations that AI Impacts had with researchers who were optimistic that AI safety would be solved “by default”. (Note that one of the conversations was with me.)

While all four of these conversations covered very different topics, I think there were three main points of convergence. First, we were relatively unconvinced by the traditional arguments for AI risk, and found discontinuities relatively unlikely. Second, we were more optimistic about solving the problem in the future, when we know more about the problem and have more evidence about powerful AI systems. And finally, we were more optimistic that as we get more evidence of the problem in the future, the existing ML community will actually try to fix it.

Conversation with Paul Christiano (Paul Christiano, Asya Bergal, Ronny Fernandez, and Robert Long) (summarized by Rohin): There can’t be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left (ETA: see this comment). So, the prior that any particular thing has such an impact should be quite low. With AI in particular, obviously we’re going to try to make AI systems that do what we want them to do. So starting from this position of optimism, we can then evaluate the arguments for doom. The two main arguments: first, we can’t distinguish ahead of time between AIs that are trying to do the right thing and AIs that are trying to kill us, because the latter will behave nicely until they can execute a treacherous turn. Second, since we don’t have a crisp concept of “doing the right thing”, we can’t select AI systems on whether they are doing the right thing.
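As a rough illustration of why the prior should be low (the specific numbers here are my own, not Paul’s): if there were $n$ independent factors, each of which destroyed 10% of the future’s expected value, the fraction of expected value remaining would be

$$0.9^{n}, \qquad \text{e.g.} \quad 0.9^{20} \approx 0.12, \qquad 0.9^{50} \approx 0.005,$$

so a world with dozens of such factors would have almost no expected value left, which is why the prior that any particular candidate is one of them should be small.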

However, there are many “saving throws”, or ways that the argument could break down, avoiding doom. Perhaps there’s no problem at all, or perhaps we can cope with it with a little bit of effort, or perhaps we can coordinate to not build AIs that destroy value. Paul assigns a decent amount of probability to each of these (and other) saving throws, and any one of them suffices to avoid doom. This leads Paul to estimate that AI risk reduces the expected value of the future by roughly 10%, a relatively optimistic number. Since the area is so neglected, concerted effort by longtermists could reduce this to 5%, making it still a very valuable area for impact. The main way he expects to change his mind is via evidence from more powerful AI systems, e.g. as we build more powerful AI systems, perhaps inner optimizer concerns will materialize and we’ll see examples where an AI system executes a non-catastrophic treacherous turn.

Paul also believes that clean algorithmic problems are usually either solved within 10 years or provably impossible, and that early failures to solve a problem don’t provide much evidence about its difficulty (unless they generate proofs of impossibility). So, the fact that we don’t know how to solve alignment now doesn’t provide very strong evidence that the problem is impossible. Even if the clean versions of the problem were impossible, that would suggest that the problem is much more messy, which requires more concerted effort to solve but also tends to reduce to a long list of relatively easy tasks. (In contrast, MIRI thinks that prosaic AGI alignment is probably impossible.)

Note that even finding out that the problem is impossible can help; it makes it more likely that we can all coordinate to not build dangerous AI systems, since no one wants to build an unaligned AI system. Paul thinks that right now the case for AI risk is not very compelling, and so people don’t care much about it, but if we could generate more compelling arguments, then they would take it more seriously. If instead you think that the case is already compelling (as MIRI does), then you would be correspondingly more pessimistic about others taking the arguments seriously and coordinating to avoid building unaligned AI.

One potential reason MIRI is more doomy is that they take a somewhat broader view of AI safety: in particular, in addition to building an AI that is trying to do what you want it to do, they would also like to ensure that when the AI builds successors, it does so well. In contrast, Paul simply wants to leave the next generation of AI systems in at least as good a situation as we find ourselves in now, since they will be both better informed and more intelligent than we are. MIRI has also previously defined aligned AI as one that produces good outcomes when run, which is a much broader conception of the problem than Paul’s. But probably the main disagreement between MIRI and ML researchers is that ML researchers expect that we’ll try a bunch of stuff and something will work out, whereas MIRI expects that the problem is really hard, such that trial and error will only get you solutions that appear to work.

Rohin’s opinion: A general theme here seems to be that MIRI feels like they have very strong arguments, while Paul thinks they are plausible arguments but not extremely strong evidence. Simply having a lot more uncertainty leads Paul to be much more optimistic. I agree with most of this.

However, I do disagree with the point about “clean” problems. I agree that clean algorithmic problems are usually solved within 10 years or are provably impossible, but it doesn’t seem to me like AI risk counts as a clean algorithmic problem: we don’t have a nice formal statement of the problem that doesn’t rely on intuitive concepts like “optimization”, “trying to do something”, etc. This suggests to me that AI risk is more “messy”, and so may require more time to solve.

Conversation with Rohin Shah (Rohin Shah, Asya Bergal, Robert Long, and Sara Haxhia) (summarized by Rohin): The main reason I am optimistic about AI safety is that we will see problems in advance, and we will solve them, because nobody wants to build unaligned AI. A likely crux is that I think the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn’t scale. I don’t know why there are different underlying intuitions here.

In addition, many of the classic arguments for AI safety involve a system that can be decomposed into an objective function and a world model, which I suspect will not be a good way to model future AI systems. In particular, current systems trained by RL look like a grab bag of heuristics that correlate well with obtaining high reward. I think that as AI systems become more powerful, the heuristics will become more and more general, but they still won’t decompose naturally into an objective function, a world model, and search. Furthermore, we can look at humans as an example: we don’t fully pursue convergent instrumental subgoals; for example, humans can be convinced to pursue different goals. This makes me more skeptical of traditional arguments.

I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. This should make alignment and oversight of these systems significantly easier. For significantly stronger systems, we should be delegating the problem to the AI systems that are 10x more intelligent than us. (This is very similar to the picture painted in Chris Olah’s views on AGI safety (AN #72), but that had not been published and I was not aware of Chris’s views at the time of this conversation.)

I’m also less worried about race dynamics increasing accident risk than the median researcher. The benefit of racing a little bit faster is to have a little bit more power / control over the future, while also increasing the risk of extinction a little bit. This seems like a bad trade from each agent’s perspective. (That is, the Nash equilibrium is for all agents to be cautious, because the potential upside of racing is small and the potential downside is large.) I’d be more worried if [AI risk is real AND not everyone agrees AI risk is real when we have powerful AI systems], or if the potential upside were larger (e.g. if racing a little more made it much more likely that you could achieve a decisive strategic advantage).
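As a toy illustration of this Nash-equilibrium claim (entirely my own sketch; the payoff numbers are made up and only encode “racing buys a small control premium but imposes a larger expected extinction cost on everyone”):

```python
import itertools

# Two labs each choose "cautious" or "race". Payoffs are illustrative assumptions:
# racing buys a small control premium (+0.5, taken from the other lab) but adds a
# larger expected extinction cost (-2.0 per racer) that falls on both labs.
payoffs = {
    ("cautious", "cautious"): (10.0, 10.0),
    ("cautious", "race"):     (9.5 - 2.0, 10.5 - 2.0),
    ("race",     "cautious"): (10.5 - 2.0, 9.5 - 2.0),
    ("race",     "race"):     (10.0 - 4.0, 10.0 - 4.0),
}
ACTIONS = ["cautious", "race"]

def is_nash(profile):
    """True if neither player can gain by unilaterally switching actions."""
    for player in (0, 1):
        current = payoffs[profile][player]
        for alternative in ACTIONS:
            deviation = list(profile)
            deviation[player] = alternative
            if payoffs[tuple(deviation)][player] > current:
                return False
    return True

for profile in itertools.product(ACTIONS, repeat=2):
    label = "<- Nash equilibrium" if is_nash(profile) else ""
    print(profile, payoffs[profile], label)
```

Under these assumed payoffs, only (cautious, cautious) survives the unilateral-deviation check, matching the argument above; the conclusion would flip if the control premium from racing were made large relative to the added extinction risk, which is the decisive-strategic-advantage caveat.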

Overall, it feels like there’s around 90% chance that AI would not cause x-risk without additional intervention by longtermists. The biggest disagreement between me and more pessimistic researchers is that I think gradual takeoff is much more likely than discontinuous takeoff (and in fact, the first, third and fourth paragraphs above are quite weak if there’s a discontinuous takeoff). If I condition on discontinuous takeoff, then I mostly get very confused about what the world looks like, but I also get a lot more worried about AI risk, especially because the “AI is to humans as humans are to ants” analogy starts looking more accurate. In the interview I said 70% chance of doom in this world, but with way more uncertainty than any of the other credences, because I’m really confused about what that world looks like. Two other disagreements, besides the ones above: I don’t buy Realism about rationality (AN #25), whereas I expect many pessimistic researchers do. I may also be more pessimistic about our ability to write proofs about fuzzy concepts like those that arise in alignment.

On timelines, I estimated a very rough 50% chance of AGI within 20 years, and a 30-40% chance that it would be using “essentially current techniques” (which is obnoxiously hard to define). Conditional on both of those, I estimated a 70% chance that it would be something like a mesa optimizer, mostly because optimization is a very useful instrumental strategy for solving many tasks, especially because gradient descent and other current algorithms are very weak optimization algorithms (relative to e.g. humans), and so learned optimization algorithms will be necessary to reach human levels of sample efficiency.
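For concreteness (my arithmetic, reading the 30-40% as conditional on the first estimate and taking its midpoint), the implied unconditional credence works out to roughly

$$0.5 \times 0.35 \times 0.7 \approx 0.12,$$

i.e. about a 12% chance of mesa-optimizer-like AGI built from essentially current techniques within 20 years.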

Rohin’s opinion: Looking over this again, I’m realizing that I didn’t emphasize enough that most of my optimism comes from the more outside-view considerations: that we’ll get warning signs that the ML community won’t ignore, and that the AI risk arguments are not watertight. The other parts are particular inside-view disagreements that make me more optimistic, but they don’t factor much into my optimism besides being examples of how the meta considerations could play out. I’d recommend this comment of mine to get more of a sense of how the meta considerations factor into my thinking.

I was also glad to see that I still broadly agree with things I said ~5 months ago (since no major new opposing evidence has come up since then), though as I mentioned above, I would now change what I place emphasis on.

Conversation with Robin Hanson (Robin Hanson, Asya Bergal, and Robert Long) (summarized by Rohin): The main theme of this conversation is that AI safety does not look particularly compelling on an outside view. Progress in most areas is relatively incremental and continuous; we should expect the same to be true for AI, suggesting that timelines should be quite long, on the order of centuries. The current AI boom looks similar to previous AI booms, which didn’t amount to much.

Timelines could be short if progress in AI were “lumpy”, as in a FOOM scenario. This could happen if intelligence were one simple thing that just has to be discovered, but Robin expects that intelligence is actually a bunch of not-very-general tools that together let us do many things, and we simply have to find all of these tools, which will presumably not be a lumpy process. Most of the value from tools comes from more specific, narrow tools, and intelligence should be similar. In addition, the literature on human uniqueness suggests that it wasn’t “raw intelligence” or small changes to brain architecture that made humans unique, but rather our ability to process culture (communicating via language, learning from others, etc.).

In any case, many researchers are now distancing themselves from the FOOM scenario, and are instead arguing that AI risk arises from standard principal-agent problems in the situation where the agent (AI) is much smarter than the principal (human). Robin thinks that this doesn’t agree with the existing literature on principal-agent problems, in which losses from such problems tend to be bounded, even when the agent is smarter than the principal.

You might think that since the stakes are so high, it’s worth working on it anyway. Robin agrees that it’s worth having a few people (say a hundred) pay attention to the problem, but doesn’t think it’s worth spending a lot of effort on it right now. Effort is much more effective and useful once the problem becomes clear, or once you are working with a concrete design; we have neither of these right now, and so we should expect that most effort ends up being ineffective. It would be better if we saved our resources for the future, or if we spent time thinking about other ways that the future could go (as in his book, Age of Em).

It’s especially bad that AI safety has thousands of “fans”, because this leads to a “crying wolf” effect: even if the researchers have subtle, nuanced beliefs, they cannot control the message that the fans convey, which will not be nuanced and will instead confidently predict doom. Then when doom doesn’t happen, people will learn not to believe arguments about AI risk.

Rohin’s opinion: Interestingly, I agree with almost all of this, even though it’s (kind of) arguing that I shouldn’t be doing AI safety research at all. The main place I disagree is with the claim that losses from principal-agent problems with perfectly rational agents are bounded: this seems crazy to me, and I’d be interested in specific paper recommendations (though note that I and others have searched and not found many).

On the point about lumpiness, my model is that there are only a few underlying factors (such as the ability to process culture) that allow humans to so quickly learn to do so many tasks, and almost all tasks require near-human levels of these factors to be done well. So, once AI capabilities on these factors reach approximately human level, we will “suddenly” start to see AIs beating humans on many tasks, resulting in a “lumpy” increase on the metric of “number of tasks on which AI is superhuman” (which seems to be the metric that people often use, though I don’t like it, precisely because it seems like it wouldn’t measure progress well until AI becomes near-human-level).
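A minimal simulation of this model (my own sketch; the thresholds and numbers are arbitrary assumptions, chosen only so that most tasks require near-human levels of the underlying factor):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks = 1000
# Each task has a capability threshold on the underlying factor (human level = 1.0).
# Assume most thresholds cluster just below human level.
thresholds = 1.0 - np.abs(rng.normal(0.0, 0.05, size=n_tasks))

# Capability on the underlying factor improves smoothly, but the count of tasks
# where AI beats humans jumps sharply as capability approaches human level.
for capability in [0.5, 0.8, 0.9, 0.95, 1.0, 1.05]:
    n_superhuman = int((capability >= thresholds).sum())
    print(f"capability = {capability:.2f} -> superhuman on {n_superhuman}/{n_tasks} tasks")
```

Even though the underlying capability grows continuously, the “number of tasks on which AI is superhuman” stays near zero for a long time and then jumps, which is the sense in which that metric looks “lumpy” under this model.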

Conversation with Adam Gleave (Adam Gleave et al.) (summarized by Rohin): Adam finds the traditional arguments for AI risk unconvincing. First, it isn’t clear that we will build an AI system that is so capable that it can fight all of humanity from its initial position, where it doesn’t have any resources, legal protections, etc. While discontinuous progress in AI could cause this, Adam doesn’t see much reason to expect such discontinuous progress: it seems like AI is progressing by using more computation rather than by finding fundamental insights. Second, we don’t know how difficult AI safety will turn out to be; he gives a probability of ~10% that the problem is as hard as (a caricature of) MIRI suggests, where any design not based on mathematical principles will be unsafe. This is especially true because as we get closer to AGI we’ll have many more powerful AI techniques that we can leverage for safety. Third, Adam does expect that AI researchers will eventually solve safety problems; they don’t right now because it seems premature to work on those problems. Adam would be more worried if there were more arms race dynamics, or more empirical evidence or solid theoretical arguments in support of speculative concerns like inner optimizers. He would be less worried if AI researchers spontaneously started to work on the relevant problems (more than they already do).

Adam makes the case for AI safety work differently. At the highest level, it seems possible to build AGI, some organizations are trying very hard to build AGI, and if they succeed it would be transformative. That alone is enough to justify some effort into making sure such a technology is used well. Then, looking at the field itself, it seems like the field is not currently focused on doing good science and engineering to build safe, reliable systems, so there is an opportunity to have an impact by pushing on safety and reliability. Finally, there are several technical problems that we do need to solve before AGI, such as how we get information about what humans actually want.

Adam also thinks that it’s 40-50% likely that when we build AGI, a PhD thesis describing it would be understandable by researchers today without too much work, but ~50% that it’s something radically different. However, it’s only 10-20% likely that AGI comes only from small variations of current techniques (i.e. by vastly increasing data and compute). He would see this as more likely if we hit additional milestones by investing more compute and data (OpenAI Five was an example of such a milestone).

Rohin’s opinion: I broadly agree with all of this, with two main differences. First, I am less worried about some of the technical problems that Adam mentions, such as how to get information about what humans want, or how to improve the robustness of AI systems, and more concerned about the more traditional problem of how to create an AI system that is trying to do what you want. Second, I am more bullish on the creation of AGI using small variations on current techniques, but vastly increasing compute and data (I’d assign ~30%, while Adam assigns 10-20%).