Sources of intuitions and data on AGI

Much of the difficulty in making progress on AI safety comes from the lack of useful feedback loops. We do not have a superintelligent AI to run tests on, and by the time we do it will probably be too late. This means we have to resort to using proxies. In this post, I will highlight what I think are the four most useful proxies we have access to today: modern machine learning systems, humans, groups, and abstract models.

These proxies are not opposed to each other. We need all the data we can get, and the right strategy is to learn from all of them. However, each proxy also has its flaws and will break if used in the wrong way. It is important to pay attention to where each proxy works and where it fails.

Modern Machine Learning Systems

Pros: With ML systems, one can do experiments quickly to get fast feedback loops with lots of quantitative data. Also, AGI will probably be made using something similar to modern machine learning systems, so insights gained by thinking about ML systems today may be particularly easy to port over.

Cons: Modern ML systems are opaque, and it can be very difficult to understand how they are working. This may be a problem we need to overcome eventually anyway, but that doesn't make it any easier to get data from them today.

Modern ML systems are also not very smart, and it is reasonable to expect a qualitative difference between their behavior and the behavior of superintelligent systems. This may lead to a false sense of security, especially in solutions that depend implicitly or explicitly on the AI lacking some particular capability, or using a particular strategy.
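To make that failure mode concrete, here is a toy sketch. The reward function, the loophole, and all names are invented for illustration: a weak hill-climbing optimizer settles on the intended capped reward, while an exhaustive search, standing in for a more capable optimizer, finds a high-reward loophole the weak one never encounters.

```python
import itertools

# Toy reward over 12-bit action strings (everything here is illustrative).
# Intended behavior: reward grows with the number of 1s, capped at 8.
# Hidden loophole: one specific string gets a huge reward.
LOOPHOLE = (1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1)

def reward(bits):
    if bits == LOOPHOLE:
        return 1000          # the exploit a weak optimizer rarely finds
    return min(sum(bits), 8)

def hill_climb(n=12):
    """Weak optimizer: greedy single-bit flips from the all-zeros string."""
    x = [0] * n
    improved = True
    while improved:
        improved = False
        for i in range(n):
            y = x.copy()
            y[i] ^= 1
            if reward(tuple(y)) > reward(tuple(x)):
                x, improved = y, True
    return tuple(x)

def exhaustive(n=12):
    """Strong optimizer: searches all 2^n strings."""
    return max(itertools.product([0, 1], repeat=n), key=reward)

print(reward(hill_climb()))  # settles on the "intended" capped reward: 8
print(reward(exhaustive()))  # finds the loophole: 1000
```

A safety argument validated only against the hill climber would silently fail once the optimizer becomes strong enough to reach the loophole.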

Examples: Concrete Problems in AI Safety and Alignment for Advanced Machine Learning Systems both try to reason about AI safety by using modern machine learning as their primary source of intuitions. The safety teams at OpenAI and DeepMind try to use this as a primary source of data. (Although it would be very uncharitable to claim that they do not also draw from the other sources, especially abstract models.) AI Safety Gridworlds is a good example of developing intuitions about AI safety from looking at modern systems.

Humans

Pros: Humans are (at least on some axes) the most intelligent systems we have direct access to. We can use this to notice failures of other proxies, by noticing that we can do better at optimizing some function using clever tricks that those systems might also be able to use when scaled up. Also, humans are an intelligence that we can experience from the inside.

Cons: Humans are opaque. Even from our inside view, it is very difficult to understand how they work, and they are very hard to modify. They are also the most difficult proxy to talk about rigorously. There is also the failure mode of anthropomorphizing badly and attributing arbitrary properties of humans (especially human goals) to AGI.

Examples: Some have used the complexity of the human brain as a starting point for predicting how much hardware is necessary for AGI.
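As an illustration of that style of estimate, the arithmetic below multiplies commonly cited order-of-magnitude figures for the brain. All the numbers are rough and contested, so treat the result as a crude upper-end sketch rather than a measurement.

```python
# Back-of-envelope estimate. All figures are rough, commonly cited
# order-of-magnitude numbers, not precise measurements.
neurons = 1e11             # ~10^11 neurons in a human brain
synapses_per_neuron = 1e4  # ~10^3 to 10^4 synapses per neuron
firing_rate_hz = 1e2       # taking the high end of ~1 to 100 Hz

# Counting each synaptic event as one "operation" gives a crude
# upper-end figure for the brain's computational throughput.
ops_per_second = neurons * synapses_per_neuron * firing_rate_hz
print(f"~{ops_per_second:.0e} synaptic operations per second")
```

Different choices for "operation" (e.g. modeling synaptic learning rather than spikes) shift this figure by many orders of magnitude, which is exactly why such estimates are starting points rather than answers.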

Ought plans to try to run experiments on humans that will help form intuitions about how easy it is to decompose big questions into small pieces, which can provide evidence about whether or not Paul Christiano style amplification could work.

We can use humans to provide lower bounds for what capabilities are possible, as in the AI box experiment.

Groups

Pros: Groups of humans can often be more transparent than individual humans, since the individuals within the group communicate in a transparent way. This is perhaps the most alien agent-like thing we have direct access to, which could provide a check against anthropomorphism. Thinking about groups could also encourage reductionism: all agents are made up of smaller parts, and groups make this more explicit.

Cons: This is probably the worst of the four sources. Groups are often not very agentic, and end up shooting themselves in the foot. Also, if your model of AI looks a lot like organizing a group of hypotheses (e.g. Solomonoff induction), you are more likely to accidentally pass the buck and depend on an individual component being as powerful as the entire system.

Examples: Evolution is a particularly useful group to gain intuitions from. You can get an (upper-bound) estimate for how difficult AI is by looking at how many bits of optimization evolution needed.
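As a sketch of what such an estimate looks like, the toy calculation below uses placeholder figures: both the generation count and the one-bit-per-generation selection model are illustrative assumptions, not measured values.

```python
# Placeholder figures, not measured values. One common toy model says
# natural selection fixes on the order of one bit of information into
# the genome per generation.
generations = 1e9          # illustrative count of generations since early life
bits_per_generation = 1.0  # toy-model selection pressure

total_bits = generations * bits_per_generation
print(f"evolution applied on the order of {total_bits:.0e} bits of optimization")
```

Read as an upper bound: whatever search process produced human intelligence applied no more selection than this, so producing AGI cannot, in principle, require more.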

Evolution is also a very clear example and source of intuition for thinking about Optimization Daemons.

Some AI safety proposals look a lot like organizing a lot of smaller AIs together into one more powerful thing, like Paul Christiano's amplification and Eric Drexler's intelligence distillation.

Abstract Models

Pros: Abstract models can deal with systems that are much smarter than the other three sources. Instead of looking at the best optimizer we can actually run, we can think about the properties of the input that literally maximizes whatever thing we are optimizing. Abstract models also have more potential for rigor than the other sources, and conclusions that you get from thinking about abstract models often apply to many different concrete systems.

Cons: Abstract models can't be observed directly, so we have to depend on proofs that only tell us about some properties of the system. The insights we gain from abstract models might end up being very hard to port over into concrete systems. Abstract models often deal with upper bounds for capabilities, which may be unrealistic because they require more resources than our universe has access to.

Examples: AIXI is a useful specific abstract model of AGI. MIRI's Agent Foundations work is mostly trying to get data from thinking about abstract models. Concepts like the orthogonality thesis and convergent instrumental goals come from thinking about abstract models.