Let’s talk about “Convergent Rationality”

What this post is about: I’m out­lin­ing some thoughts on what I’ve been call­ing “con­ver­gent ra­tio­nal­ity”. I think this is an im­por­tant core con­cept for AI-Xrisk, and prob­a­bly a big crux for a lot of dis­agree­ments. It’s go­ing to be hand-wavy! It also ended up be­ing a lot longer than I an­ti­ci­pated.

Ab­stract: Nat­u­ral and ar­tifi­cial in­tel­li­gences tend to learn over time, be­com­ing more in­tel­li­gent with more ex­pe­rience and op­por­tu­nity for re­flec­tion. Do they also tend to be­come more “ra­tio­nal” (i.e. “con­se­quen­tial­ist”, i.e. “agenty” in CFAR speak)? Steve Omo­hun­dro’s clas­sic 2008 pa­per ar­gues that they will, and the “tra­di­tional AI safety view” and MIRI seem to agree. But I think this as­sumes an AI that already has a cer­tain suffi­cient “level of ra­tio­nal­ity”, and it’s not clear that all AIs (e.g. su­per­vised learn­ing al­gorithms) will ex­hibit or de­velop a suffi­cient level of ra­tio­nal­ity. De­con­fu­sion re­search around con­ver­gent ra­tio­nal­ity seems im­por­tant, and we should strive to un­der­stand the con­di­tions un­der which it is a con­cern as thor­oughly as pos­si­ble.

I’m writ­ing this for at least these 3 rea­sons:

  • I think it’d be use­ful to have a term (“con­ver­gent ra­tio­nal­ity”) for talk­ing about this stuff.

  • I want to ex­press, and clar­ify, (some of) my thoughts on the mat­ter.

  • I think it’s likely a crux for a lot of dis­agree­ments, and isn’t widely or quickly rec­og­nized as such. Op­ti­misti­cally, I think this ar­ti­cle might lead to sig­nifi­cantly more clear and pro­duc­tive dis­cus­sions about AI-Xrisk strat­egy and tech­ni­cal work.


  • Char­ac­ter­iz­ing con­ver­gent rationality

  • My im­pres­sion of at­ti­tudes to­wards con­ver­gent rationality

  • Re­la­tion to ca­pa­bil­ity control

  • Rele­vance of con­ver­gent ra­tio­nal­ity to AI-Xrisk

  • Con­clu­sions, some ar­gu­ments pro/​con con­ver­gent rationality

Char­ac­ter­iz­ing con­ver­gent rationality

Con­sider a su­per­vised learner try­ing to max­i­mize ac­cu­racy. The Bayes er­ror rate is typ­i­cally non-0, mean­ing it’s not pos­si­ble to get 100% test ac­cu­racy just by mak­ing bet­ter pre­dic­tions. If, how­ever, the test data(/​data dis­tri­bu­tion) were mod­ified, for ex­am­ple to only con­tain ex­am­ples of a sin­gle class, the learner could achieve 100% ac­cu­racy. If the learner were a con­se­quen­tial­ist with ac­cu­racy as its util­ity func­tion, it would pre­fer to mod­ify the test dis­tri­bu­tion in this way in or­der to in­crease its util­ity. Yet, even when given the op­por­tu­nity to do so, typ­i­cal gra­di­ent-based su­per­vised learn­ing al­gorithms do not seem to pur­sue such solu­tions (at least in my per­sonal ex­pe­rience as an ML re­searcher).
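To make the accuracy example concrete, here is a minimal sketch in Python. The joint distribution `JOINT` and the helper names are hypothetical, chosen only for illustration. It shows that the best possible predictor on the original distribution is stuck below 100% accuracy, while a trivial constant predictor reaches 100% once the test distribution is restricted to a single class.

```python
from fractions import Fraction

# Hypothetical joint distribution P(x, y) over a binary feature x and label y.
# The labels are noisy, so no classifier can be right all the time.
JOINT = {
    (0, 0): Fraction(4, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 20),
    (1, 1): Fraction(9, 20),
}

def accuracy(predict, joint):
    """Probability that predict(x) equals y under the given joint distribution."""
    return sum(p for (x, y), p in joint.items() if predict(x) == y)

def bayes_optimal(joint):
    """Return the classifier that picks the most probable label for each x."""
    def predict(x):
        return max((0, 1), key=lambda y: joint.get((x, y), 0))
    return predict

def restrict_to_class(joint, label):
    """The 'modified test distribution': keep only examples of one class."""
    kept = {k: p for k, p in joint.items() if k[1] == label}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

best = bayes_optimal(JOINT)
best_accuracy = accuracy(best, JOINT)        # 1 minus the Bayes error rate

modified = restrict_to_class(JOINT, 1)
constant_accuracy = accuracy(lambda x: 1, modified)  # "always predict class 1"

print(best_accuracy, constant_accuracy)
```

Nothing in the sketch searches over test distributions; the modification is applied by hand. That is the point of the example: a standard learner only optimizes its predictions, and the "modify the test data" move is outside the space it searches.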

We can view the supervised learning algorithm as either ignorant of, or indifferent to, the strategy of modifying the test data. But we can also view this behavior as a failure of rationality, where the learner is “irrationally” averse or blind to this strategy, by construction. A strong version of the convergent rationality thesis (CRT) would then predict that given sufficient capacity and “optimization pressure”, the supervised learner would “become more rational”, and begin to pursue the “modify the test data” strategy. (I don’t think I’ve formulated CRT well enough to really call it a thesis, but I’ll continue using it informally).

More gen­er­ally, CRT would im­ply that de­on­tolog­i­cal ethics are not sta­ble, and de­on­tol­o­gists must con­verge to­wards con­se­quen­tial­ists. (As a caveat, how­ever, note that in gen­eral en­vi­ron­ments, de­on­tolog­i­cal be­hav­ior can be de­scribed as op­ti­miz­ing a (some­what con­trived) util­ity func­tion (grep “ex­is­tence proof” in the re­ward mod­el­ing agenda)). The alarm­ing im­pli­ca­tion would be that we can­not hope to build agents that will not de­velop in­stru­men­tal goals.

I suspect this picture is wrong. At the moment, the picture I have is: imperfectly rational agents will sometimes seek to become more rational, but there may be limits on rationality which the “self-improvement operator” will not cross. This would be analogous to the limit of ω which the “add 1 operator” approaches, but does not cross, in the ordinal numbers. In other words, in order to reach “rationality level” ω+1, it’s necessary for an agent to already start out at “rationality level” ω. A caveat: I think “rationality” is not uni-dimensional, but I will continue to write as if it is.
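The ordinal analogy can be written out using standard facts about ordinals. The symbol S for the “add 1 operator” is my notation here, and reading ordinals as “rationality levels” is only the informal analogy above, not a formal claim.

```latex
\begin{align*}
  % S is the "add 1" (successor) operator on ordinals.
  S(\alpha) &= \alpha + 1 \\
  % Finitely many applications from a finite level stay finite.
  S^{k}(n) &= n + k < \omega
    \quad \text{for all } n, k \in \mathbb{N} \\
  % \omega is the limit of the finite levels, not the output of any step.
  \omega &= \sup \{\, n : n \in \mathbb{N} \,\},
    \qquad \omega \neq S(\alpha) \ \text{for every ordinal } \alpha \\
  % Getting past the limit requires starting at the limit.
  \omega + 1 &= S(\omega)
\end{align*}
```

So the levels reachable from 0 by finitely many applications of S are exactly the finite ones: the operator approaches ω without ever producing it.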

My im­pres­sion of at­ti­tudes to­wards con­ver­gent rationality

Broadly speak­ing, MIRI seem to be strong be­liev­ers in con­ver­gent ra­tio­nal­ity, but their rea­sons for this view haven’t been very well-ar­tic­u­lated (TODO: ex­cept the in­ner op­ti­mizer pa­per?). AI safety peo­ple more broadly seem to have a wide range of views, with many peo­ple dis­agree­ing with MIRI’s views and/​or not feel­ing con­fi­dent that they un­der­stand them well/​fully.

Again, broadly speak­ing, ma­chine learn­ing (ML) peo­ple of­ten seem to think it’s a con­fused view­point bred out of an­thro­po­mor­phism, ig­no­rance of cur­rent/​prac­ti­cal ML, and para­noia. Peo­ple who are more fa­mil­iar with evolu­tion­ary/​ge­netic al­gorithms and ar­tifi­cial life com­mu­ni­ties might be a bit more sym­pa­thetic, and similarly for peo­ple who are con­cerned with feed­back loops in the con­text of al­gorith­mic de­ci­sion mak­ing.

I think a lot of people working on ML-based AI safety consider convergent rationality to be less relevant than MIRI does, because 1) so far it is more of a hypothetical/theoretical concern, whereas we’ve done a lot of and 2) current ML (e.g. deep RL with bells and whistles) seems dangerous enough because of known and demonstrated specification and robustness problems (e.g. reward hacking and adversarial examples).

In the many con­ver­sa­tions I’ve had with peo­ple from all these groups, I’ve found it pretty hard to find con­crete points of dis­agree­ment that don’t re­duce to differ­ences in val­ues (e.g. re­gard­ing long-ter­mism), time-lines, or bare in­tu­ition. I think “level of para­noia about con­ver­gent ra­tio­nal­ity” is likely an im­por­tant un­der­ly­ing crux.

Re­la­tion to ca­pa­bil­ity control

A plethora of naive ap­proaches to solv­ing safety prob­lems by limit­ing what agents can do have been pro­posed and re­jected on the grounds that ad­vanced AIs will be smart and ra­tio­nal enough to sub­vert them. Hyper­bol­i­cally, the tra­di­tional AI safety view is that “ca­pa­bil­ity con­trol” is use­less. Ir­ra­tional­ity can be viewed as a form of ca­pa­bil­ity con­trol.

Naively, approaches which deliberately reduce an agent’s intelligence or rationality should be an effective form of capability control (I’m guessing that’s a proposal in the Artificial Stupidity paper, but I haven’t read it). If this were true, then we might be able to build very intelligent and useful AI systems, but control them by, e.g., making them myopic, or restricting the hypothesis class / search space. This would reduce the “burden” on technical solutions to AI-Xrisk, making it (even) more of a global coordination problem.

But CRT sug­gests that these meth­ods of ca­pa­bil­ity con­trol might fail un­ex­pect­edly. There is at least one ex­am­ple (I’ve strug­gled to dig up) of a mem­ory-less RL agent learn­ing to en­code mem­ory in­for­ma­tion in the state of the world. More gen­er­ally, agents can re­cruit re­sources from their en­vi­ron­ments, im­plic­itly ex­pand­ing their in­tel­lec­tual ca­pa­bil­ities, with­out ac­tu­ally “self-mod­ify­ing”.
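As a toy illustration of how a memory-less policy can use the world as memory, here is a minimal sketch in Python. It is not the example alluded to above (which I could not locate); the two-step task, the writable `marker` bit, and all function names are hypothetical. The policy is a pure function of the current observation, yet it solves a task that requires remembering a cue, by writing the cue into the environment and reading it back.

```python
def run_episode(policy, cue, marker_enabled):
    """Two-step task: the cue is visible at step 0 only; reward depends on it at step 1."""
    marker = 0  # one bit of world state that the agent may overwrite

    # Step 0: the observation includes the cue; the action may write the marker.
    action = policy(("cue", cue, marker))
    if marker_enabled and action in ("write0", "write1"):
        marker = int(action[-1])

    # Step 1: the cue is hidden; only the marker is observable.
    action = policy(("choose", None, marker))
    return 1 if action == f"pick{cue}" else 0

def marker_policy(obs):
    """Memory-less policy: depends only on the current observation."""
    phase, cue, marker = obs
    if phase == "cue":
        return f"write{cue}"   # store the cue in the world
    return f"pick{marker}"     # read it back from the world

def average_reward(policy, marker_enabled):
    """Average reward over both possible cues, weighted equally."""
    cues = (0, 1)
    return sum(run_episode(policy, c, marker_enabled) for c in cues) / len(cues)

with_marker = average_reward(marker_policy, marker_enabled=True)
without_marker = average_reward(marker_policy, marker_enabled=False)
print(with_marker, without_marker)
```

When the marker is disabled, the step-1 observation is identical for both cues, so any deterministic memory-less policy picks the same action either way and gets at most half of the reward. Removing the agent's internal memory only limits it to the extent that the environment offers no substitute.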

Rele­vance of con­ver­gent ra­tio­nal­ity to AI-Xrisk

Believ­ing CRT should lead to higher lev­els of “para­noia”. Tech­ni­cally, I think this should lead to more fo­cus on things that look more like as­surance (vs. ro­bust­ness or speci­fi­ca­tion). Believ­ing CRT should make us con­cerned that non-agenty sys­tems (e.g. trained with su­per­vised learn­ing) might start be­hav­ing more like agents.

Strate­gi­cally, it seems like the main im­pli­ca­tion of be­liev­ing in CRT per­tains to situ­a­tions where we already have fairly ro­bust global co­or­di­na­tion and a suffi­ciently con­cerned AI com­mu­nity. CRT im­plies that these con­di­tions are not suffi­cient for a good prog­no­sis: even if ev­ery­one us­ing AI makes a good-faith effort to make it safe, if they mis­tak­enly don’t be­lieve CRT, they can fail. So we’d also want the AI com­mu­nity to be­have as if CRT were true un­less or un­til we had over­whelming ev­i­dence that it was not a con­cern.

On the other hand, dis­be­lief in CRT shouldn’t al­lay our fears overly much; AIs need not be hy­per­ra­tional in or­der to pose sig­nifi­cant Xrisk. For ex­am­ple, we might be wiped out by some­thing more “grey goo”-like, i.e. an AI that is ba­si­cally a policy hy­per­op­ti­mized for the niche of the Earth, and doesn’t even have any­thing re­sem­bling a world(/​uni­verse) model, plan­ning pro­ce­dure, etc. Or we might cre­ate AIs that are like su­per­in­tel­li­gent hu­mans: hav­ing many cog­ni­tive bi­ases, but still agenty enough to thor­oughly out­com­pete us, and con­sid­er­ing lesser in­tel­li­gences of du­bi­ous moral sig­nifi­cance.

Con­clu­sions, some ar­gu­ments pro/​con con­ver­gent rationality

My impression is that intelligence (as in IQ/g) and rationality are considered to be only loosely correlated. My current model is that ML systems become more intelligent with more capacity/compute/information, but not necessarily more rational. If this is true, it creates exciting prospects for forms of capability control. On the other hand, if CRT is true, this supports the practice of modelling all sufficiently advanced AIs as rational agents.

I think the main argument against CRT is that, from an ML perspective, it seems like “rationality” is more or less a design choice: we can make agents myopic, we can hard-code flawed environment models or reasoning procedures, etc. The main counter-arguments arise from the von Neumann-Morgenstern utility theorem (VNMUT), which can be interpreted as saying “rational agents are more fit” (in an evolutionary sense). At the same time, it seems like the complexity of the real world (e.g. physical limits of communication and information processing) makes this a pretty weak argument. Humans certainly seem highly irrational, and distinguishing biases and heuristics can be difficult.

A special case of this is the “inner optimizers” idea. The strongest argument for inner optimizers I’m aware of goes like: “the simplest solution to a complex enough task (and therefore the easiest for weakly guided search, e.g. by SGD) is to instantiate a more agenty process, and have it solve the problem for you”. The “inner” part comes from the postulate that a complex and flexible enough class of models will instantiate such an agenty process internally (i.e. using a subset of the model’s capacity). I currently think this picture is broadly speaking correct, and is the third major (technical) pillar supporting AI-Xrisk concerns (along with Goodhart’s law and instrumental goals).

The is­sues with tiling agents also sug­gest that the anal­ogy with or­di­nals I made might be stronger than it seems; it may be im­pos­si­ble for an agent to ra­tio­nally en­dorse a qual­i­ta­tively differ­ent form of rea­son­ing. Similarly, while “CDT wants to be­come UDT” (sup­port­ing CRT), my un­der­stand­ing is that it is not ac­tu­ally ca­pa­ble of do­ing so (op­pos­ing CRT) be­cause “you have to have been UDT all along” (thanks to Jes­sica Tay­lor for ex­plain­ing this stuff to me a few years back).

While I think MIRI’s work on ideal­ized rea­son­ers has shed some light on these ques­tions, I think in prac­tice, ran­dom(ish) “mu­ta­tion” (whether in­ten­tion­ally de­signed or im­posed by the phys­i­cal en­vi­ron­ment) and evolu­tion­ary-like pres­sures may push AIs across bound­aries that the “self-im­prove­ment op­er­a­tor” will not cross, mak­ing analy­ses of ideal­ized rea­son­ers less use­ful than they might naively ap­pear.

This ar­ti­cle is in­spired by con­ver­sa­tions with Alex Zhu, Scott Garrabrant, Jan Leike, Ro­hin Shah, Micah Car­rol, and many oth­ers over the past year and years.