Modeling the capabilities of advanced AI systems as episodic reinforcement learning

Here I’ll summarize the main abstraction I use for thinking about future AI systems. This is essentially the same model that Paul uses. I’m not actually introducing any new ideas in this post; mostly it is intended to summarize my current views.


When thinking about highly advanced AI systems, it is useful to treat some AI capabilities as a kind of black box: we want a good understanding of where the optimization power made possible by future hardware and algorithms is going, so that we can reason about what the optimized artifacts look like without knowing the exact details of the optimization algorithm. I’m going to state a generalization over some of these capabilities, based on current ML practice (training policies to optimize training objectives). The single best model I currently know of for reasoning about highly capable AI systems is to assume that they have this general capability and no other capabilities.

The general capability can be stated as: trained policies will receive a high average within-episode training score, compared to alternative policies with similar resource bounds. I’ll call this capability “general episodic RL”.

To clarify “within-episode”: we could consider a single classification task (classify this picture) as a single episode in a supervised learning context, where the policy is the classifier; we could consider solving a single SAT problem to be an episode in a SAT-solving context, where the policy is a SAT solver; and of course we have episodes in an episodic RL context. So systems with this general capability are, among other things, good supervised learners, SAT solvers, and episodic RL agents.
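To make the “episode” abstraction concrete, here is a minimal sketch (the interfaces and names are my own illustration, not from any particular library) treating a classification query, a SAT instance, and an RL rollout as episodes that each yield a within-episode score; training would then optimize the average of such scores:

```python
# Illustrative sketch: three different tasks viewed through one "episode" lens.
# All names here are hypothetical; the point is only that each task yields a
# per-episode score that training optimizes on average.

def classification_episode(classifier, image, label):
    # One supervised-learning episode: classify a single picture.
    return 1.0 if classifier(image) == label else 0.0

def sat_episode(solver, clauses, n_vars):
    # One SAT-solving episode: score 1 if the returned assignment satisfies
    # every clause (literals are nonzero ints; negative means negated).
    assignment = solver(clauses, n_vars)  # dict: var -> bool

    def lit_true(lit):
        return assignment[abs(lit)] == (lit > 0)

    return 1.0 if all(any(lit_true(l) for l in c) for c in clauses) else 0.0

def rl_episode(policy, env_step, init_obs, horizon):
    # One episodic-RL episode: accumulate reward over a bounded horizon.
    obs, total = init_obs, 0.0
    for _ in range(horizon):
        obs, reward, done = env_step(obs, policy(obs))
        total += reward
        if done:
            break
    return total
```

The common shape is what matters: each call is self-contained and returns a score that depends only on what happens inside that episode.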


If advanced AI systems have this general capability (and no other capabilities), this implies that:

  1. We don’t have good performance guarantees when test episodes are distinguishable from training episodes (i.e. when some distinguisher can look at an episode and tell whether it is a training episode or a test episode).

  2. When the test episode is similar to training episodes (e.g. in an online learning context), we should expect trained policies to act like a rational agent maximizing its expected score in this test episode; otherwise, a policy that does act as a rational agent would get a higher expected test score than this one, and (since test episodes resemble training episodes) would therefore receive a higher training score.

  3. In particular, what it means for a policy to “maximize its average test score” is “there is no alternative policy getting a higher expected test score on the same sequence of test episodes.”

  4. (3) implies that there is no automatic incentive to influence future test episodes (e.g. to cause humans to give the AI easier-to-label images in the future).

  5. (3) also implies that collusion does not happen in most natural setups (e.g. the two policies in a generative adversarial setup will not collude). Instead, agents will play something like a correlated equilibrium.

  6. If there is some lowish-complexity fact that would help a policy gain a higher average training score, then we should assume that the trained policy will act as if it knows this fact. Otherwise, it would get a lower training score than an alternative policy that does know the fact and uses it to achieve a higher average training score. In particular, trained policies will know general features of the training setup.
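Point (4) can be illustrated with a toy numeric model (all numbers and names invented): one policy is myopic, while another gives up 1 point in the current episode so that its own later episodes are worth 2 extra points. Across a whole run the influencer accumulates more total score, but point (3) compares policies on the *same* sequence of episodes, and on every episode the influencer actually faced, the myopic policy would have scored strictly more:

```python
# Toy model (invented numbers): each episode a policy either takes the full
# score available ("myopic") or pays 1 point now to make its own future
# episodes worth 2 extra points each ("influencer").

def run(policy, n_episodes):
    base = 10.0     # score available in an un-influenced episode
    bonus = 0.0     # extra score this policy's past influence created
    scores = []
    episode_values = []  # the sequence of episodes this policy faced
    for _ in range(n_episodes):
        episode_values.append(base + bonus)
        if policy == "influencer":
            scores.append(base + bonus - 1.0)  # pay the influence cost
            bonus += 2.0
        else:
            scores.append(base + bonus)        # take everything on offer
    return scores, episode_values

def myopic_score_on(episode_values):
    # What a myopic policy would score on that same fixed episode sequence.
    return list(episode_values)
```

Because the comparison in (3) holds the episode sequence fixed, the bonus the influencer created does not count in its favor: it is dominated episode-by-episode, so per-episode training pressure never rewards influencing the future.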

Why might episodic RL capabilities be available?

To develop reliable general episodic RL systems, we would need:

  1. Methods for ensuring that trained policies notice when they may fail to generalize from a training context to a test context, such as KWIK learning and other methods for identifying inductive ambiguities.

  2. Better policy classes (e.g. neural network architectures) and theoretical analyses of them. For example, perhaps we could show that “adversarial” hypotheses are not problematic, since there will exist non-adversarial variants of adversarial hypotheses.

  3. Better optimization algorithms and theoretical analyses of them (which could prove e.g. non-collusion).
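As a toy version of (1): a KWIK-style (“knows what it knows”) learner over a finite hypothesis class predicts only when every hypothesis still consistent with the data agrees, and otherwise outputs “I don’t know”, flagging exactly the inputs on which generalization is ambiguous. The hypothesis class and inputs below are invented for illustration:

```python
# Minimal KWIK-style learner (illustrative sketch, not from any library):
# maintain the version space of hypotheses consistent with observations so
# far; predict only when all surviving hypotheses agree, else signal an
# inductive ambiguity instead of guessing.

IDK = "I don't know"

class KWIKLearner:
    def __init__(self, hypotheses):
        self.consistent = list(hypotheses)  # the version space

    def predict(self, x):
        outputs = {h(x) for h in self.consistent}
        return outputs.pop() if len(outputs) == 1 else IDK

    def observe(self, x, y):
        # Shrink the version space to hypotheses matching the observation.
        self.consistent = [h for h in self.consistent if h(x) == y]

# Invented hypothesis class: integer threshold functions x >= t.
hypotheses = [lambda x, t=t: x >= t for t in range(5)]
learner = KWIKLearner(hypotheses)
```

For example, `learner.predict(10)` returns `True` (every threshold agrees), while `learner.predict(2)` returns `IDK` until an observation at that point disambiguates the version space.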

The hope is that most of these technologies will be developed in the course of AI capabilities research. I find some aspects of these problems compelling as AI alignment research, especially those in (1), both because it isn’t guaranteed that AI capabilities researchers will automatically develop enough theory for understanding highly advanced episodic RL agents, and because it is useful to have concrete models of things like inductive ambiguity identification available ahead of time (so more AI alignment research can build on them).

Using general capabilities to reason about hypothetical aligned AGI designs

Ideally, a proposed aligned AGI design states: “if we had access to algorithms implementing general capabilities X, Y, and Z, then we could arrange these algorithms such that, if all algorithms actually deliver these capabilities, then the resulting system is pretty good for human values”. In practice, it is usually difficult to state a proposal this clearly, so initial research will be aimed at getting to the point where a proposal of this form can be stated.

If we’re assuming that future AI systems only have episodic RL capabilities, then the proposal will say something like: “if we arrange episodic RL systems in the right way, then if they play in an approximate correlated equilibrium (within each episode), then the system will do good things.” I think this approach allows both “unitary” systems and “decomposed” systems, and both “tools” and “agents”, while making everything precise enough that we don’t need to rely on the intuitive meanings of these words to reason about advanced AI systems.

What other capabilities could be available?

It’s worth thinking about additional general capabilities that highly advanced AI systems might have.

  1. We could suppose that systems are good at transfer learning: they can generalize well from a training context to a test context, without the test context being similar enough to the training context that we’d expect good training performance to automatically imply good test performance. This is clearly possible in some cases and impossible in others, but it’s not clear where the boundary is.

  2. We could suppose that systems are good at learning “natural” structure in data (e.g. clusters and factors) using unsupervised learning.

  3. We could suppose that systems will be good at pursuing goals defined using logic, even when there aren’t many training examples of correct logical inference to generalize from. Paul describes a version of this problem here. Additionally, much of MIRI’s work in logical uncertainty and decision theory is relevant to designing agents that pursue goals defined using logic.

  4. We could suppose that systems will be good at pursuing environmental goals (such as manufacturing paperclips). At the moment, we have very little idea of how one might specify such a goal, but it is at least imaginable that some future theory would allow us to specify the goal of manufacturing paperclips. I expect any theory for how to do this to rely on (2) or (3), but I’m not sure.

This list isn’t exhaustive; there are probably lots of other general capabilities we could assume.

Some useful AI alignment research may assume access to these alternative capabilities (for example, one may try to define conservative concepts using (2)), or attempt to build these capabilities out of other capabilities (for example, one may try to build capability (4) out of capabilities (1), (2), and (3)). This research is somewhat limited by the fact that we don’t have good formal models for studying hypothetical systems having these properties. For example, at this time it is difficult (though not impossible) to evaluate proposals for low-impact AI, since we don’t understand environmental goals, and we don’t know whether the AI will be able to find “natural” features of the world that include the features humans care about.

Unfortunately, there doesn’t appear to be a very good argument that we will probably have any of these capabilities in the future. It seems more likely than not to me that some general capability I listed will be highly relevant to AI alignment, but I’m highly uncertain.

What’s the alternative?

If someone is attempting to design aligned AGI systems, what could they do other than reduce the problem to general capabilities like the ones I have talked about? Here I list some things that don’t obviously fit into this model:

  1. We could test out AI systems and use the ones that seem to have better behavior. But I see this as a form of training: we’re creating policies and filtering out the ones we don’t like (i.e. giving policies a higher training score if it looks like they are doing good things, and finding policies that have a high training score). If we filter AI systems aggressively, then we’re basically just training things to maximize training score, and are back in an episodic RL context. If we don’t filter them aggressively, then this limits how much useful work the filtering can do (so we’d need another argument for why things will go well).

  2. We could assume that AI systems will be transparent. I don’t have a good model for what transparency for advanced AI systems would look like. At least some ways of getting transparency (e.g. anything that only uses the I/O behavior of the policy) reduce to episodic RL. Additionally, while it is easy for policies to be transparent if the policy class is relatively simple, complex policy classes are required to effectively make use of the capabilities of an advanced system. Regardless, transparency is likely to be important despite the current lack of formal models for understanding it.
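The claim in (1), that aggressive filtering just is training, can be sketched concretely (the setup below is entirely invented): generating candidate policies, scoring each on how good its behavior looks, and keeping the best is random-search optimization of a training score.

```python
import random

# Sketch of "test and filter" as training (invented setup): sample candidate
# policies, evaluate each with a "does its behavior look good?" score, and
# keep the best-looking one. Aggressive filtering is exactly optimization of
# the training score, so we are back in an episodic RL context.

def train_by_filtering(make_candidate, training_score, n_candidates, rng):
    candidates = [make_candidate(rng) for _ in range(n_candidates)]
    return max(candidates, key=training_score)

# Toy instantiation: "policies" are numbers in [0, 1]; the evaluator's
# approval is highest near an (invented) target of 0.7.
rng = random.Random(0)
policy = train_by_filtering(
    make_candidate=lambda r: r.uniform(0.0, 1.0),
    training_score=lambda p: -abs(p - 0.7),
    n_candidates=1000,
    rng=rng,
)
```

With enough candidates the surviving policy lands very near whatever the filter rewards, which is the sense in which heavy filtering behaves like maximizing a training score.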

Perhaps there are more ideas like these. Overall, I am skeptical when the basic argument for why a system ought to behave as intended relies on assumptions beyond the fact that the system has access to suitable capabilities. Once we have a basic argument for why things should probably mostly work, we can elaborate on it by e.g. arguing that transparency will catch some violations of the assumptions the basic argument requires.