Predictors as Agents

[UPDATE: I have concluded that the argument in this post is wrong. In particular, consider a generative model. Say the generative model has a choice between two possible ‘fixed point’ predictions A and B, and currently assigns A 70% probability and B 30% probability. Then the target distribution it is trying to match is A 70% / B 30%, so it will just stay like that forever (or drift, likely increasing the proportion of A). This is true even if B is easier to obtain good predictions for: the model will shift from “70% crappy model of A / 30% good model of B” --> “70% slightly better model of A / 30% good model of B”. It won’t increase the fraction of B.

In general this means that the model should converge to a distribution of fixed points corresponding to the learning bias of the model: ‘simpler’ fixed points will have higher weight. This might end up looking kind of weird anyway, but it won’t perform optimization in the sense I described below.]
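A toy numerical sketch of the update’s argument (the setup, numbers, and learning rate are purely illustrative): because each fixed-point prediction brings itself about, the distribution the model is trained to match is exactly the distribution it currently predicts, so the expected gradient on the mixture weight is zero and the 70/30 split does not move toward the easier fixed point B.

```python
import numpy as np

# Toy version of the situation in the update (hypothetical, for illustration):
# the model assigns probability p to self-fulfilling fixed point A and 1 - p
# to fixed point B. Since the prediction causes the outcome, each training
# example is drawn from the model's own current mixture, so the expected
# cross-entropy gradient on p is zero.
rng = np.random.default_rng(0)
p, lr = 0.7, 0.01

for _ in range(2000):
    outcome_is_A = rng.random() < p          # the outcome follows the prediction itself
    target = 1.0 if outcome_is_A else 0.0
    grad = p - target                        # d(cross-entropy)/d(logit) for a Bernoulli model
    logit = np.log(p / (1 - p)) - lr * grad
    p = 1.0 / (1.0 + np.exp(-logit))

# p stays near 0.7 (zero-mean random-walk drift only); there is no systematic
# pull toward B, even if B were the "easier" fixed point to model.
print(f"mixture weight on A after training: {p:.3f}")
```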


In machine learning, we can make the distinction between predictive and agent-like systems. Predictive systems include classifiers or language models. Agent-like systems are the domain of reinforcement learning and include AlphaZero, OpenAI Five, etc. While predictive systems are passive, modelling a relationship between input and output, agent-like systems perform optimization and planning.

It’s well-known around here that agent-like systems trained to optimize a given objective can exhibit behavior unexpected by the agent’s creators. It is also known that systems trained to optimize one objective can end up optimizing for another, because optimization can spawn sub-agents with different objectives. Here I present another type of unexpected optimization: in realistic settings, systems that are trained purely for prediction can end up behaving like agents.

Here’s how it works. Say we are training a predictive model on video input. Our system is connected to a video camera in some rich environment, such as an AI lab. It receives inputs from this camera, and outputs a probability distribution over future inputs (using something like a VAE, for instance). We train it to minimize the divergence between its predictions and the actual future inputs, exponentially decaying the loss for inputs farther in the future.
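As a minimal sketch of what such an objective might look like (the discretization of frames into codes, the function name, and the decay rate are assumptions for illustration, not the post’s actual setup):

```python
import torch
import torch.nn.functional as F

def discounted_prediction_loss(pred_logits, future_frames, gamma=0.95):
    """Hypothetical training loss: per-step prediction error, exponentially
    down-weighted for inputs farther in the future.

    pred_logits   : (T, K) unnormalized log-probabilities over K discretized
                    frame codes, one row per future time step t = 1..T
    future_frames : (T,) indices of the frames actually observed
    gamma         : exponential decay applied to losses farther ahead
    """
    T = pred_logits.shape[0]
    # For one-hot targets, cross-entropy coincides with the KL divergence
    # from the prediction to the observed outcome.
    per_step = F.cross_entropy(pred_logits, future_frames, reduction="none")
    weights = gamma ** torch.arange(T, dtype=per_step.dtype)
    return (weights * per_step).sum() / weights.sum()
```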

Because the predictor is embedded in the environment, its predictions are not just predictions; they affect the future dynamics of the environment. For example, if the predictor is very powerful, the AI researchers could use it to predict how a given research direction or hiring decision will turn out, by conditioning the model on making that decision. Then their future actions will depend on what the model predicts.

If the AI system is powerful enough, it will learn this; its model of the environment will include its own predictions. For it to obtain an accurate prediction, it must output a fixed point: a prediction about future inputs which, when instantiated in the environment, causes that very prediction to come about. The theory of reflective oracles implies that such (randomized) fixed points must exist; if our model is powerful enough, it will be able to find them.
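To make “fixed point” concrete, here is a toy sketch (the environment dynamics and all names are invented for illustration): a published prediction feeds back into the environment, and a fixed point is a prediction that reproduces itself under that feedback.

```python
import numpy as np

def environment_response(published_prediction):
    """Toy environment: once a prediction over three outcomes is published,
    observers partially follow it, so the realized distribution is a blend of
    the prediction and some baseline behavior (all numbers are made up)."""
    baseline = np.array([0.5, 0.3, 0.2])
    return 0.6 * published_prediction + 0.4 * baseline

def find_fixed_point(n_outcomes=3, iters=200):
    """Search for p with environment_response(p) == p by repeated
    self-application (this toy map is a contraction, so it converges)."""
    p = np.full(n_outcomes, 1.0 / n_outcomes)
    for _ in range(iters):
        p = environment_response(p)
    return p

p_star = find_fixed_point()
print(p_star)                          # a self-fulfilling prediction...
print(environment_response(p_star))    # ...which the environment then reproduces
```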

The capacity for agency arises because, in a complex environment, there will be multiple possible fixed points. It’s quite likely that these fixed points will differ in how the predictor is scored, either due to inherent randomness, logical uncertainty, or computational intractability (predictors could be powerfully superhuman while still being logically uncertain and computationally limited). Then the predictor will output the fixed point on which it scores the best.

As a simple example, imagine a dispute between two coworkers, Alice and Bob, in the AI lab; each suspects the other of plotting against them. If Bob is paranoid, he could check the predictions of the AI system to see what Alice will do. If the AI system predicts that Alice will publicly denounce Bob, this could confirm his suspicions and cause him to spread rumors about Alice, leading her to publicly denounce him. Or, if the AI system predicts that Alice will support Bob on a key issue, Bob could conclude he was wrong all along and try to reconcile with Alice, leading her to support him on a key issue. The AI system will prefer to output whichever branch of the prediction is simpler to predict.
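In code, the selection pressure here is just an argmin over the fixed points the predictor believes are available: its expected log-score is highest for the branch whose future is least uncertain (the entropy numbers below are invented for illustration).

```python
# Hypothetical numbers: entropy (in nats) of the future conditioned on the
# predictor announcing each self-fulfilling branch. A model scored purely on
# predictive accuracy gets a better expected log-score on the lower-entropy
# branch, so that is the branch it is rewarded for outputting.
candidate_fixed_points = {
    "alice_denounces_bob": 0.4,   # rumor spiral: rigid, easy-to-predict future
    "alice_supports_bob": 1.7,    # reconciliation: looser, harder-to-predict future
}

best = min(candidate_fixed_points, key=candidate_fixed_points.get)
print("branch the predictor is rewarded for outputting:", best)
```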

A more extreme example. If the predictor is VERY superhuman, it could learn an adversarial fixed point for humans: a speech which, when humans hear it, causes them to repeat the speech, then commit suicide, or otherwise act in a very predictable manner. It’s not clear if such a speech exists; but more broadly, in a complex environment, the set of fixed points is probably very large. Optimizing over that set can produce extreme outcomes.

These same problems could arise within AI systems which use predictors as a component, like this system, which contains a predictive model optimized for predictive accuracy, and a policy network optimizing for some objective. The policy network’s decisions will depend on what the predictive model says is likely to happen, influencing what the predictive model ends up seeing. The predictor could then steer the overall system towards more predictable fixed points, even if those fixed points obtain less value on the objective the policy network is supposed to be optimizing for.
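A structural sketch of that feedback loop (hypothetical names and toy dynamics; this is not the specific architecture linked above): the world model’s loss contains only prediction error, yet its predictions determine which actions the policy takes and therefore which data the model is subsequently trained on.

```python
import numpy as np

rng = np.random.default_rng(0)
predicted_outcome = {0: 0.0, 1: 0.0}   # world model's predicted outcome per action

def environment(action):
    # toy environment: action 0 is high-value but noisy, action 1 low-value but deterministic
    return rng.normal(1.0, 2.0) if action == 0 else 0.5

def policy(preds):
    # the policy optimizes its own objective *given the model's predictions*
    return max(preds, key=preds.get)

for _ in range(2000):
    a = policy(predicted_outcome)            # the decision depends on the model...
    outcome = environment(a)
    err = outcome - predicted_outcome[a]
    predicted_outcome[a] += 0.05 * err       # ...and the purely predictive update only
                                             # ever sees the outcomes of chosen actions

print(predicted_outcome)
```

Nothing in this toy makes the model steer toward predictable outcomes; the point is only that the predictive loss and the policy’s objective are separate, while the predictions and the training data are coupled.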

To some extent, this seems to undermine the orthogonality thesis; an arbitrary predictor can’t just be plugged into an arbitrary objective and be counted on to optimize well. From the perspective of the system, it will be trying to optimize its objective as well as it can given its beliefs; but those ‘beliefs’ may themselves be optimized for something quite different. In self-referential systems, beliefs and decisions can’t be easily separated.