Defining Myopia

IID vs Myopia

In a comment to Partial Agency, Rohin summarized his understanding of the post. He used the iid assumption as a critical part of his story. Initially, I thought that this was a good description of what was going on; but I soon realized that iid isn’t myopia at all (and commented as such). This post expands on the thought.

My original post conflated episodic (which is basically ‘iid’) with myopic.

In an episodic setting, it makes sense to be myopic about anything beyond the current episode. There’s no benefit to cross-episode strategies, so, no need to learn them.

This is true at several levels (which I mention in the hopes of avoiding later confusion):

  • In designing an ML algorithm, if we are assuming an episodic structure, it makes sense to use a learning algorithm which is designed to be myopic.

  • A learning algorithm in an episodic setting has no incentive to find non-myopic solutions (even if it can).

However, it is also possible to consider myopia in the absence of episodic structure, and not just as a mistake. We might want an ML algorithm to learn myopic strategies, as is the case with predictive systems. (We don’t want them to learn to manipulate the data; and even though that failure mode is far-fetched for most modern systems, there’s no point setting up learning procedures which would incentivise it. Indeed, learning procedures seem to mostly encourage myopia, though the full situation is still unclear to me.)

These myopic strategies aren’t just “strategies which behave as if there were an episodic assumption”, either. For example, sequential prediction is myopic (the goal is to predict each next item accurately, not to get the most accuracy overall—if this is unclear, hopefully it will become clearer in the next section).

So, there’s a distinction between not remembering the past vs not looking ahead to the future. In episodic settings, the relevant parts of past and future are both limited to the duration of the episode. However, the two come apart in general. We can have/want myopic agents with memory; or, we can have/want memoryless agents which are not myopic. (The second seems somewhat more exotic.)

Game-Theoretic Myopia Definition

So far, I’ve used ‘myopia’ in more or less two ways: an inclusive notion which encompasses a big cluster of things, and also the specific thing of only optimizing each output to maximize the very next reward. Let’s call the more specific thing “absolute” myopia, and try to define the more general thing.

Myopia can’t be defined in terms of optimizing an objective in the usual sense—there isn’t one quantity being optimized. However, it seems like most things in my ‘myopia’ cluster can be described in terms of game theory.

Let’s put down some definitions:

Sequential decision scenario: An interactive environment which takes in actions and outputs rewards and observations. I’m not trying to deal with embeddedness issues; this is basically the AIXI setup. (I do think ‘reward’ is a very restrictive assumption about what kind of feedback the system gets, but talking about other alternatives seems like a distraction from the current post.)

(Generalized) objective: A generalized objective assigns, to each action $a_n$, a function $f_n: \mathbb{N} \to \mathbb{R}$. The quantity $f_n(i)$ is how much the nth decision is supposed to value the ith reward. Probably, we require the sum $\sum_i f_n(i)$ to exist.
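To keep the rest of the post grounded, here is a minimal Python sketch of these two definitions (a purely illustrative rendering of my own; the names `Environment`, `step`, and `value_for_decision` are not from any particular library): the environment maps each action to a reward and an observation, and a generalized objective is just a function of the pair (n, i).

```python
from typing import Callable, Protocol, Sequence, Tuple

Action = int
Observation = int
Reward = float

class Environment(Protocol):
    """AIXI-style sequential decision scenario: takes in an action,
    outputs a reward and an observation."""
    def step(self, action: Action) -> Tuple[Reward, Observation]: ...

# A generalized objective: f(n, i) says how much the nth decision is
# supposed to value the ith reward.  An ordinary objective is the
# special case where f(n, i) does not depend on n.
GeneralizedObjective = Callable[[int, int], float]

def value_for_decision(f: GeneralizedObjective, n: int, rewards: Sequence[float]) -> float:
    """The value decision n assigns to a (finite prefix of a) reward stream."""
    return sum(f(n, i) * r for i, r in enumerate(rewards))
```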

Some examples:

  • Absolute myopia. $f_n(i) = 1$ if $n = i$, and $0$ otherwise.

  • Back-scratching variant: $f_n(i) = 1$ if $i = n+1$, and 0 otherwise.

  • Episodic myopia. $f_n(i) = 1$ if $n$ and $i$ are within the same episode; 0 otherwise.

  • Hyperbolic discounting. $f_n(i) = \frac{1}{1 + k(i-n)}$ for $i \ge n$; 0 otherwise.

  • Dynamically consistent version of hyperbolic: $f_n(i) = \frac{1}{1 + ki}$.

  • Exponential discounting. $f_n(i) = c^{i-n}$, typically with $c < 1$.

  • ‘Self-defeating’ functions, such as $f_n(i) = 1$ for $n = i$, $-1$ for $n = i+1$, and 0 otherwise.

A generalized objective could be called ‘myopic’ if it is not dynamically consistent; ie, if there’s no way to write $f_n(i)$ as a function of $i$ alone, eliminating the dependence on $n$.
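As a concrete illustration (a sketch under my own arbitrary choices of constants and a finite checking window, not part of the formalism itself), here are a few of the objectives above written as weight functions, together with a brute-force test of the “function of $i$ alone” criterion:

```python
def absolute_myopia(n: int, i: int) -> float:
    """The nth decision only values the nth reward."""
    return 1.0 if i == n else 0.0

def hyperbolic(n: int, i: int, k: float = 1.0) -> float:
    """Hyperbolic discounting of present and future rewards (k is an arbitrary choice here)."""
    return 1.0 / (1.0 + k * (i - n)) if i >= n else 0.0

def self_defeating(n: int, i: int) -> float:
    """Each decision values its own reward and dis-values the previous one."""
    if i == n:
        return 1.0
    if i == n - 1:
        return -1.0
    return 0.0

def global_exponential(n: int, i: int, c: float = 0.9) -> float:
    """A dynamically consistent objective: the weight depends on i alone."""
    return c ** i

def is_dynamically_consistent(f, horizon: int = 20) -> bool:
    """Brute-force check over a finite window: do all decisions n agree on
    the weight assigned to each reward i?  (This is the literal 'function
    of i alone' criterion; it ignores rescalings of a decision's weights.)"""
    for i in range(horizon):
        if len({round(f(n, i), 12) for n in range(horizon)}) > 1:
            return False
    return True

assert not is_dynamically_consistent(absolute_myopia)   # myopic
assert not is_dynamically_consistent(hyperbolic)        # myopic
assert not is_dynamically_consistent(self_defeating)    # myopic
assert is_dynamically_consistent(global_exponential)    # a regular objective
```

Note that exponential discounting written as $c^{i-n}$ would fail this literal test even though it is equivalent, up to a positive rescaling of each decision's weights, to the global weighting $c^i$; a more careful check would quotient out such rescalings.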

This notion of myopia does not seem to include ‘directionality’ or ‘stop-gradients’ from my original post. In particular, if we try to model pure prediction, absolute myopia captures the idea that you aren’t supposed to have manipulative strategies which lie (throw out some reward for one instance in order to get more overall). However, it does not rule out manipulative strategies which select self-fulfilling prophecies strategically; those achieve high reward on instance $n$ by choice of the $n$th output, which is what a myopic agent is supposed to do.

There are also non-myopic objectives which we can’t represent here but might want to represent more generally: there isn’t a single well-defined objective corresponding to ‘maximizing average reward’ (the limit of exponential discounting as $c \to 1$).

Vanessa recently mentioned using game-theoretic models like this for the purpose of modeling inconsistent human values. I want to emphasize that (1) I don’t want to think of myopia as necessarily ‘wrong’; it seems like sometimes a myopic objective is a legitimate one, for the purpose of building a system which does something we want (such as make non-manipulative predictions). As such, (2) myopia is not just about bounded rationality.

I also don’t necessarily want to think of myopia as multi-agent, even when modeling it with multi-agent game theory like this. I’d rather think about learning one myopic policy, which makes the appropriate (non-)trade-offs based on $f_n$.

In order to think about a system behaving myopically, we need to use an equilibrium notion (such as Nash equilibria or correlated equilibria), not just maximization of the $f_n$. However, I’m not sure quite how I want to talk about this. We don’t want to think in terms of a big equilibrium between each decision-point $n$; I think of that as a selection-vs-control mistake, treating the sequential decision scenario as one big thing to be optimized. Or, putting it another way: the problem is that we have to learn; so we can’t talk about everything being in equilibrium from the beginning.

Perhaps we can say that there should be some $n$ such that each decision after that is in approximate equilibrium with each other, taking the decisions before $n$ as given.
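Here is one toy way to state that condition (everything here, including the reward-model interface and the value of eps, is an illustrative assumption of mine): give each decision point a finite action set, score each decision by weighting the resulting reward stream with its own $f_n$, and require every decision at or after $n_0$ to be an eps-best response when the other decisions are held fixed.

```python
from typing import Callable, List, Sequence

def payoff(f, n: int, rewards: Sequence[float]) -> float:
    """Value that decision n assigns to a reward stream under generalized objective f."""
    return sum(f(n, i) * r for i, r in enumerate(rewards))

def approx_equilibrium_after(
    f: Callable[[int, int], float],                        # generalized objective f(n, i)
    reward_model: Callable[[Sequence[int]], List[float]],  # joint actions -> reward stream
    profile: Sequence[int],                                # one chosen action per decision point
    actions: Sequence[int],                                # candidate actions for each point
    n0: int,
    eps: float = 1e-6,
) -> bool:
    """True if every decision point n >= n0 is an eps-best response to the
    others, judged by its own weights f(n, .); decisions before n0 are
    simply taken as given ('already learned')."""
    base_rewards = reward_model(list(profile))
    for n in range(n0, len(profile)):
        current = payoff(f, n, base_rewards)
        for a in actions:
            alt = list(profile)
            alt[n] = a
            if payoff(f, n, reward_model(alt)) > current + eps:
                return False   # decision n has a profitable unilateral deviation
    return True
```

The learning-theoretic question is whether a training procedure eventually produces action profiles satisfying this check; the check itself only describes the static target.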

(Aside—What we definitely don’t want (if we want to describe or engineer legitimately myopic behavior) is a framework where the different decision-points end up bargaining with each other (acausal trade, or mere causal trade), in order to take Pareto improvements and thus move toward full agency. IE, in order to keep our distinctions from falling apart, we can’t apply a decision theory which would cooperate in Prisoner’s Dilemma or similar things. This could present difficulties.)

Let’s move on to a different way of thinking about myopia, through the language of Pareto-optimality.

Pareto Definition

We can think of myopia as a refusal to take certain Pareto improvements. This fits well with the previous definition; if an agent takes all the Pareto improvements, then its behavior must be consistent with some global weights $f(i)$ which are not a function of $n$. However, not all myopic strategies in the Pareto sense have nice representations in terms of generalized objectives.
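A minimal illustration of “refusing a Pareto improvement”, with made-up numbers: here `payoffs_x[n]` stands for the value decision point $n$ assigns to overall strategy x (say, computed with its own $f_n$).

```python
from typing import Sequence

def pareto_improves(payoffs_a: Sequence[float], payoffs_b: Sequence[float]) -> bool:
    """B Pareto-improves on A if no decision point does worse under B and
    at least one does strictly better."""
    return (all(b >= a for a, b in zip(payoffs_a, payoffs_b))
            and any(b > a for a, b in zip(payoffs_a, payoffs_b)))

# Hypothetical numbers: strategy B gives decision point 2 more of what it
# wants without costing decision points 0 and 1 anything.
strategy_a = [1.0, 1.0, 0.0]
strategy_b = [1.0, 1.0, 0.5]
assert pareto_improves(strategy_a, strategy_b)
# An agent that sticks with A anyway is "myopic" in the Pareto sense.
```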

In particular: I mentioned that generalized objectives couldn’t rule out manipulation through selection of self-fulfilling prophecies; so, they only capture part of what seems implied by map/territory directionality. Thinking in terms of Pareto-failures, we can also talk about failing to reap the gains from selection of manipulative self-fulfilling prophecies.

However, thinking in these terms is not very satisfying. It allows a very broad notion of myopia, but has few other virtues. Generalized objectives let me talk about myopic agents trying to do a specific thing, even though the thing they’re trying to do isn’t a coherent objective. Defining myopia as failure to take certain Pareto improvements doesn’t give me any structure like that; a myopic agent is being defined in the negative, rather than described positively.

Here, as before, we also have the problem of defining things learning-theoretically. Speaking purely in terms of whether the agent takes certain Pareto improvements doesn’t really make sense, because it has to learn what situation it is in. We want to talk about learning processes, so we need to talk about learning to take the Pareto improvements, somehow.

(Bayesian learning can be described in terms of Pareto optimality directly, because using a prior over possible environments allows Pareto-optimal behavior in terms of those environments. However, working that way requires realizability, which isn’t realistic.)

Decision Theory

In the original partial agency post, I described full agency as an extreme (perhaps imaginary) limit of less and less myopia. Full agency is like Cartesian dualism, sitting fully outside the universe and optimizing.

Is full agency that difficult? From the generalized-objective formalism, one might think that ordinary RL with exponential discounting is sufficient.

The counterexamples to this are MIRI-esque decision problems, which create dynamic inconsistencies for otherwise non-myopic agents. (See this comment thread with Vanessa for more discussion of several of the points I’m about to make.)

To give a simple example, consider the version of Newcomb’s Problem where the predictor knows about as much about your behavior as you do. (The version where the predictor is nearly infallible is easily handled by RL-like learning; you need to specifically inject sophisticated CDT-like thinking to mess that one up.)

In order to have good learning-theoretic properties at all, we need to have epsilon exploration. But if we do, then we tend to learn to 2-box, because (it will seem) doing so is independent of the predictor’s predictions of us.

Now, it’s true that in a sequential setting, there will be some incentive to 1-box, not for the payoff today, but for the future; establishing a reputation of 1-boxing gets higher payoffs in iterated Newcomb in a straightforward (causal) way.

However, that’s not enough to entirely avoid dynamic inconsistency. For any discounting function, we need only assume that the instances of Newcomb’s problem are spaced out far enough over time so that 2-boxing in each individual case is appealing.
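To see why spacing defeats any fixed discount rate, here is a back-of-the-envelope sketch with the standard Newcomb payoffs and exponential discounting (the $999,000 “reputation cost” is a rough stand-in for being predicted to 2-box at the next instance): 2-boxing gains $1,000 immediately, while the reputational loss arrives `gap` steps later and is discounted by c^gap, so for any c < 1 a large enough gap makes the immediate gain win.

```python
import math

def two_boxing_wins(c: float, gap: int,
                    immediate_gain: float = 1_000.0,
                    future_loss: float = 999_000.0) -> bool:
    """Rough comparison for one Newcomb instance under exponential discounting:
    take the extra $1,000 now vs. lose (roughly) the $999,000 difference at the
    next instance, 'gap' steps later, via a damaged reputation."""
    return immediate_gain > (c ** gap) * future_loss

# For any discount factor c < 1 there is a gap that makes 2-boxing appealing:
c = 0.99
gap = math.ceil(math.log(1_000.0 / 999_000.0) / math.log(c)) + 1
assert two_boxing_wins(c, gap)        # spaced far enough apart: 2-boxing wins
assert not two_boxing_wins(c, 1)      # back-to-back instances: reputation dominates
```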

Now, one might argue that in this case, the agent is correctly respecting its generalized objective; it’s supposed to sacrifice future value for present value according to the discounting function. And that’s true, if we want myopic behavior. But it is dynamically inconsistent—the agent wishes to 2-box in each individual case, but with respect to future cases, would prefer to 1-box. It would happily bind its future actions given an opportunity to do so.

Like the issue with self-fulfilling prophecies, this creates a type of myopia which we can’t really talk about within the formalism of generalized objectives. Even with an apparently dynamically consistent discounting function, the agent is inconsistent. As mentioned earlier, we need generalized-objective systems to fail to coordinate with themselves; otherwise, their goals collapse into regular objectives. So this is a type of myopia which all generalized objectives possess.

As before, I’d really prefer to be able to talk about this with specific types of myopia (as with myopic generalized objectives), rather than just pointing to a dynamic inconsistency and classifying it as myopia.

(We might think of the fully non-myopic agent as the limit of less and less discounting, as Vanessa suggests. This has some problems of convergence, but perhaps that’s in line with non-myopia being an extreme ideal which doesn’t always make sense. Alternately, we might think of this as a problem of decision theory, arguing that we should be able to reap the advantages of 1-boxing despite our values temporally discounting. Or, there might be some other wilder generalization of objective functions which lets us represent the distinctions we care about.)

Mechanism Design Analogy

I’ll close this post with a sketchy conjecture.

Although I don’t want to think of generalized objectives as truly multi-agent in the one-‘agent’-per-decision sense, learning algorithms will typically have a space of possible hypotheses which are (in some sense) competing with each other. We can analogize that to many competing agents (keeping in mind that they may individually be ‘partial agents’, ie, we can’t necessarily model them as coherently pursuing a utility function).

For any particular type of myopia (whether or not we can capture it in terms of a generalized objective), we can ask the question: is it possible to design a training procedure which will learn that type of myopia?

(We can approach this question in different ways; asymptotic convergence, bounded-loss (which may give useful bounds at finite time), or ‘in-practice’ (which fully accounts for finite-time effects). As I’ve mentioned before, my thoughts on this are mostly asymptotic at the moment, that being the easier theoretical question.)

We can think of this question—the question of designing training procedures—as a mechanism-design question. Is it possible to set up a system of incentives which encourages a given kind of behavior?

Now, mechanism design is a field which is associated with negative results. It is often not possible to get everything you want. As such, a natural conjecture might be:

Conjecture: It is not possible to set up a learning system which gets you full agency in the sense of eventually learning to take all the Pareto improvements.

This conjecture is still quite vague, because I have not stated what it means to ‘learn to take all the Pareto improvements’. Additionally, I don’t really want to assume the AIXI-like setting which I’ve sketched in this post. The setting doesn’t yield very good learning-theoretic results anyway, so getting a negative result here isn’t that interesting. Ideally the conjecture should be formulated in a setting where we can contrast it to some positive results.

There’s also reason to suspect the conjecture to be false. There’s a natural instrumental convergence toward dynamic consistency; a system will self-modify to greater consistency in many cases. If there’s an attractor basin around full agency, one would not expect it to be that hard to set up incentives which push things into that attractor basin.