Coherent behaviour in the real world is an incoherent concept

(Edit: I’m no longer confident that the two definitions I used below are useful. I still stand by the broad thrust of this post, but am in the process of rethinking the details).

Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function. In this post I dig deeper into this disagreement, concluding that Rohin is broadly correct, although the issue is more complex than he makes it out to be. Here’s Eliezer’s summary of his original argument:

Violations of coherence constraints in probability theory and decision theory correspond to qualitatively destructive or dominated behaviors. Coherence violations so easily computed as to be humanly predictable should be eliminated by optimization strong enough and general enough to reliably eliminate behaviors that are qualitatively dominated by cheaply computable alternatives. From our perspective this should produce agents such that, ceteris paribus, we do not think we can predict, in advance, any coherence violation in their behavior.

First, we need to clarify what Eliezer means by coherence. He notes that there are many formulations of coherence constraints: restrictions on preferences which imply that an agent which obeys them is maximising the expectation of some utility function. I’ll take the standard axioms of VNM utility as one representative set of constraints. In this framework, we consider a set O of disjoint outcomes. A lottery is some assignment of probabilities to the elements of O such that they sum to 1. For any pair of lotteries, an agent can either prefer one to the other, or be indifferent between them; let P be the function (from pairs of lotteries to a choice between them) defined by these preferences. The agent is incoherent if P violates any of the following axioms: completeness, transitivity, continuity, and independence. Eliezer gives several examples of how an agent which violates these axioms can be money-pumped, which is an example of the “destructive or dominated” behaviour he mentions in the quote above. And any agent which doesn’t violate these axioms has behaviour which corresponds to maximising the expectation of some utility function over O (a function mapping the outcomes in O to real numbers).
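
To make the money-pump point concrete, here is a minimal Python sketch (my own illustration, not Eliezer’s): an agent with cyclic preferences over three otherwise-equivalent outcomes will pay a fee for every “upgrade” around the cycle, ending up holding what it started with but strictly poorer. The decision rule and the fee are hypothetical choices for illustration.

```python
# Toy money pump: an agent with cyclic preferences A < B < C < A pays a small fee
# for each swap to the thing it prefers, and so loses money without bound.

CYCLE = {"A": "B", "B": "C", "C": "A"}  # the agent strictly prefers CYCLE[x] to x

def accepts_swap(holding: str, offered: str) -> bool:
    """Hypothetical decision rule: accept any small fee to get the preferred item."""
    return CYCLE[holding] == offered

def run_money_pump(start: str, fee: float, rounds: int) -> float:
    """Offer the agent its preferred swap each round; return the total fees paid."""
    holding, paid = start, 0.0
    for _ in range(rounds):
        offer = CYCLE[holding]
        if accepts_swap(holding, offer):
            holding, paid = offer, paid + fee
    return paid

print(run_money_pump("A", fee=1.0, rounds=30))  # 30.0, and the agent is back at "A"
```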

It’s crucial to note that, in this setup, coherence is a property of an agent’s preferences at a single point in time. The outcomes that we are considering are all mutually exclusive, so an agent’s preferences over other outcomes are irrelevant after one outcome has already occurred. In addition, preferences are not observed but rather hypothetical: since outcomes are disjoint, we can’t actually observe the agent choosing a lottery and receiving a corresponding outcome (more than once).¹ But Eliezer’s argument above makes use of a concept of coherence which differs in two ways: it is a property of observed behaviour rather than hypothetical preferences, and it is assessed over time rather than at a single instant. VNM coherence is not well-defined in this setup, so if we want to formulate a rigorous version of this argument, we’ll need to specify a new definition of coherence which extends the standard instantaneous-hypothetical one. Here are two possible ways of doing so:

  • Definition 1: Let O be the set of all possible “snapshots” of the state of the universe at a single instant (which I shall call world-states). At each point in time when an agent chooses between different actions, that can be interpreted as a choice between lotteries over states in O. Its behaviour is coherent iff the set of all preferences revealed by those choices is consistent with some coherent preference function P over all pairs of lotteries over O AND there is a corresponding utility function which assigns values to each state that are consistent with the relevant Bellman equations. In other words, an agent’s observed behaviour is coherent iff there’s some utility function such that the utility of each state is some fixed value assigned to that state + the expected value of the best course of action starting from that state, and the agent has always chosen the action with the highest expected utility.² (I sketch this condition in symbols just below the list.)

  • Definition 2: Let O be the set of all possible ways that the entire universe could play out from beginning to end (which I shall call world-trajectories). Again, at each point in time when an agent chooses between different actions, that can be interpreted as a choice between lotteries over O. However, in this case no set of observed choices can ever be “incoherent”—because, as Rohin notes, there is always a utility function which assigns maximal utility to all and only the world-trajectories in which those choices were made.
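
To pin down the Bellman-consistency clause in definition 1, here is one way to write it in symbols (a sketch under the assumptions in footnote 2; the reward term r, the belief distribution T and the argmax condition are my notation, not taken from any coherence theorem):

```latex
% Definition 1, sketched: behaviour is coherent iff there exist U, r and beliefs T with
U(s) \;=\; r(s) \;+\; \max_{a}\, \mathbb{E}_{s' \sim T(\cdot \mid s,\, a)}\!\left[ U(s') \right]
% for every world-state s, and every observed action a_t taken from state s_t satisfies
a_t \;\in\; \arg\max_{a}\, \mathbb{E}_{s' \sim T(\cdot \mid s_t,\, a)}\!\left[ U(s') \right].
```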

To be clear on the difference between them, under definition 1 an outcome is a world-state, one of which occurs every timestep, and a coherent agent makes every choice without reference to any past events (except insofar as they provide information about its current state or future states). Whereas under definition 2 an outcome is an entire world-trajectory (composed of a sequence of world-states), only one of which ever occurs, and a coherent agent’s future actions may depend on what happened in the past in arbitrary ways. To see how this difference plays out in practice, consider the following example of non-transitive travel preferences: an agent which pays $50 to go from San Francisco to San Jose, then $50 to go from San Jose to Berkeley, then $50 to go from Berkeley to San Francisco (note that the money in this example is just a placeholder for anything the agent values). Under definition 2, this isn’t evidence that the agent is incoherent, but rather just an indication that it assigns more utility to world-trajectories in which it travels round in a circle than to other available world-trajectories. Since Eliezer uses this situation as an example of incoherence, he clearly doesn’t intend to interpret behaviour as a choice between lotteries over world-trajectories. So let’s examine definition 1 in more detail. But first note that there is no coherence theorem which says that an agent’s utility function needs to be defined over world-states instead of world-trajectories, and so it’ll take additional arguments to demonstrate that sufficiently optimised agents will care about the former instead of the latter. I’m not aware of any particularly compelling arguments for this conclusion—indeed, as I’ll explain later, I think it’s more plausible to model humans as caring about the latter.
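
Rohin’s rationalisation move under definition 2 can be spelled out in a couple of lines (a toy sketch; representing a world-trajectory by just its sequence of travel choices is my simplification):

```python
# Under definition 2, the circular traveller maximises *some* utility function over
# world-trajectories: one that assigns utility 1 to exactly the trajectories in which
# its observed choices were made, and 0 to everything else.

observed = ("pay $50: SF -> San Jose",
            "pay $50: San Jose -> Berkeley",
            "pay $50: Berkeley -> SF")

def trajectory_utility(trajectory: tuple) -> int:
    """Maximal utility for (and only for) trajectories containing the observed choices."""
    return 1 if trajectory[:len(observed)] == observed else 0

print(trajectory_utility(observed))         # 1: the circular route is "optimal"
print(trajectory_utility(("stay in SF",)))  # 0: staying put is "worse"
```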

Okay, so what about definition 1? This is a more standard interpretation of having preferences over time: requiring choices under uncertainty to move between different states makes this setup very similar to POMDPs, which are often used in reinforcement learning. It would be natural to now interpret the non-transitive travel example as follows: let F, J and B be the states of being in San Francisco, San Jose and Berkeley respectively. Then paying to go from F to J to B to F demonstrates incoherent preferences over states (assuming there’s also an option to just stay put in any of those states).
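
Read this way, the journey reveals a strict preference cycle over the three states, which is exactly a transitivity failure. A toy check (treating “paid to move from X to Y” as “Y is revealed-preferred to X”, and ignoring everything else about the situation):

```python
# Revealed strict preferences over states from the circular journey:
# paying to move from X to Y is read as "Y is strictly preferred to X".
STATES = ("F", "J", "B")
revealed = {("J", "F"), ("B", "J"), ("F", "B")}  # J > F, B > J, F > B

def has_preference_cycle(prefs: set) -> bool:
    """True if some triple a > b > c > a appears, contradicting transitivity."""
    return any((a, b) in prefs and (b, c) in prefs and (c, a) in prefs
               for a in STATES for b in STATES for c in STATES)

print(has_preference_cycle(revealed))  # True: F > B > J > F
```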

First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time. In fact, there are plenty of cases where you might choose to change your utility function (or have that change thrust upon you). I like Nate Soares’ example of wanting to become a rockstar; other possibilities include being blackmailed to change it, or sustaining brain damage. However, it seems unlikely that a sufficiently intelligent AGI will face these particular issues—and in fact the more capable it is of implementing its utility function, the more valuable it will consider the preservation of that utility function.³ So I’m willing to accept that, past a certain high level of intelligence, changes significant enough to affect what utility function a human would infer from that AGI’s behaviour seem unlikely.

Here’s a more important problem, though: we’ve now ruled out some preferences which seem to be reasonable and natural ones. For example, suppose you want to write a book which is so timeless that at least one person reads it every year for the next thousand years. There is no single point at which the state of the world contains enough information to determine whether you’ve succeeded or failed in this goal: in any given year there may be no remaining record of whether somebody read it in a previous year (or the records could have been falsified, etc). This goal is fundamentally a preference over world-trajectories.⁴ In correspondence, Rohin gave me another example: a person whose goal is to play a great song in its entirety, and who isn’t satisfied with the prospect of playing the final note while falsely believing that they’ve already played the rest of the piece.⁵ More generally, I think that virtue-ethicists and deontologists are more accurately described as caring about world-trajectories than world-states—and almost all humans use these theories to some extent when choosing their actions. Meanwhile Eric Drexler’s CAIS framework relies on services which are bounded in time taken and resources used—another constraint which can’t be expressed just in terms of individual world-states.

There’s a third issue with this framing: in examples like non-transitive travel, we never actually end up in quite the same state we started in. Perhaps we’ve gotten sunburned along the journey. Perhaps we spent a few minutes editing our next blog post. At the very least, we’re now slightly older, and we have new memories, and the sun’s position has changed a little. So really we’ve ended up in state F’, which differs in many ways from F. You can presumably see where I’m going with this: just like with definition 2, no series of choices can ever demonstrate incoherent revealed preferences in the sense of definition 1, since every choice actually made is between a different set of possible world-state outcomes. (At the very least, they differ in the agent’s memories of which path it took to get there.⁶ And note that outcomes which are identical except for slight differences in memories should sometimes be treated in very different ways, since having even a few bits of additional information from exploration can be incredibly advantageous.)

Now, this isn’t so relevant in the human context because we usually abstract away from the small details. For example, if I offer to sell you an ice-cream and you refuse it, and then I offer it again a second later and you accept, I’d take that as evidence that your preferences are incoherent—even though technically the two offers are different because accepting the first just leads you to a state where you have an ice-cream, while accepting the second leads you to a state where you both have an ice-cream and remember refusing the first offer. Similarly, I expect that you don’t consider two outcomes to be different if they only differ in the precise pattern of TV static or the exact timing of leaves rustling. But again, there are no coherence constraints saying that an agent can’t consider such factors to be immensely significant, enough to totally change their preferences over lotteries when you substitute in one such outcome for the other.

So for the claim that sufficiently optimised agents appear coherent to be non-trivially true under my first definition of coherence, we’d need to clarify that such coherence is only with respect to outcomes when they’re categorised according to the features which humans consider important, except for the ones which are intrinsically temporally extended. But then the standard arguments from coherence constraints no longer apply. At this point I think it’s better to abandon the whole idea of formal coherence as a predictor of real-world behaviour, and replace it with Rohin’s notion of “goal-directedness”, which is more upfront about being inherently subjective, and doesn’t rule out any of the goals that humans actually have.

Thanks to Tim Genewein, Ramana Kumar, Victoria Krakovna and Rohin Shah for discussions which led to this post, and for helpful comments.

[1] Disjointedness of outcomes makes this argument more succinct, but it’s not actually a necessary component, because once you’ve received one outcome, your preferences over all other outcomes are allowed to change. For example, having won $1000000, the value you place on other financial prizes will very likely go down. This is related to my later argument that you never actually have multiple paths to ending up in the “same” state.

[2] Technical note: I’m assuming an infinite time horizon and no discounting, because removing either of those conditions leads to weird behaviour which I don’t want to dig into in this post. In theory this leaves open the possibility of states with infinite expected utility, as well as lotteries over infinitely many different states, but I think we can just stipulate that neither of those possibilities arises without changing the core idea behind my argument. The underlying assumption here is something like: whether we model the universe as finite or infinite shouldn’t significantly affect whether we expect AI behaviour to be coherent over the next few centuries, for any useful definition of coherent.

[3] Consider the two limiting cases: if I have no power to implement my utility function, then it doesn’t make any difference what it changes to. By comparison, if I am able to perfectly manipulate the world to fulfil my utility function, then there is no possible change in it which will lead to better outcomes, and many which will lead to worse (from the perspective of my current utility function).

[4] At this point you could object on a technicality: from the unitarity of quantum mechanics, it seems as if the laws of physics are in fact reversible, and so the current state of the universe (or multiverse, rather) actually does contain all the information you theoretically need to deduce whether or not any previous goal has been satisfied. But I’m limiting this claim to macroscopic-level phenomena, for two reasons. Firstly, I don’t think our expectations about the behaviour of advanced AI should depend on very low-level features of physics in this way; and secondly, if the objection holds, then preferences over world-states have all the same problems as preferences over world-trajectories.

[5] In a POMDP, we don’t usually include an agent’s memories (i.e. a subset of previous observations) as part of the current state. However, it seems to me that in the context of discussing coherence arguments it’s necessary to do so, because otherwise going from a known good state to a known bad state and back in order to gain information is an example of incoherence. So we could also formulate this setup as a belief MDP. But I prefer talking about it as a POMDP, since that makes the agent seem less Cartesian—for example, it makes more sense to ask what happens after the agent “dies” in a POMDP than in a belief MDP.

[6] Perhaps you can construct a counterexample involving memory loss, but this doesn’t change the overall point, and if you’re concerned with such technicalities you’ll also have to deal with the problems I laid out in footnote 4.