Coherent behaviour in the real world is an incoherent concept

Note: after putting this online, I noticed several problems with my original framing of the arguments. While I don’t think they invalidated the overall conclusion, they did (ironically enough) make the post much less coherent. The version below has been significantly edited in an attempt to alleviate these issues.

Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function. In this post I dig deeper into this disagreement, concluding that Rohin is broadly correct, although the issue is more complex than he makes it out to be. Here’s Eliezer’s summary of his original argument:

Violations of coherence constraints in probability theory and decision theory correspond to qualitatively destructive or dominated behaviors. Coherence violations so easily computed as to be humanly predictable should be eliminated by optimization strong enough and general enough to reliably eliminate behaviors that are qualitatively dominated by cheaply computable alternatives. From our perspective this should produce agents such that, ceteris paribus, we do not think we can predict, in advance, any coherence violation in their behavior.

First we need to clarify what Eliezer means by coherence. He notes that there are many formulations of coherence constraints: restrictions on preferences which imply that an agent which obeys them is maximising the expectation of some utility function. I’ll take the standard axioms of VNM utility as one representative set of constraints. In this framework, we consider a set O of disjoint outcomes. A lottery is some assignment of probabilities to the elements of O such that they sum to 1. For any pair of lotteries, an agent can either prefer one to the other, or be indifferent between them; let P be the function (from pairs of lotteries to a choice between them) defined by these preferences. The agent is incoherent if P violates any of the following axioms: completeness, transitivity, continuity, and independence. Eliezer gives several examples of how an agent which violates these axioms can be money-pumped, which is an example of the “destructive or dominated” behaviour he mentions in the quote above. And by the VNM theorem, any agent which doesn’t violate these axioms has preferences which are equivalent to maximising the expectation of some utility function over O (a function mapping the outcomes in O to real numbers).
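
To make this setup concrete, here is a minimal sketch (in Python, with made-up outcome names and utility numbers) of a lottery and a coherent preference function P of the form the VNM theorem guarantees: one which compares lotteries by the expectation of some utility function over O.

```python
from typing import Dict

# A lottery assigns probabilities (summing to 1) to the disjoint outcomes in O.
Lottery = Dict[str, float]

def expected_utility(lottery: Lottery, utility: Dict[str, float]) -> float:
    """Expectation of a utility function over the outcomes in a lottery."""
    return sum(p * utility[outcome] for outcome, p in lottery.items())

def P(lottery_a: Lottery, lottery_b: Lottery, utility: Dict[str, float]) -> str:
    """A coherent preference function: prefer whichever lottery has higher
    expected utility, and be indifferent when they are equal."""
    u_a = expected_utility(lottery_a, utility)
    u_b = expected_utility(lottery_b, utility)
    return "prefer A" if u_a > u_b else "prefer B" if u_b > u_a else "indifferent"

# Hypothetical outcomes and utilities, purely for illustration.
utility = {"apple": 1.0, "banana": 2.0, "cherry": 0.0}
print(P({"apple": 0.5, "cherry": 0.5}, {"banana": 1.0}, utility))  # "prefer B"
```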

It’s crucial to note that, in this setup, coherence is a property of an agent’s preferences at a single point in time. The outcomes that we are considering are all mutually exclusive, so an agent’s preferences over other outcomes are irrelevant after one outcome has already occurred. In addition, preferences are not observed but rather hypothetical: since outcomes are disjoint, we can’t actually observe the agent choosing a lottery and receiving a corresponding outcome (more than once).¹ And those hypothetical choices are always between known lotteries with fixed probabilities, rather than being based on our subjective probability estimates as they are in the real world. But Eliezer’s argument above makes use of a version of coherence which doesn’t possess any of these traits: it is a property of the observed behaviour of agents with imperfect information, over time. VNM coherence is not well-defined in this setup, so if we want to formulate a rigorous version of this argument, we’ll need to specify a new definition of coherence which extends the standard instantaneous-hypothetical one.

A first step is to introduce the element of time, by changing the one-off choice between lotteries to repeated choices. A natural tool to use here is the Markov Decision Process (MDP) formalism: at each timestep, an agent chooses one of the actions available in its current state, which leads it to a new state according to a (possibly nondeterministic) transition function, resulting in a corresponding reward. We can think of our own world as an MDP (without rewards), in which a state is a snapshot of the entire universe at a given instant. We can then define a trajectory as a sequence of states and actions which goes from the starting state of an MDP to a terminal state. In the real world, this corresponds to a complete description of one way in which the universe could play out from beginning to end.
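
As a rough illustration of this formalism (the names and structure below are my own, not any standard library’s API), an MDP and a trajectory through it might be represented like this:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple
import random

@dataclass
class MDP:
    states: List[str]
    actions: Callable[[str], List[str]]                 # actions available in a state
    transition: Callable[[str, str], Dict[str, float]]  # (state, action) -> distribution over next states
    reward: Callable[[str, str, str], float]            # (state, action, next state) -> reward
    start: str
    terminal: List[str]

# A trajectory: the sequence of (state, action) pairs from the starting state
# until a terminal state is reached.
Trajectory = List[Tuple[str, str]]

def rollout(mdp: MDP, policy: Callable[[str], str]) -> Trajectory:
    """Sample one trajectory by following a policy through the MDP."""
    trajectory, state = [], mdp.start
    while state not in mdp.terminal:
        action = policy(state)
        trajectory.append((state, action))
        next_states = mdp.transition(state, action)
        state = random.choices(list(next_states), weights=list(next_states.values()))[0]
    return trajectory
```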

Here are two ways in which we could define an agent’s preferences in the context of an MDP:

  • Definition 1: the agent has preferences over states, and wants to spend its time in its preferred states, regardless of which order it visits them or what its past trajectory looked like. This is equivalent to the agent wanting to maximise the rewards it receives from some reward function defined over states.

  • Definition 2: the agent’s preferences are choices between lotteries over entire state-action trajectories it could take through the MDP. (In this case, we can ignore the rewards.)

Under both of these definitions, we can characterise incoherence in a similar way as in the classic VNM rationality setup, by evaluating the agent’s preferences over outcomes. To be clear on the difference between them: under definition 1 an outcome is a state, one of which occurs every timestep, and a coherent agent’s preferences over them are defined without reference to any past events. Whereas under definition 2 an outcome is an entire trajectory (composed of a sequence of states and actions), only one of which ever occurs, and a coherent agent’s preferences about the future may depend on what happened in the past in arbitrary ways. To see how this difference plays out in practice, consider the following example of non-transitive travel preferences: an agent which pays $50 to go from San Francisco to San Jose, then $50 to go from San Jose to Berkeley, then $50 to go from Berkeley to San Francisco (note that the money in this example is just a placeholder for anything the agent values). Under definition 1, the agent violates transitivity, and is incoherent. Under definition 2, it could just be that the agent prefers trajectories in which it travels round in a circle, compared with other available trajectories. Since Eliezer uses this situation as an example of incoherence, it seems like he doesn’t intend preferences to be defined over trajectories. So let’s examine definition 1 in more detail.
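
To spell out the definition-1 verdict on this example: reading each $50 trip as a revealed strict preference for the destination state over the origin state gives a preference cycle, which transitivity forbids. A small illustrative sketch:

```python
# Under definition 1, each $50 trip reveals a strict preference for the
# destination state over the origin state (the money stands in for anything
# the agent values).
revealed = [("San Jose", "San Francisco"),    # paid to go SF -> SJ
            ("Berkeley", "San Jose"),         # paid to go SJ -> Berkeley
            ("San Francisco", "Berkeley")]    # paid to go Berkeley -> SF

def has_preference_cycle(prefs):
    """Transitivity of strict preference rules out cycles, so finding one
    shows that the revealed preferences over states are incoherent."""
    better = {}
    for a, b in prefs:                 # a is strictly preferred to b
        better.setdefault(a, set()).add(b)
    changed = True
    while changed:                     # take the transitive closure
        changed = False
        for a in list(better):
            for b in list(better[a]):
                for c in list(better.get(b, set())):
                    if c not in better[a]:
                        better[a].add(c)
                        changed = True
    return any(a in worse for a, worse in better.items())

print(has_preference_cycle(revealed))  # True: incoherent under definition 1
```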

When we do so, we find that it has several shortcomings—in particular, it rules out some preferences which seem to be reasonable and natural ones. For example, suppose you want to write a book which is so timeless that at least one person reads it every year for the next thousand years. There is no single point at which the state of the world contains enough information to determine whether you’ve succeeded or failed in this goal: in any given year there may be no remaining record of whether somebody read it in a previous year (or the records could have been falsified, etc). This goal is fundamentally a preference over trajectories.² In correspondence, Rohin gave me another example: someone whose goal is to play a great song in its entirety, and who isn’t satisfied with the prospect of playing the final note while falsely believing that they’ve already played the rest of the piece. More generally, I think that virtue-ethicists and deontologists are more accurately described as caring about world-trajectories than world-states—and almost all humans use these theories to some extent when choosing their actions. Meanwhile Eric Drexler’s CAIS framework relies on services which are bounded in time taken and resources used—another constraint which can’t be expressed just in terms of individual world-states.

At this point it may seem that definition 2 is superior, but unfortunately it fails badly once we introduce the distinction between hypothetical and observed preferences, by specifying that we only get to observe the agent’s behaviour in the MDP over N timesteps. Previously we’d still been assuming that we could elicit the agent’s hypothetical preferences about every possible pair of lotteries, and judge its coherence based on those. What would it instead mean for its behaviour to be incoherent?

  • Under definition 1, given some reward function R, the value of an action can be defined using Bellman equations as the expected reward from the resulting transition, plus the expected value of the best action available at the next timestep. Then we can define an agent to be coherent iff there is some R such that the agent is only ever observed to take the highest-value action available to it.³ (A concrete sketch of this check appears after this list.)

  • Under definition 2, let P be the agent’s policy. Then each action gives rise to a distribution over trajectories, and so we can interpret each choice of action taken as a choice between lotteries over trajectories (in a way which depends on P, since the agent needs to predict how its future self will behave). Now we define an agent to be coherent iff there is some policy P and some coherent preference function Q such that all observed choices are consistent with Q, given the assumption that the agent will continue following P.
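
Here is the sketch promised in the first bullet: a simplified version of the definition-1 check, reusing the hypothetical MDP class from earlier. It computes action values for one given reward function R by finite-horizon backward induction (a simplification relative to footnote 3’s undiscounted infinite-horizon setting) and tests whether every observed action was highest-value; the definition itself then asks whether any R passes this test.

```python
from typing import Callable, Dict, List, Tuple

def q_values(mdp: MDP, R: Callable[[str], float], horizon: int) -> Dict[Tuple[str, str], float]:
    """Bellman backup: the value of an action is the expected reward of the
    resulting state plus the value of behaving optimally afterwards."""
    V = {s: 0.0 for s in mdp.states}
    Q: Dict[Tuple[str, str], float] = {}
    for _ in range(horizon):
        Q = {(s, a): sum(p * (R(s2) + V[s2])
                         for s2, p in mdp.transition(s, a).items())
             for s in mdp.states if s not in mdp.terminal
             for a in mdp.actions(s)}
        V = {s: (0.0 if s in mdp.terminal
                 else max(Q[(s, a)] for a in mdp.actions(s)))
             for s in mdp.states}
    return Q

def coherent_under_definition_1(observed: List[Tuple[str, str]],
                                mdp: MDP, R: Callable[[str], float],
                                horizon: int) -> bool:
    """True iff every observed (state, action) pair was a highest-value action
    according to R. Definition 1 asks whether *some* R makes this True."""
    Q = q_values(mdp, R, horizon)
    return all(Q[(s, a)] == max(Q[(s, b)] for b in mdp.actions(s))
               for s, a in observed)
```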

It turns out that under definition 2, any sequence of actions is coherent, since there’s always a preference function under which the trajectory that actually occurred was the best one possible (as Rohin pointed out here). I think this is a decisive objection to making claims about agents appearing coherent using definition 2, and so we’re left with definition 1. But note that there is no coherence theorem which says that an agent’s preferences need to be defined over states instead of trajectories, and in fact I’ve argued above that the latter is a more plausible model of humans. So even if definition 1 turns out to be a useful one, it would take additional arguments to show that we should expect that sort of coherence from advanced AIs, rather than (trivial) coherence with respect to trajectories. I’m not aware of any compelling arguments along those lines.
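
Concretely, the degenerate preference function that does the rationalising can just be an indicator on the realised trajectory: since trajectories include the actions taken, deviating at any point would have made that trajectory impossible, so every observed choice maximised expected utility under it. A one-line sketch:

```python
def rationalising_utility(observed_trajectory):
    """Utility over trajectories under which whatever actually happened was
    the best possible outcome: 1 for the realised trajectory, 0 otherwise."""
    return lambda trajectory: 1.0 if trajectory == observed_trajectory else 0.0
```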

And in fact, definition 1 turns out to have further problems. For example: I haven’t yet defined how a coherent agent is meant to choose between equally good options. One natural approach is to simply allow it to make any choice in those situations—it can hardly be considered irrational for doing so, since by assumption whatever it chooses is just as good as any other option. However, in that case any behaviour is consistent with the indifferent preference function (which rates all outcomes as equal). So even under definition 1, any sequence of actions is coherent. Now, I don’t think it’s very realistic that superintelligent AGIs will actually be indifferent about the effects of most of their actions, so perhaps we can just rule out preferences which feature indifference too often. But note that this adds an undesirable element of subjectivity to our definition.
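
The same triviality shows up in the sketch above: a constant reward function makes every action tie for highest value, so the definition-1 check accepts any observed behaviour whatsoever.

```python
# The indifferent preference function: every state is rated equally good.
indifferent_R = lambda state: 0.0

# With this R, all Q-values are equal in every state, so
# coherent_under_definition_1(observed, mdp, indifferent_R, horizon)
# returns True no matter what the agent was observed doing.
```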

That subjectivity is exacerbated when we try to model the fact that decisions in the real world are made under conditions of imperfect information. I won’t cover this in detail, but the basic idea is that we change the setting from an MDP to a partially observable MDP (aka POMDP), and instead of requiring coherent agents to take the actions which are actually best according to their preferences, they simply need to take the actions which are best according to their beliefs. How do we know what their beliefs are? We can’t deduce them from agents’ behaviour, and we can’t just read them off from internal representations (at least, not in general). I think the closest we can get is to say that an agent is coherent if there is any prior belief state and any coherent preference function such that, if we assume that it updates its beliefs via Bayesian conditionalisation, the agent always takes the action which it believes to be best. Unfortunately (but unsurprisingly), we’ve yet again defined incoherence out of existence. In this case, given that we can only observe a bounded number of the agent’s actions, there’s always some pathological prior which justifies its behaviour. We could address this problem by adding the constraint that the prior needs to be a “reasonable” one, but this is a very vague term, and there’s no consensus on what it actually means.
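
For reference, the Bayesian conditionalisation assumed in that definition is just the standard update rule, sketched below over a discrete set of hypotheses (the hypothesis names and likelihood function are placeholders). The problem is that nothing in this rule constrains the prior itself, which is what lets a sufficiently pathological prior make almost any observed action count as “best according to its beliefs”.

```python
from typing import Callable, Dict

def bayes_update(prior: Dict[str, float],
                 likelihood: Callable[[str, str], float],
                 observation: str) -> Dict[str, float]:
    """Bayesian conditionalisation: posterior(h) is proportional to
    prior(h) * P(observation | h)."""
    unnormalised = {h: p * likelihood(observation, h) for h, p in prior.items()}
    total = sum(unnormalised.values())
    return {h: u / total for h, u in unnormalised.items()}

# A pathological prior: put nearly all the mass on a hypothesis under which
# the agent's observed behaviour happens to maximise expected reward.
pathological_prior = {"world_where_my_actions_were_optimal": 0.999,
                      "every_other_hypothesis": 0.001}
```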

There’s a final issue with the whole setup of an agent traversing states: in the real world, and in examples like non-transitive travel, we never actually end up in quite the same state we started in. Perhaps we’ve gotten sunburned along the journey. Perhaps we spent a few minutes editing our next blog post. At the very least, we’re now slightly older, and we have new memories, and the sun’s position has changed a little. And so, just like with definition 2, no series of choices can ever demonstrate incoherent revealed preferences in the sense of definition 1, since every choice actually made is between a different set of possible states. (At the very least, they differ in the agent’s memories of which path it took to get there.⁴ And note that outcomes which are identical except for slight differences in memories should sometimes be treated in very different ways, since having even a few bits of additional information from exploration can be incredibly advantageous.)

Now, this isn’t so relevant in the human context because we usually abstract away from the small details. For example, if I offer to sell you an ice-cream and you refuse it, and then I offer it again a second later and you accept, I’d take that as evidence that your preferences are incoherent—even though technically the two offers are different, because accepting the first just leads you to a state where you have an ice-cream, while accepting the second leads you to a state where you both have an ice-cream and remember refusing the first offer. Similarly, I expect that you don’t consider two outcomes to be different if they only differ in the precise pattern of TV static or the exact timing of leaves rustling. But again, there are no coherence constraints saying that an agent can’t consider such factors to be immensely significant, enough to totally change their preferences over lotteries when you substitute in one such outcome for the other.

So for the claim that sufficiently optimised agents appear coherent to be non-trivially true under definition 1, we’d need to clarify that such coherence is only with respect to outcomes when they’re categorised according to the features which humans consider important, except for the ones which are intrinsically temporally extended, conditional on the agent having a reasonable prior and not being indifferent over too many options. But then the standard arguments from coherence constraints no longer apply, because they’re based on maths, not the ill-defined concepts used in the previous sentence. At this point I think it’s better to abandon the whole idea of formal coherence as a predictor of real-world behaviour, and replace it with Rohin’s notion of “goal-directedness”, which is more upfront about being inherently subjective, and doesn’t rule out any of the goals that humans actually have.

Thanks to Tim Genewein, Ramana Kumar, Victoria Krakovna, Rohin Shah, Toby Ord and Stuart Armstrong for discussions which led to this post, and helpful comments.

[1] Disjointedness of outcomes makes this argument more succinct, but it’s not actually a necessary component, because once you’ve received one outcome, your preferences over all other outcomes are allowed to change. For example, having won $1,000,000, the value you place on other financial prizes will very likely go down. This is related to my later argument that you never actually have multiple paths to ending up in the “same” state.

[2] At this point you could object on a technicality: from the unitarity of quantum mechanics, it seems as if the laws of physics are in fact reversible, and so the current state of the universe (or multiverse, rather) actually does contain all the information you theoretically need to deduce whether or not any previous goal has been satisfied. But I’m limiting this claim to macroscopic-level phenomena, for two reasons. Firstly, I don’t think our expectations about the behaviour of advanced AI should depend on very low-level features of physics in this way; and secondly, if the objection holds, then preferences over states have all the same problems as preferences over trajectories.

[3] Technical note: I’m assuming an infinite time horizon and no discounting, because removing either of those conditions leads to weird behaviour which I don’t want to dig into in this post. In theory this leaves open the possibility of infinite expected reward, or of lotteries over infinitely many outcomes, but I think that we can just ignore these cases without changing the core idea behind my argument. The underlying assumption here is something like: whether we model the universe as finite or infinite shouldn’t significantly affect whether we expect AI behaviour to be coherent over the next few centuries, for any useful definition of coherent.

[4] Perhaps you can construct a counterexample involving memory loss, but this doesn’t change the overall point, and if you’re concerned with such technicalities you’ll also have to deal with the problems I laid out in footnote 2.