Occam’s Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann

[Epistemic Status: My inside view feels confident, but I’ve only discussed this with one other person so far, so I won’t be surprised if it turns out to be confused.]

Armstrong and Mindermann (A&M) argue “that even with a reasonable simplicity prior/Occam’s razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple ‘normative’ assumptions, which cannot be deduced exclusively from observations.”

I explain why I think their argument is faulty, concluding that maybe Occam’s Razor is sufficient to do the job after all.

In what follows I assume the reader is already familiar with the paper, or at least with the concepts in it.

Brief summary of A&M’s argument:

(This is merely a brief sketch of A&M’s argument; I’ll engage with it in more detail below. For the full story, read their paper.)

Take a human policy pi = P(R) that we are trying to represent in the planner-reward formalism. R is the human’s reward function, which encodes their desires/preferences/values/goals. P() is the human’s planner function, which encodes how they take their experiences as input and try to choose outputs that achieve their reward. The policy pi, then, encodes the overall behavior of the human in question.
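
To make the formalism concrete, here is a minimal toy sketch in Python. (The toy world, the myopic planner, and all the names here are my own illustration, not anything from A&M’s paper.)

```python
from typing import Callable

# Toy types for the decomposition pi = P(R).
State, Action = str, str
Reward = Callable[[State], float]      # R: encodes desires/values/goals
Policy = Callable[[State], Action]     # pi: the human's overall behavior
Planner = Callable[[Reward], Policy]   # P: turns a reward into behavior

# A toy world: from any state you can "eat" (leading to "fed") or
# "wait" (leading to "hungry").
successor = {"eat": "fed", "wait": "hungry"}

# One possible planner: pick whichever action leads to the immediately
# highest-reward successor state.
myopic_planner: Planner = lambda reward: (
    lambda state: max(successor, key=lambda a: reward(successor[a])))

true_reward: Reward = lambda s: 1.0 if s == "fed" else 0.0
pi: Policy = myopic_planner(true_reward)  # pi = P(R)
print(pi("hungry"))                       # -> "eat"
```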

Step 1: In any reasonable language, for any plausible policy, you can construct “degenerate” planner-reward pairs that are almost as simple as the simplest possible way to generate the policy, yet yield high regret (i.e. have a reward component which is very different from the “true”/“intended” one).

  • Example: The planner deontologically follows the policy, despite a Buddha-like empty utility function.

  • Example: The planner greedily maximizes the reward function “obedience-to-the-policy.”

  • Example: A double-negated version of example 2.

It’s easy to see that these examples, being constructed from the policy, are at most slightly more complex than the simplest possible way to generate the policy, since their construction can simply make use of that way.
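
Here is roughly what the first two degenerate pairs look like in a toy Python sketch of my own (with a stand-in policy pi); the point is just that each pair is “pi plus a constant-sized wrapper”:

```python
from typing import Callable

State, Action = str, str
Policy = Callable[[State], Action]

# Stand-in for (the simplest program computing) the human policy.
pi: Policy = lambda state: "eat" if state == "hungry" else "wait"
actions = ["eat", "wait"]

# Degenerate pair 1: an empty, indifferent reward plus a planner that
# deontologically follows the policy, ignoring the reward entirely.
empty_reward = lambda state: 0.0
def deontological_planner(reward) -> Policy:
    return pi

# Degenerate pair 2: a greedy planner plus the reward
# "obedience-to-the-policy" (reward 1 for doing whatever pi does).
def obedience_reward(state: State, action: Action) -> float:
    return 1.0 if action == pi(state) else 0.0
def greedy_planner(reward) -> Policy:
    return lambda state: max(actions, key=lambda a: reward(state, a))

# Both pairs reproduce pi exactly, yet their reward components say nothing
# like what we mean by the human's true preferences; and since each is just
# "pi plus a constant-sized wrapper", each is at most slightly more complex
# than the simplest program for pi.
assert deontological_planner(empty_reward)("hungry") == pi("hungry")
assert greedy_planner(obedience_reward)("hungry") == pi("hungry")
```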

Step 2: The “intended” planner-reward pair—the one that humans would judge to be a reasonable decomposition of the human policy in question—is likely to be significantly more complex than the simplest possible planner-reward pair.

  • Argument: It’s really complicated.

  • Argument: The pair contains more information than the policy, so it should be more complicated.

  • Argument: Philosophers and economists have been trying for years and haven’t succeeded yet.

Conclusion: If we use Occam’s Razor alone to find planner-reward pairs that fit a particular human’s behavior, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are building an AI to maximize the reward.

Methinks the argument proves too much:

My first point is that A&M’s argument probably works just as well for other uses of Occam’s Razor. In particular, it works just as well for the canonical use: finding the Laws and Initial Conditions that describe our universe!

Take a sequence of events we are trying to predict/represent with the lawlike-universe formalism, which posits C (the initial conditions) and L(), the dynamical laws: a function that takes the initial conditions and extrapolates everything else from them. L(C) = E, the sequence of events/conditions/world-states we are trying to predict/represent.

Step 1: In any reasonable language, for any plausible sequence of events, we can construct “degenerate” initial condition + laws pairs that are almost as simple as the simplest pair.

  • Example: The initial conditions are an empty void, but the laws say “And then the sequence of events that happens is E.”

  • Example: The initial conditions are simply E, and L() doesn’t do anything.

It’s easy to see that these examples, being constructed from E, are at most slightly more complex than the simplest possible pair, since they could use the simplest pair to generate E.
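
To spell out the shape of the argument (my gloss, writing K(x) for the length of the shortest description of x in some fixed reasonable language; A&M’s own formalism differs in its details):

```latex
% K(x) = length of the shortest description of x in some fixed reasonable
% language (my notation, not necessarily A&M's).
\[
  K(E) \le K(L, C) + O(1) \quad \text{for any pair with } L(C) = E,
\]
\[
  K(L_{\text{deg}}, C_{\text{deg}}) \le K(E) + O(1) \quad \text{by the constructions above,}
\]
% and analogously K(P_deg, R_deg) <= K(pi) + O(1) in the planner-reward
% case: every degenerate pair sits within an additive constant of the
% simplest possible pair.
```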

Step 2: The “intended” initial condition + law pair is likely to be significantly more complex than the simplest pair.

  • Argument: It’s really complicated.

  • Argument: The pair contains more information than the sequence of events, so it should be more complicated.

  • Argument: Physicists have been trying for years and haven’t succeeded yet.

Conclusion: If we use Occam’s Razor alone to find law-condition pairs that fit all the world’s events, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are e.g. building an AI to do science for us and answer counterfactual questions like “If we had posted the nuclear launch codes on the Internet, would any nukes have been launched?”

This conclusion may actually be true, but it’s a pretty controversial claim and I predict most philosophers of science wouldn’t be impressed by this argument for it—even the ones who agree with the conclusion.

Objecting to the three arguments for Step 2

Consider the following hypothesis, which is basically equivalent to the claim A&M are trying to disprove:

Occam Sufficiency Hypothesis: The “intended” pair happens to be the simplest way to generate the policy.

Notice that everything in Step 1 is consistent with this hypothesis. The degenerate pairs are constructed from the policy, so they are more complicated than the simplest way to generate it; so if that way is via the intended pair, they are more complicated (albeit only slightly) than the intended pair.
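
In the same rough notation as above (again my gloss, not A&M’s own formalism), the hypothesis and its compatibility with Step 1 look like this:

```latex
% The Occam Sufficiency Hypothesis: the intended pair attains the minimum
% complexity among all decompositions of pi,
\[
  K(P^{*}, R^{*}) \;=\; \min_{(P, R)\,:\,P(R) = \pi} K(P, R).
\]
% Step 1 only shows that the degenerate pairs are within O(1) of this
% minimum; it does not exhibit anything simpler than the intended pair.
```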

Next, notice that the three arguments in support of Step 2 don’t really hurt this hypothesis:

Re: first argument: The intended pair can be both very complex and the simplest way to generate the policy; no contradiction there. Indeed that’s not even surprising: since the policy is generated by a massive messy neural net in an extremely diverse environment, we should expect it to be complex. What matters for our purposes is not how complex the intended pair is, but rather how complex it is relative to the simplest possible way to generate the policy. A&M need to argue that the simplest possible way to generate the policy is simpler than the intended pair; arguing that the intended pair is complex is at best only half the argument.

Compare to the case of physics: Sure, the laws of physics are complex. They probably take at least a page of code to write up. And that’s aspirational; we haven’t even got to that point yet. But that doesn’t mean Occam’s Razor is insufficient to find the laws of physics.

Re: second argument: The inference from “this pair contains more information than the policy” to “this pair is more complex than the policy” is fallacious. Of course the intended pair contains more information than the policy! All ways of generating the policy contain more information than it. This is because there are many ways (e.g. many planner-reward pairs) to get any given policy, and thus specifying any particular way gives you strictly more information than simply specifying the policy.
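
Put in the same rough notation (my gloss):

```latex
% Every decomposition determines the policy (just run the planner on the
% reward), so specifying a pair always specifies at least as much as
% specifying pi:
\[
  P(R) = \pi \;\Longrightarrow\; K(P, R) \ge K(\pi) - O(1).
\]
% But this lower bound holds for every pair, including the simplest one,
% so "contains more information than the policy" cannot by itself show
% that the intended pair is more complex than the simplest pair.
```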

Compare to the case of physics: Even once we’ve been given the complete history of the world (or a complete history of some arbitrarily large set of experiment-events) there will still be additional things left to specify about what the laws and initial conditions truly are. Do the laws contain a double negation in them, for example? Do they have some weird clause that creates infinite energy, but only when a certain extremely rare interaction occurs that never in fact occurs? What language are the laws written in, anyway? And what about the initial conditions? Lots of things are left to specify that aren’t determined by the complete history of the world. Yet this does not mean that the Laws + Initial Conditions are more complex than the complete history of the world, and it certainly doesn’t mean we’ll be led astray if we believe in the Laws + Conditions pair that is simplest.

Re: third argument: Yes, people have been trying to find planner-reward pairs to explain human behavior for many years, and yes, no one has managed to build a simple algorithm that does it yet. Instead we rely on all sorts of implicit and intuitive heuristics, and we still don’t fully succeed. But all of this can be said about physics too. It’s not like physicists are literally following the Occam’s Razor algorithm—iterating through all possible Law + Condition pairs in order from simplest to most complex and checking each one to see if it outputs a universe consistent with all our observations. And moreover, physicists haven’t fully succeeded either. Nevertheless, many of us are still confident that Occam’s Razor is in principle sufficient: if we were to follow the algorithm exactly, with enough data and compute, we would eventually settle on a Law + Condition pair that accurately describes reality, and it would be the true pair. Again, maybe we are wrong about that, but the arguments A&M have given so far aren’t convincing.
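
For concreteness, here is the in-principle procedure that sentence gestures at, as a hedged Python sketch; `enumerate_candidates_by_simplicity` and `consistent` are hypothetical stand-ins, and the whole thing is uncomputable in general:

```python
def occams_razor(observations, enumerate_candidates_by_simplicity, consistent):
    """An idealized (and wildly impractical) Occam's Razor procedure:
    return the first, i.e. simplest, candidate Law+Condition pair whose
    predicted history is consistent with every observation.

    Both helper arguments are hypothetical stand-ins; the enumeration is
    infinite and running arbitrary candidates is uncomputable, so this is
    an in-principle spec, not something anyone could execute."""
    for candidate in enumerate_candidates_by_simplicity():  # simplest first
        predicted_history = candidate.run()
        if all(consistent(predicted_history, obs) for obs in observations):
            return candidate
    return None
```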

Conclusion

Perhaps Occam’s Razor is insufficient after all. (Indeed I suspect as much, for reasons I’ll sketch in the appendix.) But as far as I can tell, A&M’s arguments are at best very weak evidence against the sufficiency of Occam’s Razor for inferring human preferences, and moreover they work pretty much just as well against the canonical use of Occam’s Razor too.

This is a bold claim, so I won’t be surprised if it turns out I was confused. I look forward to hearing people’s feedback. Thanks in advance! And thanks especially to Armstrong and Mindermann if they take the time to reply.


Many thanks to Ramana Kumar for hearing me out about this a while ago when we read the paper together.


Appendix: So, is Occam’s Razor sufficient or not?

--A priori, we should expect something more like a speed prior, rather than a pure complexity prior, to be appropriate for identifying the mechanisms of a finite mind.

--Sure enough, we can think of scenarios in which e.g. a deterministic universe with somewhat simple laws develops consequentialists who run massive simulations, including of our universe, and then write down Daniel’s policy in flaming letters somewhere, such that the algorithm “Run this deterministic universe until you find big flaming letters, then read out that policy” becomes a very simple way to generate Daniel’s policy. (This is basically just the “Universal Prior is Malign” idea applied in a new way.)

--So yeah, a pure complexity prior is probably not good. But maybe a speed prior would work, or something like it. Or maybe not. I don’t know.

--One case that seems useful to me: Suppose we are considering two explanations of someone’s behavior: (A) They desire the well-being of the poor, but [insert epicycles here to explain why they aren’t donating much, are donating conspicuously, are donating ineffectively] and (B) They desire that their peers (and they themselves) believe that they desire the well-being of the poor. Thanks to the epicycles in (A), both theories fit the data equally well. But theory B is much simpler. Do we conclude that this person really does desire the well-being of the poor, or not? If we think that even though (A) is more complex it is also more accurate, then yeah, it seems like Occam’s Razor is insufficient to infer human preferences. But if we instead think “Yeah, this person just really doesn’t care, and the proof is how much simpler B is than A,” then it seems we really are using something like Occam’s Razor to infer human preferences. Of course, this is just one case, so the only way it could prove anything is as a counterexample. To me it doesn’t seem like a counterexample to Occam’s sufficiency, but I could perhaps be convinced to change my mind about that.

--Also, I’m pretty sure that once we have better theories of the brain and mind, we’ll have new concepts and theoretical posits to explain human behavior. (E.g. something something Karl Friston something something free energy?) Thus, the simplest generator of a given human’s behavior will probably not divide automatically into a planner and a reward; it’ll probably have many components, and there will be debates about which components the AI should be faithful to (dub these components the reward) and which components the AI should seek to surpass (dub these components the planner). These debates may be intractable, turning on subjective and/or philosophical considerations. So this is another sense in which I think yeah, definitely Occam’s Razor isn’t sufficient—for we will also need to have a philosophical debate about what rationality is.