Policy Approval

(ETA: The name “policy ap­proval” wasn’t great. I think I will use the term “policy al­ign­ment” to con­trast with “value al­ign­ment” go­ing for­ward, at the sug­ges­tion of Wei Dai in the com­ments.)

I re­cently had a con­ver­sa­tion with Stu­art Arm­strong in which I claimed that an agent which learns your util­ity func­tion (pre­tend­ing for a mo­ment that “your util­ity func­tion” re­ally is a well-defined thing) and at­tempts to op­ti­mize it is still not perfectly al­igned with you. He challenged me to write up spe­cific ex­am­ples to back up my claims.

I’ll also give a very sketchy al­ter­na­tive to value learn­ing, which I call policy ap­proval. (The policy ap­proval idea emerged out of a con­ver­sa­tion with An­drew Critch.)


Stu­art Arm­strong has re­cently been do­ing work show­ing the difficulty of in­fer­ring hu­man val­ues. To sum­ma­rize: be­cause hu­mans are ir­ra­tional, a value-learn­ing ap­proach like CIRL needs to jointly es­ti­mate the hu­man util­ity func­tion and the de­gree to which the hu­man is ra­tio­nal—oth­er­wise, it would take all the mis­takes hu­mans make to be prefer­ences. Un­for­tu­nately, this leads to a se­vere prob­lem of iden­ti­fi­a­bil­ity: hu­mans can be as­signed any val­ues what­so­ever if we as­sume the right kind of ir­ra­tional­ity, and the usual trick of prefer­ring sim­pler hy­pothe­ses doesn’t seem to help in this case.

I also want to point out that a similar prob­lem arises even with­out ir­ra­tional­ity. Vladimir Nesov ex­plored how prob­a­bil­ity and util­ity can be mixed into each other with­out chang­ing any de­ci­sions an agent makes. So, in prin­ci­ple, we can’t de­ter­mine the util­ity or prob­a­bil­ity func­tion of an agent uniquely based on the agent’s be­hav­ior alone (even in­clud­ing hy­po­thet­i­cal be­hav­ior in coun­ter­fac­tual situ­a­tions). This fact was dis­cov­ered ear­lier by Jeffrey and Bolker, and is an­a­lyzed in more de­tail in the book The Logic of De­ci­sion. For this rea­son, I call the trans­form “Jeffrey-Bolker ro­ta­tion”.

To give an illus­tra­tive ex­am­ple: it doesn’t mat­ter whether we as­sign very low prob­a­bil­ity to an event, or care very lit­tle about what hap­pens given that event. Sup­pose a love-max­i­miz­ing agent is un­able to as­sign nonzero util­ity to a uni­verse where love isn’t real. The agent may ap­pear to ig­nore ev­i­dence that love isn’t real. We can in­ter­pret this as not car­ing what hap­pens con­di­tioned on love not be­ing real; or, equally valid (in terms of the ac­tions which the agent chooses), we can in­ter­pret the agent as hav­ing an ex­tremely low prior prob­a­bil­ity on love not be­ing real.

At MIRI, we sometimes use the term “probutility” to indicate the (probability, utility) pair in a way which reminds us that they can’t be disentangled from one another. Jeffrey-Bolker rotation changes probabilities and utilities, but does not change the overall probutilities.
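The low-probability versus low-caring example can be checked numerically. The sketch below is a simplified illustration and not the full Jeffrey-Bolker rotation: it assumes a finite set of states and per-state utilities, then multiplies each state's probability by a factor while dividing that state's utilities by the same factor. The state names, action names, and numbers are all made up for illustration.

```python
# Simplified illustration (NOT the full Jeffrey-Bolker rotation):
# trade probability against utility state by state and check that
# the chosen action does not change.

def expected_utility(prob, util):
    """Expected utility of one action: sum over states of P(s) * U(s)."""
    return sum(prob[s] * util[s] for s in prob)

def best_action(prob, utils):
    """Return the action with the highest expected utility."""
    return max(utils, key=lambda a: expected_utility(prob, utils[a]))

def rescale(prob, utils, factors):
    """Multiply P(s) by factors[s] (then renormalize); divide U(a, s) by factors[s]."""
    unnormalized = {s: prob[s] * factors[s] for s in prob}
    total = sum(unnormalized.values())
    new_prob = {s: p / total for s, p in unnormalized.items()}
    new_utils = {a: {s: u[s] / factors[s] for s in u} for a, u in utils.items()}
    return new_prob, new_utils

# Reading 1: the agent gives "love is not real" 50% credence,
# but barely cares what happens in that case.
prob = {"love_real": 0.5, "love_not_real": 0.5}
utils = {
    "seek_love": {"love_real": 10.0, "love_not_real": 0.01},
    "stay_home": {"love_real": 2.0, "love_not_real": 0.03},
}

# Reading 2: shrink that state's probability 100x and grow its utilities 100x.
factors = {"love_real": 1.0, "love_not_real": 0.01}
new_prob, new_utils = rescale(prob, utils, factors)

print(best_action(prob, utils), best_action(new_prob, new_utils))
```

Both readings select the same action, because every action's expected utility is divided by the same normalizing constant. Behavior alone cannot tell the two readings apart.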

Given these prob­lems, it would be nice if we did not ac­tu­ally need to learn the hu­man util­ity func­tion. I’ll ad­vo­cate that po­si­tion.

My un­der­stand­ing is that Stu­art Arm­strong is op­ti­mistic that hu­man val­ues can be in­ferred de­spite these prob­lems, be­cause we have a lot of use­ful prior in­for­ma­tion we can take ad­van­tage of.

It is intuitive that a CIRL-like agent should learn what is irrational and then “throw it out”, i.e., de-noise human preferences by looking only at what we really prefer, not at what we do out of short-sightedness or other mistakes. On the other hand, it is not so obvious that the probability/utility distinction should be handled in the same way. Should an agent disentangle beliefs from preferences just so that it can throw out human beliefs and optimize the preferences alone? I argue against this here.

Main Claim

Ig­nor­ing is­sues of ir­ra­tional­ity or bounded ra­tio­nal­ity, what an agent wants out of a helper agent is that the helper agent does preferred things.

Suppose a robot is trying to help a perfectly rational human. The human has probability function $P_H$ and utility function $U_H$. The robot is in epistemic state $e$. The robot has a set of actions $a_1, a_2, \ldots, a_n$. The proposition “the robot takes the $i$th action when in epistemic state $e$” is written as $A_i^e$. The set of full world-states is $S$. What the human would like the robot to do is given by:

$$\arg\max_i \sum_{s \in S} P_H(s \mid A_i^e) \, U_H(s)$$

(Or by the analo­gous causal coun­ter­fac­tual, if the hu­man thinks that way.)

This no­tion of what the hu­man wants is in­var­i­ant to Jeffrey-Bolker ro­ta­tion; the robot doesn’t need to dis­en­tan­gle prob­a­bil­ity and util­ity! It only needs to learn probu­til­ities.

The equa­tion writ­ten above can’t be di­rectly op­ti­mized, since the robot doesn’t have di­rect ac­cess to hu­man probu­til­ities. How­ever, I’ll broadly call any at­tempt to ap­prox­i­mate that equa­tion “policy ap­proval”.

Notice that this is closely analogous to UDT. UDT solves dynamic inconsistencies (situations in which an AI could predictably dislike the decisions of its future self) by optimizing its actions from the perspective of a fixed prior, i.e., its initial self. Policy approval resolves inconsistencies between the AI and the human by optimizing the AI’s actions from the human’s perspective. The main point of this post is that we can use this analogy to produce counterexamples to the typical value-learning approach, in which the AI tries to optimize human utility but not according to human beliefs.

I will some­what ig­nore the dis­tinc­tion be­tween UDT1.0 and UDT1.1.


Th­ese ex­am­ples serve to illus­trate that “op­ti­miz­ing hu­man util­ity ac­cord­ing to AI be­liefs” is not ex­actly the same as “do what the hu­man would want you to do”, even when we sup­pose “the hu­man util­ity func­tion” is perfectly well-defined and can be learned ex­actly by the AI.

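In these examples, I will suppose that the AI has its own probability distribution $P_R$. It reasons updatelessly with respect to evidence $e$ it sees, but with full prior knowledge of the human utility function $U_H$, choosing among the same action propositions $A_i^e$ over world-states $s \in S$:

$$\arg\max_i \sum_{s \in S} P_R(s \mid A_i^e) \, U_H(s)$$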

I use an up­date­less agent to avoid ac­cu­sa­tions that of course an up­date­ful agent would fail clas­sic UDT prob­lems. How­ever, it is not re­ally very im­por­tant for the ex­am­ples.

I assume prior knowledge of $U_H$ (the human utility function) to avoid any tricky issues which might arise by attempting to combine updatelessness with value learning.

Coun­ter­fac­tual Mugging

It seems rea­son­able to sup­pose that the AI will start out with some math­e­mat­i­cal knowl­edge. Imag­ine that the AI has a database of the­o­rems in mem­ory when it boots up, in­clud­ing the first mil­lion digits of pi. Treat these as part of the agent’s prior.

Sup­pose, on the other hand, that the hu­man which the AI wants to help does not know more than a hun­dred digits of pi.

The hu­man and the AI will dis­agree on what to do about coun­ter­fac­tual mug­ging with a log­i­cal coin in­volv­ing digits of pi which the AI knows and the hu­man does not. If Omega ap­proaches the AI, the AI will re­fuse to par­ti­ci­pate, but the hu­man will wish the AI would. If Omega ap­proaches the hu­man, the AI may try to pre­vent the hu­man from par­ti­ci­pat­ing, to the ex­tent that it can do so with­out vi­o­lat­ing other as­pects of the hu­man util­ity func­tion.
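A toy version of this disagreement can be written out, assuming the usual counterfactual-mugging setup: on “heads” Omega rewards an agent whose policy is to pay on “tails”, and on “tails” Omega asks for a payment. The payoff amounts, the 50/50 human prior, and the names `P_H` and `P_R` (for the human’s and the AI’s priors) are illustrative assumptions, not values given in the text.

```python
# Toy counterfactual mugging with a logical coin (a digit of pi).
# All payoffs and probabilities here are illustrative assumptions.

REWARD = 10_000  # paid on "heads" to an agent whose policy is to pay on "tails"
COST = 100       # what Omega asks for on "tails"

def outcome_utility(policy, coin):
    """Shared human utility of following `policy` when the coin is `coin`."""
    if coin == "heads":
        return REWARD if policy == "pay" else 0
    return -COST if policy == "pay" else 0

def best_policy(prior):
    """Updateless choice: maximize expected utility of the policy under a fixed prior."""
    def expected(policy):
        return sum(p * outcome_utility(policy, coin) for coin, p in prior.items())
    return max(["pay", "refuse"], key=expected)

P_H = {"heads": 0.5, "tails": 0.5}  # the human has not computed the digit
P_R = {"heads": 0.0, "tails": 1.0}  # the AI's prior already contains the digit

human_preferred = best_policy(P_H)
ai_chosen = best_policy(P_R)
print(human_preferred, ai_chosen)
```

Both agents use the same utility function and both reason updatelessly; the only difference is the prior. Under these assumptions the human’s prior favors paying, while the AI’s prior, which already knows the coin came up “tails”, favors refusing.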

“Too Up­date­less”

Maybe the prob­lem with the coun­ter­fac­tual mug­ging ex­am­ple is that it doesn’t make sense to pro­gram the AI with a bunch of knowl­edge in its prior which the hu­man doesn’t have.

We can go in the op­po­site ex­treme, and make a broad prior such as the Solomonoff dis­tri­bu­tion, with no in­for­ma­tion about our world in par­tic­u­lar.

I believe the observation has been made before that running UDT on such a prior could have weird results. There could be a world with higher prior probability than ours, inhabited by Omegas who ask the AI to optimize alien values in most universes (including Earth) in exchange for the Omegas maximizing human utility $U_H$ in their own world. (This specific scenario doesn’t seem particularly probable, but it does seem quite plausible that some weird universes will have higher probability than our universe in the Solomonoff prior, and may make some such bargain.)

Again, this is something which can happen in the maximization using the AI’s prior $P_R$ but not in the one using the human’s prior $P_H$, unless humans themselves would approve of the multiversal bargain.

“Just Hav­ing a Very Differ­ent Prior”

Maybe the AI’s $P_R$ is neither strictly more knowledgeable than the human’s $P_H$ nor less, but the two are very different on some specific issues. Perhaps there’s a specific plan which, when $P_R$ is conditioned on evidence so far, looks very likely to have many good consequences. $P_H$ considers the plan very likely to have many bad consequences. Also suppose that there aren’t any interesting consequences of this plan in counterfactual branches, so UDT considerations don’t come in.

Also, sup­pose that there isn’t time to test the differ­ing hy­pothe­ses in­volved which make hu­mans think this is such a bad plan while AIs think it is so good. The AI has to de­cide right now whether to en­act the plan.

The value-learn­ing agent will im­ple­ment this plan, since it seems good on net for hu­man val­ues. The policy-ap­proval agent will not, since hu­mans wouldn’t want it to.
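The two decision rules can be compared on a made-up instance of this scenario. In the sketch below, `P_R` and `P_H` stand for the AI’s and the human’s conditional beliefs about outcomes given each action, and `U_H` is the shared human utility function; the outcome names and all numbers are hypothetical.

```python
# Toy "very different prior" scenario. Outcomes, actions, and numbers
# are hypothetical; only the two decision rules come from the text.

def choose(prob_given_action, utility):
    """argmax over actions of sum_s P(s | action) * U(s)."""
    def expected(action):
        return sum(p * utility[s] for s, p in prob_given_action[action].items())
    return max(prob_given_action, key=expected)

U_H = {"good_outcome": 1.0, "status_quo": 0.0, "bad_outcome": -1.0}

# AI's beliefs: the plan very likely has good consequences.
P_R = {
    "enact_plan": {"good_outcome": 0.9, "status_quo": 0.0, "bad_outcome": 0.1},
    "do_nothing": {"good_outcome": 0.0, "status_quo": 1.0, "bad_outcome": 0.0},
}

# Human's beliefs: the same plan very likely has bad consequences.
P_H = {
    "enact_plan": {"good_outcome": 0.1, "status_quo": 0.0, "bad_outcome": 0.9},
    "do_nothing": {"good_outcome": 0.0, "status_quo": 1.0, "bad_outcome": 0.0},
}

value_learning_choice = choose(P_R, U_H)   # human utility, AI beliefs
policy_approval_choice = choose(P_H, U_H)  # human utility, human beliefs
print(value_learning_choice, policy_approval_choice)
```

With these numbers the value-learning rule enacts the plan and the policy-approval rule does not, even though both use exactly the same utility function.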

Ob­vi­ously, one might ques­tion whether it is rea­son­able to as­sume that things got to a point where there was such a large differ­ence of opinion be­tween the AI and the hu­mans, and no time to re­solve it. Ar­guably, there should be safe­guards against this sce­nario which the value-learn­ing AI it­self would want to set up, due to facts about hu­man val­ues such as “the hu­mans want to be in­volved in big de­ci­sions about their fu­ture” or the like.

Nonethe­less, faced with this situ­a­tion, it seems like policy-ap­proval agents do the right thing while value-learn­ing agents do not.


Aren’t hu­man be­liefs bad?

Isn’t it prob­le­matic to op­ti­mize via hu­man be­liefs, since hu­man be­liefs are low-qual­ity?

I think this is some­what true and some­what not.

  • Partly, this is like saying “isn’t UDT bad because it doesn’t learn?” In fact, UDT acts as if it updates most of the time, so it is wrong to think of it as incapable of learning. Similarly, although the policy-approval agent uses the human prior $P_H$, it will mostly act as if it has updated on a lot of information. So, maybe you believe human beliefs aren’t very good, but do you think we’re capable of learning almost anything eventually? If so, this may address a large component of the concern. In particular, if you trust the output of certain machine learning algorithms more than you trust yourself, the AI can run those algorithms and use their output.

  • On the other hand, humans probably have incoherent $P_H$, and not just because of logical uncertainty. So, the AI still needs to figure out what is “irrational” and what is “real” in $P_H$, just like value-learning needs to do for $U_H$.

If hu­mans would want an AI to op­ti­mize via hu­man be­liefs, won’t that be re­flected in the hu­man util­ity func­tion?

Or: If policy-approval were good, wouldn’t a value-learner self-modify into policy-approval anyway?

I don’t think this is true, but I’m not sure. Certainly there could be simple agents with whom value-learners cooperate without ever deciding to self-modify into policy-approval agents. Perhaps there is something about human preference which desires the AI to cooperate with the human even when the AI thinks this is (otherwise) net-negative for human values.

Aren’t I ig­nor­ing the fact that the AI needs its own be­liefs?

In “Just Having a Very Different Prior”, I claimed that if the AI’s beliefs $P_R$ and the human’s beliefs $P_H$ disagree about the consequences of a plan, value-learning can do something humans strongly don’t want it to do, whereas policy-approval cannot. However, my definition of policy-approval ignores learning. Realistically, the policy-approval agent needs to also have beliefs $P_R$, which it uses to approximate the human approval of its actions. Can’t the same large disagreement emerge from this?

I think the concern is qualitatively less, because the policy-approval agent uses its own beliefs $P_R$ only to estimate the human’s $P_H$ and $U_H$. If the AI knows that humans would have a large disagreement with the plan, the policy-approval agent would not implement the plan, while the value-learning agent would.

For policy-approval to go wrong, it needs to have a bad estimate of $P_H$ and $U_H$.

The policy is too big.

Even if the process of learning the human’s $P_H$ is doing the work to turn it into a coherent probability distribution (removing irrationality and making things well-defined), it still may not be able to conceive of important possibilities. The evidence $e$ which the AI uses to decide how to act, in the equations given earlier, may be a large data stream with some human-incomprehensible parts.

As a re­sult, it seems like the AI needs to op­ti­mize over com­pact/​ab­stract rep­re­sen­ta­tions of its policy, similarly to how policy se­lec­tion in log­i­cal in­duc­tors works.

This isn’t an en­tirely satis­fac­tory an­swer, since (1) the rep­re­sen­ta­tion of a policy as a com­puter pro­gram could still es­cape hu­man un­der­stand­ing, and (2) it is un­clear what it means to cor­rectly rep­re­sent the policy in a hu­man-un­der­stand­able way.


Aside from is­sues with the ap­proach, my term “policy ap­proval” may be ter­rible. It sounds too much like “ap­proval-di­rected agent”, which means some­thing differ­ent. I think there are similar­i­ties, but they aren’t strong enough to jus­tify refer­ring to both as “ap­proval”. Any sug­ges­tions?


(Th­ese are very spec­u­la­tive.)

Log­i­cal Up­date­less­ness?

One of the ma­jor ob­sta­cles to progress in de­ci­sion the­ory right now is that we don’t know of a good up­date­less per­spec­tive for log­i­cal un­cer­tainty. Maybe a policy-ap­proval agent doesn’t need to solve this prob­lem, since it tries to op­ti­mize from the hu­man per­spec­tive rather than its own. Roughly: log­i­cal up­date­less­ness is hard be­cause it tends to fall into the “too up­date­less” is­sue above. So, maybe it can be a non-is­sue in the right for­mu­la­tion of policy ap­proval.


Stuart Armstrong is somewhat pessimistic about corrigibility. Perhaps there is something which can be done in policy-approval land which can’t be done otherwise. The “Just Having a Very Different Prior” example points in this direction; it is an example where policy-approval acts in a much more corrigible way.

A value-learning agent can always resist humans if it is highly confident that its plan is a good one which humans are opposing irrationally. A policy-approval agent can think its plan is a good one but also think that humans would prefer it to be corrigible on principle regardless of that.

On the other hand, a policy-ap­proval agent isn’t guaran­teed to think that. Per­haps policy-ap­proval learn­ing can be speci­fied with some kind of highly cor­rigible bias, so that it re­quires a lot of ev­i­dence to de­cide that hu­mans don’t want it to be­have cor­rigibly in a par­tic­u­lar case?


I’ve left out some spec­u­la­tion about what policy-ap­proval agents should ac­tu­ally look like, for the sake of keep­ing mostly to the point (the dis­cus­sion with Stu­art). I like this idea be­cause it in­volves a change in per­spec­tive of what an agent should be, similar to the change which UDT it­self made.