Policy Approval

(ETA: The name “policy approval” wasn’t great. I think I will use the term “policy alignment” to contrast with “value alignment” going forward, at the suggestion of Wei Dai in the comments.)

I recently had a conversation with Stuart Armstrong in which I claimed that an agent which learns your utility function (pretending for a moment that “your utility function” really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you. He challenged me to write up specific examples to back up my claims.

I’ll also give a very sketchy alternative to value learning, which I call policy approval. (The policy approval idea emerged out of a conversation with Andrew Critch.)


Stuart Armstrong has recently been doing work showing the difficulty of inferring human values. To summarize: because humans are irrational, a value-learning approach like CIRL needs to jointly estimate the human utility function and the degree to which the human is rational; otherwise, it would take all the mistakes humans make to be preferences. Unfortunately, this leads to a severe problem of identifiability: humans can be assigned any values whatsoever if we assume the right kind of irrationality, and the usual trick of preferring simpler hypotheses doesn’t seem to help in this case.

I also want to point out that a similar problem arises even without irrationality. Vladimir Nesov explored how probability and utility can be mixed into each other without changing any decisions an agent makes. So, in principle, we can’t determine the utility or probability function of an agent uniquely based on the agent’s behavior alone (even including hypothetical behavior in counterfactual situations). This fact was discovered earlier by Jeffrey and Bolker, and is analyzed in more detail in the book The Logic of Decision. For this reason, I call the transform “Jeffrey-Bolker rotation”.

To give an illustrative example: it doesn’t matter whether we assign very low probability to an event, or care very little about what happens given that event. Suppose a love-maximizing agent is unable to assign nonzero utility to a universe where love isn’t real. The agent may appear to ignore evidence that love isn’t real. We can interpret this as not caring what happens conditioned on love not being real; or, equally valid (in terms of the actions which the agent chooses), we can interpret the agent as having an extremely low prior probability on love not being real.

At MIRI, we sometimes use the term “probutility” to indicate the (probability, utility) pair in a way which reminds us that the two can’t be disentangled from one another. Jeffrey-Bolker rotation changes probabilities and utilities, but does not change the overall probutilities.
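To make this concrete, here is a toy sketch in Python (all numbers invented; this is one simple instance of the kind of trade-off involved, not the general Jeffrey-Bolker transform). It scales probability down and utility up on the “love is real” states, and does the reverse on the “love isn’t real” states, keeping every probutility \(P(s \mid a) \cdot U(s)\) fixed. The two agents then assign identical expected utility to every action, and so make identical decisions.

```python
# Toy trade-off between probability and utility.  All numbers are invented.

def expected_utility(p_cond, u):
    """E[U | a] = sum over states s of P(s | a) * U(s), for each action a."""
    return {a: sum(dist[s] * u[s] for s in dist) for a, dist in p_cond.items()}

# World-states are pairs: (is love real?, how the action turned out).
# Agent 1 thinks love is almost certainly real (probability 0.99).
p1 = {
    "write_poem": {("real", "good"): 0.792, ("real", "bad"): 0.198,
                   ("fake", "good"): 0.008, ("fake", "bad"): 0.002},
    "hoard_gold": {("real", "good"): 0.099, ("real", "bad"): 0.891,
                   ("fake", "good"): 0.001, ("fake", "bad"): 0.009},
}
u1 = {("real", "good"): 10.0, ("real", "bad"): 0.0,
      ("fake", "good"): 5.0, ("fake", "bad"): 1.0}

# Agent 2: the same agent after a "rotation".  It thinks love is only 50%
# likely, but cares proportionally less about the love-free worlds, so
# every product P(s|a) * U(s) is unchanged.
r, q = 0.5 / 0.99, 0.5 / 0.01   # scaling factors for "real" / "fake" states
p2 = {a: {s: pr * (r if s[0] == "real" else q) for s, pr in dist.items()}
      for a, dist in p1.items()}
u2 = {s: u / (r if s[0] == "real" else q) for s, u in u1.items()}

eu1, eu2 = expected_utility(p1, u1), expected_utility(p2, u2)
# eu1 == eu2 (up to rounding), so both versions choose "write_poem".
```

Observing behavior alone, there is no way to tell which of the two (probability, utility) pairs the agent “really” has.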

Given these problems, it would be nice if we did not actually need to learn the human utility function. I’ll advocate that position.

My understanding is that Stuart Armstrong is optimistic that human values can be inferred despite these problems, because we have a lot of useful prior information we can take advantage of.

It is intuitive that a CIRL-like agent should learn what is irrational and then “throw it out”, IE, de-noise human preferences by looking only at what we really prefer, not at what we mistakenly do out of short-sightedness or other mistakes. On the other hand, it is not so obvious that the probability/utility distinction should be handled in the same way. Should an agent disentangle beliefs from preferences just so that it can throw out human beliefs and optimize the preferences alone? I argue against this here.

Main Claim

Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.

Suppose a robot is trying to help a perfectly rational human. The human has probability function \(P_H\) and utility function \(U_H\). The robot is in epistemic state \(e\). The robot has a set of actions \(A\). The proposition “the robot takes the \(i\)th action when in epistemic state \(e\)” is written as \(a_i^e\). The set of full world-states is \(S\). What the human would like the robot to do is given by:

\[\underset{a_i \in A}{\operatorname{argmax}}\ \sum_{s \in S} P_H(s \mid a_i^e)\, U_H(s)\]

(Or by the analogous causal counterfactual, if the human thinks that way.)

This notion of what the human wants is invariant to Jeffrey-Bolker rotation; the robot doesn’t need to disentangle probability and utility! It only needs to learn probutilities.

The equation written above can’t be directly optimized, since the robot doesn’t have direct access to human probutilities. However, I’ll broadly call any attempt to approximate that equation “policy approval”.
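As a minimal sketch of the selection rule itself, with hypothetical numbers standing in for the human probutilities (which, in reality, the robot could only estimate):

```python
# Sketch of the policy-approval selection rule: choose the action which
# the *human's* beliefs and values rank highest.  Numbers are hypothetical.

def policy_approval_choice(p_h, u_h):
    """argmax over actions a of sum_s P_H(s | robot does a) * U_H(s)."""
    def human_score(a):
        return sum(p_h[a][s] * u_h[s] for s in p_h[a])
    return max(p_h, key=human_score)

# Hypothetical human probutilities for two robot actions in the robot's
# current epistemic state e:
p_h = {"fetch_coffee": {"happy": 0.9, "annoyed": 0.1},
       "do_nothing":   {"happy": 0.2, "annoyed": 0.8}}
u_h = {"happy": 1.0, "annoyed": -1.0}

best = policy_approval_choice(p_h, u_h)   # "fetch_coffee"
```

Note that only the products \(P_H(s \mid a_i^e) \cdot U_H(s)\) matter here, which is why the rule is invariant to Jeffrey-Bolker rotation.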

Notice that this is closely analogous to UDT. UDT solves dynamic inconsistencies (situations in which an AI could predictably dislike the decisions of its future self) by optimizing its actions from the perspective of a fixed prior, IE, its initial self. Policy approval resolves inconsistencies between the AI and the human by optimizing the AI’s actions from the human’s perspective. The main point of this post is that we can use this analogy to produce counterexamples to the typical value-learning approach, in which the AI tries to optimize human utility but not according to human beliefs.

I will somewhat ignore the distinction between UDT1.0 and UDT1.1.


These examples serve to illustrate that “optimizing human utility according to AI beliefs” is not exactly the same as “do what the human would want you to do”, even when we suppose “the human utility function” is perfectly well-defined and can be learned exactly by the AI.

In these examples, I will suppose that the AI has its own probability distribution \(P_{AI}\). It reasons updatelessly with respect to the evidence \(e\) it sees, but with full prior knowledge of the human utility function:

\[\underset{a_i \in A}{\operatorname{argmax}}\ \sum_{s \in S} P_{AI}(s \mid a_i^e)\, U_H(s)\]

I use an updateless agent to avoid accusations that of course an updateful agent would fail classic UDT problems. However, it is not really very important for the examples.

I assume prior knowledge of \(U_H\) to avoid any tricky issues which might arise by attempting to combine updatelessness with value learning.

Counterfactual Mugging

It seems reasonable to suppose that the AI will start out with some mathematical knowledge. Imagine that the AI has a database of theorems in memory when it boots up, including the first million digits of pi. Treat these as part of the agent’s prior.

Suppose, on the other hand, that the human which the AI wants to help does not know more than a hundred digits of pi.

The human and the AI will disagree on what to do about counterfactual mugging with a logical coin involving digits of pi which the AI knows and the human does not. If Omega approaches the AI, the AI will refuse to participate, but the human will wish the AI would. If Omega approaches the human, the AI may try to prevent the human from participating, to the extent that it can do so without violating other aspects of the human utility function.
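To see the disagreement numerically, here is a sketch using the payoffs usually quoted for the thought experiment: Omega pays $10,000 if a logical coin (say, the parity of a distant digit of pi) came up heads and you would have paid $100 on tails.

```python
# Counterfactual mugging with a logical coin.  Omega pays 10_000 on heads
# iff you would have paid 100 on tails; the question is whether to adopt
# the policy of paying.

def policy_value(pays, p_heads):
    """Expected payoff of committing to the policy, under credence p_heads."""
    return p_heads * (10_000 if pays else 0) + (1 - p_heads) * (-100 if pays else 0)

# The human doesn't know the relevant digit, so treats the coin as fair:
human_pay, human_refuse = policy_value(True, 0.5), policy_value(False, 0.5)
# 4950.0 vs 0.0: from the human's perspective, the AI should participate.

# The AI's prior already contains the digit; suppose it entails tails:
ai_pay, ai_refuse = policy_value(True, 0.0), policy_value(False, 0.0)
# -100.0 vs 0.0: maximizing under its own prior, the AI refuses.
```

The AI is being perfectly updateless with respect to its evidence; the conflict arises because the coin’s outcome sits in the AI’s prior but not in the human’s.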

“Too Updateless”

Maybe the problem with the counterfactual mugging example is that it doesn’t make sense to program the AI with a bunch of knowledge in its prior which the human doesn’t have.

We can go in the opposite extreme, and make a broad prior such as the Solomonoff distribution, with no information about our world in particular.

I believe the observation has been made before that running UDT on such a prior could have weird results. There could be a world with higher prior probability than ours, inhabited by Omegas who ask the AI to optimize alien values in most universes (including Earth) in exchange for the Omegas maximizing human values in their own world. (This particular scenario doesn’t seem particularly probable, but it does seem quite plausible that some weird universes will have higher probability than our universe in the Solomonoff prior, and may make some such bargain.)

Again, this is something which can happen in the maximization using \(P_{AI}\), but not in the one using \(P_H\), unless humans themselves would approve of the multiversal bargain.

“Just Having a Very Different Prior”

Maybe \(P_{AI}\) is neither strictly more knowledgeable than \(P_H\) nor less, but the two are very different on some specific issues. Perhaps there’s a specific plan which, when \(P_{AI}\) is conditioned on the evidence so far, looks very likely to have many good consequences. \(P_H\) considers the plan very likely to have many bad consequences. Also suppose that there aren’t any interesting consequences of this plan in counterfactual branches, so UDT considerations don’t come in.

Also, suppose that there isn’t time to test the differing hypotheses involved which make humans think this is such a bad plan while the AI thinks it is so good. The AI has to decide right now whether to enact the plan.

The value-learning agent will implement this plan, since it seems good on net for human values. The policy-approval agent will not, since humans wouldn’t want it to.
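With invented numbers, the divergence looks like this: both agents optimize the same \(U_H\), but one scores the plan under \(P_{AI}\) and the other under \(P_H\).

```python
# Value learning vs. policy approval when the priors disagree about a plan.
# Probabilities and utilities are invented for illustration.

def eu(dist, u):
    """Expected utility of an action's outcome distribution."""
    return sum(pr * u[s] for s, pr in dist.items())

u_h = {"good": 10.0, "bad": -10.0, "status_quo": 0.0}

p_ai = {"enact_plan": {"good": 0.95, "bad": 0.05},   # AI: plan looks great
        "wait":       {"status_quo": 1.0}}
p_h  = {"enact_plan": {"good": 0.05, "bad": 0.95},   # human: plan looks awful
        "wait":       {"status_quo": 1.0}}

# Value learning: optimize the human utility function under AI beliefs.
value_learning = max(p_ai, key=lambda a: eu(p_ai[a], u_h))   # "enact_plan"
# Policy approval: optimize the same utilities under the human's beliefs.
policy_approval = max(p_h, key=lambda a: eu(p_h[a], u_h))    # "wait"
```

The disagreement is entirely in the beliefs; handing the AI a perfect copy of \(U_H\) does nothing to resolve it.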

Obviously, one might question whether it is reasonable to assume that things got to a point where there was such a large difference of opinion between the AI and the humans, and no time to resolve it. Arguably, there should be safeguards against this scenario which the value-learning AI itself would want to set up, due to facts about human values such as “the humans want to be involved in big decisions about their future” or the like.

Nonetheless, faced with this situation, it seems like policy-approval agents do the right thing while value-learning agents do not.


Aren’t human beliefs bad?

Isn’t it problematic to optimize via human beliefs, since human beliefs are low-quality?

I think this is somewhat true and somewhat not.

  • Partly, this is like saying “isn’t UDT bad because it doesn’t learn?” Actually, UDT acts as if it updates most of the time, so it is wrong to think of it as incapable of learning. Similarly, although the policy-approval agent uses \(P_H\), it will mostly act as if it has updated on a lot of information. So, maybe you believe human beliefs aren’t very good; but do you think we’re capable of learning almost anything eventually? If so, this may address a large component of the concern. In particular, if you trust the output of certain machine learning algorithms more than you trust yourself, the AI can run those algorithms and use their output.

  • On the other hand, humans probably have an incoherent \(P_H\), and not just because of logical uncertainty. So, the AI still needs to figure out what is “irrational” and what is “real” in \(P_H\), just like value learning needs to do for \(U_H\).

If humans would want an AI to optimize via human beliefs, won’t that be reflected in the human utility function?

Or: if policy approval were good, wouldn’t a value-learner self-modify into a policy-approval agent anyway?

I don’t think this is true, but I’m not sure. Certainly there could be simple agents whom value-learners cooperate with without ever deciding to self-modify into policy-approval agents. Perhaps there is something about human preference which desires the AI to cooperate with the human even when the AI thinks this is (otherwise) net-negative for human values.

Aren’t I ignoring the fact that the AI needs its own beliefs?

In “Just Having a Very Different Prior”, I claimed that if \(P_{AI}\) and \(P_H\) disagree about the consequences of a plan, value learning can do something humans strongly don’t want it to do, whereas policy approval cannot. However, my definition of policy approval ignores learning. Realistically, the policy-approval agent needs to also have beliefs \(P_{AI}\), which it uses to approximate the human approval of its actions. Can’t the same large disagreement emerge from this?

I think the concern is qualitatively less, because the policy-approval agent uses \(P_{AI}\) only to estimate \(P_H\) and \(U_H\). If the AI knows that humans would have a large disagreement with the plan, the policy-approval agent would not implement the plan, while the value-learning agent would.

For policy approval to go wrong, it needs to have a bad estimate of \(P_H\) and \(U_H\).

The policy is too big.

Even if the process of learning \(P_H\) is doing the work to turn it into a coherent probability distribution (removing irrationality and making things well-defined), it still may not be able to conceive of important possibilities. The evidence \(e\) which the AI uses to decide how to act, in the equations given earlier, may be a large data stream with some human-incomprehensible parts.

As a result, it seems like the AI needs to optimize over compact/abstract representations of its policy, similarly to how policy selection in logical inductors works.

This isn’t an entirely satisfactory answer, since (1) the representation of a policy as a computer program could still escape human understanding, and (2) it is unclear what it means to correctly represent the policy in a human-understandable way.


Aside from issues with the approach, my term “policy approval” may be terrible. It sounds too much like “approval-directed agent”, which means something different. I think there are similarities, but they aren’t strong enough to justify referring to both as “approval”. Any suggestions?


(These are very speculative.)

Logical Updatelessness?

One of the major obstacles to progress in decision theory right now is that we don’t know of a good updateless perspective for logical uncertainty. Maybe a policy-approval agent doesn’t need to solve this problem, since it tries to optimize from the human perspective rather than its own. Roughly: logical updatelessness is hard because it tends to fall into the “too updateless” issue above. So, maybe it can be a non-issue in the right formulation of policy approval.


Stuart Armstrong is somewhat pessimistic about corrigibility. Perhaps there is something which can be done in policy-approval land which can’t be done otherwise. The “Just Having a Very Different Prior” example points in this direction; it is an example where policy approval acts in a much more corrigible way.

A value-learning agent can always resist humans if it is highly confident that its plan is a good one which humans are opposing irrationally. A policy-approval agent can think its plan is a good one, but also think that humans would prefer it to be corrigible on principle regardless of that.

On the other hand, a policy-approval agent isn’t guaranteed to think that. Perhaps policy-approval learning can be specified with some kind of highly corrigible bias, so that it requires a lot of evidence to decide that humans don’t want it to behave corrigibly in a particular case?


I’ve left out some speculation about what policy-approval agents should actually look like, for the sake of keeping mostly to the point (the discussion with Stuart). I like this idea because it involves a change in perspective of what an agent should be, similar to the change which UDT itself made.