Policy Approval

(ETA: The name “policy approval” wasn’t great. I think I will use the term “policy alignment” to contrast with “value alignment” going forward, at the suggestion of Wei Dai in the comments.)

I recently had a conversation with Stuart Armstrong in which I claimed that an agent which learns your utility function (pretending for a moment that “your utility function” really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you. He challenged me to write up specific examples to back up my claims.

I’ll also give a very sketchy alternative to value learning, which I call policy approval. (The policy approval idea emerged out of a conversation with Andrew Critch.)

Background

Stuart Armstrong has recently been doing work showing the difficulty of inferring human values. To summarize: because humans are irrational, a value-learning approach like CIRL needs to jointly estimate the human utility function and the degree to which the human is rational—otherwise, it would interpret all the mistakes humans make as preferences. Unfortunately, this leads to a severe problem of identifiability: humans can be assigned any values whatsoever if we assume the right kind of irrationality, and the usual trick of preferring simpler hypotheses doesn’t seem to help in this case.

I also want to point out that a similar problem arises even without irrationality. Vladimir Nesov explored how probability and utility can be mixed into each other without changing any decisions an agent makes. So, in principle, we can’t determine the utility or probability function of an agent uniquely based on the agent’s behavior alone (even including hypothetical behavior in counterfactual situations). This fact was discovered earlier by Jeffrey and Bolker, and is analyzed in more detail in the book The Logic of Decision. For this reason, I call the transform “Jeffrey-Bolker rotation”.

To give an illustrative example: it doesn’t matter whether we assign very low probability to an event, or care very little about what happens given that event. Suppose a love-maximizing agent is unable to assign nonzero utility to a universe where love isn’t real. The agent may appear to ignore evidence that love isn’t real. We can interpret this as not caring what happens conditioned on love not being real; or, equally valid (in terms of the actions which the agent chooses), we can interpret the agent as having an extremely low prior probability on love not being real.

At MIRI, we sometimes use the term “probutility” to indicate the probability/utility pair in a way which reminds us that they can’t be disentangled from one another. Jeffrey-Bolker rotation changes probabilities and utilities, but does not change the overall probutilities.
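To make the rotation concrete, here is a minimal sketch (the utilities, distributions, and particular transform parameters are mine, chosen purely for illustration). One member of the Jeffrey-Bolker family reweights each state’s probability by $U(s)+2$ and replaces $U(s)$ with $U(s)/(U(s)+2)$; every expected value $V$ then becomes $V/(V+2)$, a monotone transform, so the two (probability, utility) pairs rank all actions identically:

```python
import numpy as np

# A concrete Jeffrey-Bolker rotation (fractional-linear parameters
# a=1, b=0, c=1, d=2; requires U(s) + 2 > 0 on every state):
#   U'(s) = U(s) / (U(s) + 2)
#   P'(s | act) proportional to P(s | act) * (U(s) + 2)
# Expected values transform as V' = V / (V + 2), preserving the ordering.

U = np.array([1.0, -1.0, 3.0, 0.0])  # toy utilities over four world-states

acts = {  # toy conditional distributions P(s | act)
    "act_a": np.array([0.70, 0.10, 0.10, 0.10]),
    "act_b": np.array([0.10, 0.10, 0.70, 0.10]),
    "act_c": np.array([0.25, 0.25, 0.25, 0.25]),
}

def rotate(p_cond, u):
    """Return the rotated pair (P', U') for one conditional distribution."""
    weights = u + 2.0
    p_new = p_cond * weights
    return p_new / p_new.sum(), u / weights

for name, p_cond in acts.items():
    v = float(p_cond @ U)   # expected value before rotation
    p2, u2 = rotate(p_cond, U)
    v2 = float(p2 @ u2)     # after rotation: equals v / (v + 2)
    print(f"{name}: V = {v:.3f}, V' = {v2:.3f}")
```

Both agents here prefer act_b to act_a to act_c, despite disagreeing state-by-state about probabilities and utilities; no behavioral test distinguishes them.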

Given these problems, it would be nice if we did not actually need to learn the human utility function. I’ll advocate that position.

My understanding is that Stuart Armstrong is optimistic that human values can be inferred despite these problems, because we have a lot of useful prior information we can take advantage of.

It is intuitive that a CIRL-like agent should learn what is irrational and then “throw it out”, IE, de-noise human preferences by looking only at what we really prefer, not at what we mistakenly do out of short-sightedness or other mistakes. On the other hand, it is not so obvious that the probability/utility distinction should be handled in the same way. Should an agent disentangle beliefs from preferences just so that it can throw out human beliefs and optimize the preferences alone? I argue against this here.

Main Claim

Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.

Suppose a robot is trying to help a perfectly rational human. The human has probability function $P_H$ and utility function $U_H$. The robot is in epistemic state $e$. The robot has a set of actions $a_1, \dots, a_n$. The proposition “the robot takes the $i$th action when in epistemic state $e$” is written as $A_i^e$. The set of full world-states is $S$. What the human would like the robot to do is given by:

$$\underset{a_i}{\operatorname{argmax}} \sum_{s \in S} P_H(s \mid A_i^e) \, U_H(s)$$

(Or by the analogous causal counterfactual, if the human thinks that way.)

This notion of what the human wants is invariant to Jeffrey-Bolker rotation; the robot doesn’t need to disentangle probability and utility! It only needs to learn probutilities.

The equation written above can’t be directly optimized, since the robot doesn’t have direct access to human probutilities. However, I’ll broadly call any attempt to approximate that equation “policy approval”.
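As a minimal sketch of the selection rule itself, assuming (unrealistically) that we are handed black-box estimates of $P_H$ and $U_H$; all of the real difficulty lives inside those estimates:

```python
from typing import Callable, Hashable, Iterable

def policy_approval_action(
    actions: Iterable[Hashable],
    states: Iterable[Hashable],
    p_human: Callable[[Hashable, Hashable], float],  # estimate of P_H(s | A_i^e)
    u_human: Callable[[Hashable], float],            # estimate of U_H(s)
) -> Hashable:
    """Pick the action the human would most want taken in this epistemic state.

    Actions are scored under the *human's* beliefs, so the rule is invariant
    to Jeffrey-Bolker rotation: rotating (P_H, U_H) applies a monotone
    transform to every score, leaving the argmax unchanged.
    """
    state_list = list(states)

    def human_expected_utility(a: Hashable) -> float:
        return sum(p_human(s, a) * u_human(s) for s in state_list)

    return max(actions, key=human_expected_utility)
```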

Notice that this is closely analogous to UDT. UDT solves dynamic inconsistencies—situations in which an AI could predictably dislike the decisions of its future self—by optimizing its actions from the perspective of a fixed prior, IE, its initial self. Policy approval resolves inconsistencies between the AI and the human by optimizing the AI’s actions from the human’s perspective. The main point of this post is that we can use this analogy to produce counterexamples to the typical value-learning approach, in which the AI tries to optimize human utility but not according to human beliefs.

I will somewhat ignore the distinction between UDT1.0 and UDT1.1.

Examples

These examples serve to illustrate that “optimizing human utility according to AI beliefs” is not exactly the same as “do what the human would want you to do”, even when we suppose “the human utility function” is perfectly well-defined and can be learned exactly by the AI.

In these examples, I will suppose that the AI has its own probability distribution $P_R$. It reasons updatelessly with respect to the evidence $e$ it sees, but with full prior knowledge of the human utility function:

$$\underset{a_i}{\operatorname{argmax}} \sum_{s \in S} P_R(s \mid A_i^e) \, U_H(s)$$

I use an updateless agent to avoid accusations that of course an updateful agent would fail classic UDT problems. However, it is not really very important for the examples.

I assume prior knowledge of $U_H$ to avoid any tricky issues which might arise by attempting to combine updatelessness with value learning.

Counterfactual Mugging

It seems reasonable to suppose that the AI will start out with some mathematical knowledge. Imagine that the AI has a database of theorems in memory when it boots up, including the first million digits of pi. Treat these as part of the agent’s prior.

Suppose, on the other hand, that the human which the AI wants to help does not know more than a hundred digits of pi.

The human and the AI will disagree on what to do about counterfactual mugging with a logical coin involving digits of pi which the AI knows and the human does not. If Omega approaches the AI, the AI will refuse to participate, but the human will wish the AI would. If Omega approaches the human, the AI may try to prevent the human from participating, to the extent that it can do so without violating other aspects of the human utility function.
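To see the disagreement in numbers (the payoffs here are toy values of my own, not anything from the setup above): say Omega asks the agent for $100 if some distant digit of pi is even, and pays $10,000 if the digit is odd and it predicted the agent would have paid in the even case. The policy “pay when asked” looks great under the human’s prior, which puts 50/50 on the digit, and terrible under the AI’s prior, which settles the digit outright:

```python
# Expected value of the policy "pay when asked", evaluated from the prior,
# i.e. before the logical coin is "revealed". Payoffs are illustrative.
def ev_pay(p_even: float) -> float:
    return p_even * (-100) + (1 - p_even) * 10_000

print(ev_pay(0.5))  # human prior, digit unknown:    4950.0 -> wants the policy
print(ev_pay(1.0))  # AI prior, digit known (even):  -100.0 -> refuses
```

A value-learning agent maximizing $U_H$ under $P_R$ computes the second line; a policy-approval agent approximating the human’s evaluation computes the first.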

“Too Updateless”

Maybe the problem with the counterfactual mugging example is that it doesn’t make sense to program the AI with a bunch of knowledge in its prior which the human doesn’t have.

We can go in the opposite extreme, and make $P_R$ a broad prior such as the Solomonoff distribution, with no information about our world in particular.

I believe the observation has been made before that running UDT on such a prior could have weird results. There could be a world with higher prior probability than ours, inhabited by Omegas who ask the AI to optimize alien values in most universes (including Earth) in exchange for the Omegas maximizing $U_H$ in their own world. (This particular scenario doesn’t seem particularly probable, but it does seem quite plausible that some weird universes will have higher probability than our universe in the Solomonoff prior, and may make some such bargain.)

Again, this is something which can happen in the maximization using $P_R$, but not in the one using $P_H$—unless humans themselves would approve of the multiversal bargain.

“Just Having a Very Different Prior”

Maybe $P_R$ is neither strictly more knowledgeable than $P_H$ nor less, but the two are very different on some specific issues. Perhaps there’s a specific plan which, when $P_R$ is conditioned on the evidence so far, looks very likely to have many good consequences. $P_H$ considers the plan very likely to have many bad consequences. Also suppose that there aren’t any interesting consequences of this plan in counterfactual branches, so UDT considerations don’t come in.

Also, suppose that there isn’t time to test the differing hypotheses involved which make humans think this is such a bad plan while AIs think it is so good. The AI has to decide right now whether to enact the plan.

The value-learning agent will implement this plan, since it seems good on net for human values. The policy-approval agent will not, since humans wouldn’t want it to.
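A toy calculation (probabilities and utilities invented for the purpose) shows how the same plan can clear the value-learner’s bar while failing the policy-approval agent’s:

```python
# One plan, two sharply disagreeing priors, and no time to resolve the dispute.
U_H = {"good": 100, "bad": -1000, "status_quo": 0}  # human utility, known to both

P_R = {"good": 0.95, "bad": 0.05}  # AI's beliefs: the plan looks great
P_H = {"good": 0.05, "bad": 0.95}  # human's beliefs: the plan looks awful

def ev_enact(p):
    return p["good"] * U_H["good"] + p["bad"] * U_H["bad"]

# Value learner: maximizes U_H under its own beliefs P_R -> EV = 45, so it enacts.
print("value learner:", "enact" if ev_enact(P_R) > U_H["status_quo"] else "refrain")
# Policy approval: scores the action under P_H -> EV = -945, so it refrains.
print("policy approval:", "enact" if ev_enact(P_H) > U_H["status_quo"] else "refrain")
```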

Obviously, one might question whether it is reasonable to assume that things got to a point where there was such a large difference of opinion between the AI and the humans, and no time to resolve it. Arguably, there should be safeguards against this scenario which the value-learning AI itself would want to set up, due to facts about human values such as “the humans want to be involved in big decisions about their future” or the like.

Nonetheless, faced with this situation, it seems like policy-approval agents do the right thing while value-learning agents do not.

Issues/Objections

Isn’t it problematic to optimize via human beliefs, since human beliefs are low-quality?

I think this is somewhat true and somewhat not.

• Partly, this is like saying “isn’t UDT bad because it doesn’t learn?”—actually, UDT acts as if it updates most of the time, so it is wrong to think of it as incapable of learning. Similarly, although the policy-approval agent uses $P_H$, it will mostly act as if it has updated on a lot of information. So, maybe you believe human beliefs aren’t very good—but do you think we’re capable of learning almost anything eventually? If so, this may address a large component of the concern. In particular, if you trust the output of certain machine learning algorithms more than you trust yourself, the AI can run those algorithms and use their output.

• On the other hand, humans probably have an incoherent $P_H$, and not just because of logical uncertainty. So, the AI still needs to figure out what is “irrational” and what is “real” in $P_H$, just like value-learning needs to do for $U_H$.

If humans would want an AI to optimize via human beliefs, won’t that be reflected in the human utility function?

Or: if policy-approval were good, wouldn’t a value-learner self-modify into policy-approval anyway?

I don’t think this is true, but I’m not sure. Certainly there could be simple agents with whom value-learners cooperate without ever deciding to self-modify into policy-approval agents. Perhaps there is something about human preference which desires the AI to cooperate with the human even when the AI thinks this is (otherwise) net-negative for human values.

Aren’t I ignoring the fact that the AI needs its own beliefs?

In “Just Having a Very Different Prior”, I claimed that if $P_H$ and $P_R$ disagree about the consequences of a plan, value-learning can do something humans strongly don’t want it to do, whereas policy-approval cannot. However, my definition of policy-approval ignores learning. Realistically, the policy-approval agent needs to also have beliefs $P_R$, which it uses to approximate the human approval of its actions. Can’t the same large disagreement emerge from this?

I think the concern is qualitatively less, because the policy-approval agent uses $P_R$ only to estimate $P_H$ and $U_H$. If the AI knows that humans would have a large disagreement with the plan, the policy-approval agent would not implement the plan, while the value-learning agent would.

For policy-approval to go wrong, it needs to have a bad estimate of $P_H$ and $U_H$.

The policy is too big.

Even if the process of learning $P_H$ is doing the work to turn it into a coherent probability distribution (removing irrationality and making things well-defined), it still may not be able to conceive of important possibilities. The evidence $e$ which the AI uses to decide how to act, in the equations given earlier, may be a large data stream with some human-incomprehensible parts.

As a result, it seems like the AI needs to optimize over compact/abstract representations of its policy, similarly to how policy selection in logical inductors works. A schematic sketch follows below.

This isn’t an entirely satisfactory answer, since (1) the representation of a policy as a computer program could still escape human understanding, and (2) it is unclear what it means to correctly represent the policy in a human-understandable way.
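Schematically (and only schematically: the candidate set, the description format, and the scoring function below are all placeholders of my own), selecting over compact policy representations rather than over raw observation-to-action mappings might look like:

```python
from typing import Callable, List

# Placeholder type: a compact policy description, e.g. source code for a
# small program mapping observations to actions.
PolicyDescription = str

def select_policy(
    candidates: List[PolicyDescription],
    estimated_human_score: Callable[[PolicyDescription], float],
) -> PolicyDescription:
    """Choose among *descriptions* of policies, not full policies.

    `estimated_human_score` stands in for an estimate of the human's
    probutility of the AI running the described policy. Both worries above
    live inside these placeholders: a candidate program may still escape
    human understanding, and the score may misrepresent what the policy does.
    """
    return max(candidates, key=estimated_human_score)
```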

Terminology

Aside from issues with the approach, my term “policy approval” may be terrible. It sounds too much like “approval-directed agent”, which means something different. I think there are similarities, but they aren’t strong enough to justify referring to both as “approval”. Any suggestions?

(These are very speculative.)

Logical Updatelessness?

One of the major obstacles to progress in decision theory right now is that we don’t know of a good updateless perspective for logical uncertainty. Maybe a policy-approval agent doesn’t need to solve this problem, since it tries to optimize from the human perspective rather than its own. Roughly: logical updatelessness is hard because it tends to fall into the “too updateless” issue above. So, maybe it can be a non-issue in the right formulation of policy approval.

Corrigibility?

Stuart Armstrong is somewhat pessimistic about corrigibility. Perhaps there is something which can be done in policy-approval land which can’t be done otherwise. The “Just Having a Very Different Prior” example points in this direction; it is an example where policy-approval acts in a much more corrigible way.

A value-learning agent can always resist humans if it is highly confident that its plan is a good one which humans are opposing irrationally. A policy-approval agent can think its plan is a good one but also think that humans would prefer it to be corrigible on principle regardless of that.

On the other hand, a policy-approval agent isn’t guaranteed to think that. Perhaps policy-approval learning can be specified with some kind of highly corrigible bias, so that it requires a lot of evidence to decide that humans don’t want it to behave corrigibly in a particular case?

Conclusion

I’ve left out some speculation about what policy-approval agents should actually look like, for the sake of keeping mostly to the point (the discussion with Stuart). I like this idea because it involves a change in perspective of what an agent should be, similar to the change which UDT itself made.

• I’m confused. If the AI knows a million digits of pi, and it can prevent Omega from counterfactually mugging me where it knows I will lose money… shouldn’t it try to prevent that from happening? That seems like the right behavior to me. Similarly, if I knew that the AI knows a million digits of pi, then if it gets counterfactually mugged, it shouldn’t give up the money.

(Perhaps the argument is that as long as Omega was uncertain about the digit when deciding what game to propose, then you should pay up as necessary, regardless of what you know. But if that’s the argument, then why can’t the AI go through the same reasoning?)

Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.

If the AI knows the winning numbers for the lottery, then it should buy that ticket for me, even though (if I don’t know that the AI knows the winning numbers) I would disprefer that action. Even better would be if it explained to me what it was doing, after which I would prefer the action, but let’s say that wasn’t possible for some reason (maybe it performed a very complex simulation of the world to figure out the winning number).

It seems like if the AI knows my utility function and is optimizing it, that does perform well. Now for practical reasons, we probably want to instead build an AI that does what we prefer it to do, but this seems to be because it would be hard to learn the right utility function, and errors along the way could lead to catastrophe, not because it would be bad for the AI to optimize the right utility function.

ETA: My strawman-ML-version of your argument is that you would prefer imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things). This seems wrong to me.

• I’m confused. If the AI knows a million digits of pi, and it can prevent Omega from counterfactually mugging me where it knows I will lose money… shouldn’t it try to prevent that from happening? That seems like the right behavior to me. Similarly, if I knew that the AI knows a million digits of pi, then if it gets counterfactually mugged, it shouldn’t give up the money.

If you don’t think one should pay up in counterfactual mugging in general, then my argument won’t land. Rather than arguing that you want to be counterfactually mugged, I’ll try to argue from a different decision problem.

Suppose that Omega is running a fairly simple and quick algorithm which is nonetheless able to predict an AI with more processing power, due to using a stronger logic or similar tricks. Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10.

We suppose that, since there is a short proof of exactly what Omega does, it is already present in the mathematical database included in the AI’s prior.

If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is—taking less money just has a lower expected utility. So, it will get only $10 from Omega.

If the AI is a policy-approval agent, it will think about what would have a higher expectation under the human’s beliefs: taking half, or taking it all. It’s quite possible in this case that it takes all the money.

(Perhaps the argument is that as long as Omega was uncertain about the digit when deciding what game to propose, then you should pay up as necessary, regardless of what you know. But if that’s the argument, then why can’t the AI go through the same reasoning?)

That is part of the argument for paying up in counterfactual mugging, yes. But both we and Omega need to be uncertain about the digit, since if our prior can already predict that Omega is going to ask us for $10 rather than give us any money, there’s no reason for us to pay up. So, it depends on the prior, and can turn out differently depending on whether our prior or the agent’s is used.

If the AI knows the winning numbers for the lottery, then it should buy that ticket for me, even though (if I don’t know that the AI knows the winning numbers) I would disprefer that action.

If I think that the AI tends to be miscalibrated about lottery-ticket beliefs, there is no reason for me to want it to buy the ticket. If I think it is calibrated about lottery-ticket beliefs, I’ll like the policy of buying lottery tickets in such cases, so the AI will buy.

You could argue that an AI which is trying to be helpful will buy lottery tickets in such cases no matter how deluded the humans think it is. But, not only is this not very corrigible behavior, but also it doesn’t make any sense from our perspective to make an AI reason in that way: we don’t want the AI to act in ways which we have good reason to believe are unreliable.

ETA: My strawman-ML-version of your argument is that you would prefer imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things). This seems wrong to me.

The analogy isn’t perfect, since the AI can still do things to maximize human approval which the human would never have thought of, as well as things which the human could think of but didn’t have the computational resources to do. It does seem like a fairly good analogy, though.

• Okay, I think I misunderstood what you were claiming in this post. Based on the following line:

I claimed that an agent which learns your utility function (pretending for a moment that “your utility function” really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you.

I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)

I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”

Examples:

You could argue that an AI which is trying to be helpful will buy lottery tickets in such cases no matter how deluded the humans think it is. But, not only is this not very corrigible behavior, but also it doesn’t make any sense from our perspective to make an AI reason in that way: we don’t want the AI to act in ways which we have good reason to believe are unreliable.

Yeah, an AI that optimizes the true utility function probably won’t be corrigible. From a theoretical standpoint, that seems fine—corrigibility seems like an easier target to shoot for, not a necessary aspect of an aligned AI. The reason we don’t want the scenario above is “we have good reason to believe [the AI is] unreliable”, which sounds like the AI is failing to optimize the utility function correctly.

If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is—taking less money just has a lower expected utility. So, it will get only $10 from Omega. If the AI is a policy-approval agent, it will think about what would have a higher expectation under the human’s beliefs: taking half, or taking it all. It’s quite possible in this case that it takes all the money.

This also sounds like the value-learning agent is simply bad at correctly optimizing the true utility function. (It seems to me that all of decision theory is about how to properly optimize a utility function in theory.)

We can go in the opposite extreme, and make $P_R$ a broad prior such as the Solomonoff distribution, with no information about our world in particular. I believe the observation has been made before that running UDT on such a prior could have weird results.

Again, seems like this proposal for making an aligned AI is just bad at optimizing the true utility function.

So I guess the way I would summarize this post:

• Value learning is hard.

• Even if you know the correct utility function, optimizing it is hard.

• Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.

Is this right?

• I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)

I would correct it to “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this in expectation according to some prior is still not aligned with you.”

I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”

This part is tricky for me to interpret. On the one hand, yes: specifically, even if you have all the processing power you need, you still need to optimize via a particular prior (AIXI optimizes via Solomonoff induction) since you can’t directly see what the consequences of your actions will be. So, I’m specifically pointing at an aspect of “optimizing it is hard” which is about having a good prior. You could say that “utility” is the true target, and “expected utility” is the proxy which you have to use in decision theory.

On the other hand, this might be a misleading way of framing the problem. It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.

• If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.

• The Jeffrey-Bolker rotation (mentioned in the post) gives me some reason to think of the prior and the utility function as one object, so that it doesn’t make sense to think about “the true human utility function” in isolation. None of my choice behavior (be it revealed preferences or verbally claimed preferences etc.) can differentiate between me assigning small probability to a set of possibilities (but caring moderately about what happens in those possibilities) and assigning a moderate probability (but caring very little what happens one way or another in those worlds). So, I’m not even sure it is sensible to think of $U_H$ alone as capturing human preferences; maybe $U_H$ doesn’t really make sense apart from $P_H$.

So, to summarize,

1. I agree that “even assuming we know the true utility function, optimizing it is hard”—but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.

2. Even under the idealized assumption that humans are perfectly coherent decision-theoretic agents, I’m not sure it makes sense to say there’s a “true human utility function”—the VNM theorem only gets a $U$ which is unique up to such-and-such by assuming a fixed notion of probability. The Jeffrey-Bolker representation theorem, which justifies rational agents having probability and utility functions in one theorem rather than justifying the two independently, shows that we can do this “rotation” which shifts which part of the preferences are represented in the probability vs in the utility, without changing the underlying preferences.

3. If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.

• It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.

If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.

Sure. I was claiming that it is also a reasonable notion of alignment. My reason for not using that notion of alignment is that it doesn’t seem practically realizable. However, if we could magically give the AI the “true universe” prior with the “true utility function”, I would be happy and say we were done, even if it wasn’t justifiable and couldn’t explain it to humans. I agree it would not be aligned in the sense of the post.

So, I’m not even sure it is sensible to think of $U_H$ alone as capturing human preferences; maybe $U_H$ doesn’t really make sense apart from $P_H$.

This seems to argue that if my AI knew the winning lottery numbers, but didn’t have a chance to tell me how it knows this, then it shouldn’t buy the winning lottery ticket. I agree the Jeffrey-Bolker rotation seems to indicate that we should think of probutilities instead of probabilities and utilities separately, but it seems like there really are some very clear actual differences in the real world, and we should account for them somehow. Perhaps one difference is that probabilities change in response to new information, whereas (idealized) utility functions don’t. (Obviously humans don’t have idealized utility functions, but this is all a theoretical exercise anyway.)

I agree that “even assuming we know the true utility function, optimizing it is hard”—but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.

Thanks for clarifying, that’s clearer to me now.

If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.

I generally agree with the objective you propose (for practical reasons). The obvious way to do this is to do imitation learning, where (to a first approximation) you just copy the human’s policy. (Or alternatively, have the policy that a human would approve of you having.) This won’t let you exceed human intelligence, which seems like a pretty big problem. Do you expect an AI using policy alignment to do better than humans at tasks? If so, how is it doing better? My normal answer to this in the EV framework is “it has better estimates of probabilities of future states”, but we can’t do that any more. Perhaps you’re hoping that the AI can explain its plan to a human, and the human will then approve of it even though they wouldn’t have before the explanation. In that case, the human’s probutilities have changed, which means that policy alignment is now “alignment to a thing that I can manipulate”, which seems bad.

Fwiw I am generally in favor of approaches along the lines of policy alignment, I’m more confused about the theory behind it here.

• I’m not even sure whether you are closer or further from understanding what I meant, now. I think you are probably closer, but stating it in a way I wouldn’t. I see that I need to do some careful disambiguation of background assumptions and language.

Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.

This part, at least, is getting at the same intuition I’m coming from. However, I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies. (I am thinking I’ll write another post to make that connection clearer.) I will have to think harder about the difference between how you’re framing things and how I would frame things, to try to clarify more.

• I’m not even sure whether you are closer or further from understanding what I meant, now.

:(

I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies.

My assumption was that you were arguing for why learning policies directly (assuming we could do it) has advantages over the default approach of value learning + optimization. That framing seems to explain most of the post.

• It’s quite possible in this case that it takes all the money.

Did you mean to say that it’s quite possible that it takes half the money?

• Separately, I still don’t understand the counterfactual mugging case. (Disclaimer: I haven’t gone through any math around counterfactual mugging.) It seems really strange that if the human was certain about the digit, they wouldn’t pay up, but if the human is uncertain about the digit but is certain that the AI knows the digit, then the human would not want the AI to intervene. But possibly it’s not worth getting into this detail.

Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10. We suppose that, since there is a short proof of exactly what Omega does, it is already present in the mathematical database included in the AI’s prior. If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is—taking less money just has a lower expected utility. So, it will get only $10 from Omega. If the AI is a policy-approval agent, it will think about what would have a higher expectation under the human’s beliefs: taking half, or taking it all. It’s quite possible in this case that it takes all the money.

I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior. Presumably, “what Omega does” depends on your own policy, so if you have a proof about what Omega does, that proof also determines your action, and there is nothing left for the agent to consider.

To be clear, I think it’s reasonable to consider AIs that try to figure out proofs of “what Omega does”, but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does. And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.

• I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior.

You may not recognize it as such, especially if Omega is using a different axiom system than you. So, you can still be ignorant of what you’ll do while knowing what Omega’s prediction of you is. This makes it impossible for your probability distribution to treat the two as correlated anymore.

but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does

Yeah, that’s the problem here.

And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.

Only if the agent takes that one proof out of the prior, but still has enough structure in the prior to see how the decision problem plays out. This is the problem of constructing a thin prior. You can (more or less) solve any decision problem by making the agent sufficiently updateless, but you run up against the problem of making it too updateless, at which point it behaves in absurd ways (lacking enough structure to even understand the consequences of policies correctly).

Hence the intuition that the correct prior to be updateless with respect to is the human one (which is, essentially, the main point of the post).

• Hey there!

A useful thing would be an example of when a policy approval agent would do something that a human wouldn’t, and what gains in efficiency the policy approval agent has over a normal human acting.

I feel that the formulation “the humans have a utility function” may obscure part of what’s going on. Part of the advantage of approval agents is that they allow humans to express their sometimes incoherent meta-preferences as well (“yeah, I want to do X, but don’t force me to do it”). Assuming the human preferences are already coherent reduces the attraction of the approach.

• Ah, I agree that this proposal may have better ways to relax the assumption that the human has a utility function than value-learning does. I wanted to focus on the simpler case here. Perhaps I’ll write a follow-up post considering the generalization.

Maybe I’ll try to insert an example where the policy approval agent does something the human wouldn’t into this post, though.

Here’s a first stab: suppose that the AI has a subroutine which solves complex planning problems. Furthermore, the human trusts the subroutine (does not expect it to be cleverly choosing plans which solve the problems as stated but cause other problems). The human is smart enough to formulate day-to-day management problems which arise at work as formally-specified planning problems, and would like to be told what the answers to those problems are. In this case, the AI will tell the human those answers.

This also illustrates a limited way the policy-approval agent can avoid over-optimizing simplified problem statements: if the human does not trust the planning subroutine (expects it to goodhart or such), then the AI will not use such a subroutine.

(This isn’t maximally satisfactory, since the human may easily be mistaken about which subroutines to trust. I think the AI can do a little better than this, but maybe not in a way which addresses the fundamental issue.)

• Iterated distillation and amplification seems like an example of a thing that is like policy approval, and it could do lots of things that a human is unable to, such as becoming really good at chess or Go. (You can imagine removing the distillation steps if those seem too different from policy approval, and the point still applies.)

• What about calling it “policy alignment” in analogy with “value alignment”?

So, the AI still needs to figure out what is “irrational” and what is “real” in $P_H$, just like value-learning needs to do for $U_H$.

Since I’m very confused about what my $P_H$ should be (I may be happy to change it in any number of ways if someone gave me the correct solutions to a bunch of philosophical problems), there may not be anything “real” in my $P_H$ that I’d want an AI to learn and use in an uncritical way. It seems like this mostly comes down to what probabilities really are: if probabilities are something objective like “how real” or “how much existence” each possible world is/has, then I’d want an AI to use its greater intellect to figure out what is the correct prior and use that, but if probabilities are something subjective like how much I care about each possible world, then maybe I’d want the AI to learn and use my $P_H$. I’m kind of confused that you give a bunch of what seem to me to be less important considerations on whether the AI should use my probability function or its own to make decisions, and don’t mention this one.

• “Policy alignment” seems like an improvement, especially since “policy approval” invokes government policy.

With respect to the rest:

On the one hand, I’m tempted to say that to the extent you recognize how confused you are about what probabilities are, and that this confusion has to do with how you reason in the real world, your $P_H$ is going to change a lot when updated on certain philosophical arguments. As a result, optimizing a strategy updatelessly via $P_H$ is going to take that into account, shifting behavior significantly in contingencies in which various philosophical arguments emerge, and potentially putting a significant amount of processing power toward searching for such arguments.

On the other hand, I buy my “policy alignment” proposal only to the extent that I buy UDT, which is not entirely. I don’t know how to think about UDT together with the shifting probabilities which come from logical induction. The problem is similar to the one you outline: just as it is unclear that a human should think its own $P_H$ has any useful content which should be locked in forever in an updateless reasoner, it is similarly unclear that a fixed logical inductor state (after running for a finite amount of time) has any useful content which one would want to lock in forever.

I don’t yet know how to think about this problem. I suspect there’s something non-obvious to be said about the extent to which $P_H$ trusts other belief distributions (IE, something at least a bit more compelling than the answer I gave first, but not entirely different in form).

• I was really surprised that the “background problem” is almost the same problem as in value learning in some formulations of bounded rationality. In the information-theoretic bounded rationality formalism, the bounded agent acts based on a combination of a prior (representing previous knowledge) and utilities (what the agent wants). (It seems that in some cases of updating humans, it is possible to disentangle the two.)

While the “counterexamples” to “optimizing human utility according to AI beliefs” show how this fails in somewhat tricky cases, it seems to me it will be easy to find “counterexamples” where the “policy-approval agent” would fail (as compared to what is intuitively good).

From an “engineering perspective”, if I was forced to choose something right now, it would be an AI “optimizing human utility according to AI beliefs” but asking for clarification when such a choice diverges too much from the “policy-approval” one.

• While the “counterexamples” to “optimizing human utility according to AI beliefs” show how this fails in somewhat tricky cases, it seems to me it will be easy to find “counterexamples” where the “policy-approval agent” would fail (as compared to what is intuitively good).

I agree that it’ll be easy to find counterexamples to policy-approval, but I think it’ll be harder than for value-alignment agents. We have the advantage that (in the limited sense provided by the assumption that the human has a coherent probability and utility) we can prove that we “do what the human would want” (in a more comprehensive sense than we can for value alignment).

• To me this sort of approach feels like a non-starter, because you’re ignoring the thing that generates the policy in favor of the policy itself, which would seem to expose you to Goodharting that would be even worse than the Goodharting we expect in terms of values, since policy is a grosser instrument. Is there some way in which you think this is not the case, namely that focusing on policy alignment would help us better avoid Goodharting than is possible with value alignment?

• Even if the process of learning $P_H$ is doing the work to turn it into a coherent probability distribution (removing irrationality and making things well-defined), the end result may find situations which the AI finds itself in too complex to be conceived.

I had trouble parsing the end of this sentence. Is the idea that the AI might get into situations that are too complex for the humans to understand?

• Yeah. I’ve edited it a bit for clarity.

• “Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.”

I don’t want a “helper agent” to do what I think I’d prefer it to do. I mean, I REALLY don’t want that or anything like that.

If I wanted that, I could just set it up to follow orders to the best of its understanding, and then order it around. The whole point is to make use of the fact that it’s smarter than I am and can achieve outcomes I can’t foresee in ways I can’t think up.

What I intuitively want it to do is what makes me happiest with the state of the world after it’s done it. That particular formulation may get hairy with cases where its actions alter my preferences, but just abandoning every possible improvement in favor of my pre-existing guesses about desirable actions isn’t a satisfactory answer.

• If I wanted that, I could just set it up to follow orders to the best of its understanding, and then order it around. The whole point is to make use of the fact that it’s smarter than I am and can achieve outcomes I can’t foresee in ways I can’t think up.

The AI here can do things which you wouldn’t think up.

For example, it could have more computational power than you to search for plans which maximize expected utility according to your probability and utility functions. Then, it could tell you the answer, if you’re the kind of person who likes to be told those kinds of answers (IE, if this doesn’t violate your sense of autonomy/self-determination).

Or, if there is any algorithm whose beliefs you trust more than your own, or would trust more than your own if some conditions held (which the AI can itself check), then the AI can optimize your utility function in expectation under that algorithm’s beliefs rather than under your own, since you would prefer that.

• For example, it could have more computational power than you to search for plans which maximize expected utility according to your probability and utility functions. Then, it could tell you the answer

Would it, though? It’s not evaluating actions on my future probutility, otherwise it would wirehead me. It’s evaluating actions on my present probutility. So now the answer seems to depend on whether we allow “tell me the right answer” as a primitive action, or if it is evaluated as “tell me [String]”, which has low probutility.

But of course, if “tell me the right answer” is primitive, how do we stop “do the right thing” from being primitive, which lands us right back in the hot water of strong optimization of ‘utility’ this proposal was supposed to prevent? So I think it should evaluate the specific output, which has low probability according to the human, and therefore not tell you.

• I’ll try and write up a proof that it can do what I think it can.