# Policy Approval

(ETA: The name “policy ap­proval” wasn’t great. I think I will use the term “policy align­ment” to con­trast with “value align­ment” go­ing for­ward, at the sug­ges­tion of Wei Dai in the com­ments.)

I re­cently had a con­ver­sa­tion with Stu­art Arm­strong in which I claimed that an agent which learns your util­ity func­tion (pre­tend­ing for a mo­ment that “your util­ity func­tion” really is a well-defined thing) and at­tempts to op­tim­ize it is still not per­fectly aligned with you. He chal­lenged me to write up spe­cific ex­amples to back up my claims.

I’ll also give a very sketchy al­tern­at­ive to value learn­ing, which I call policy ap­proval. (The policy ap­proval idea emerged out of a con­ver­sa­tion with Andrew Critch.)

# Background

Stu­art Arm­strong has re­cently been do­ing work show­ing the dif­fi­culty of in­fer­ring hu­man val­ues. To sum­mar­ize: be­cause hu­mans are ir­ra­tional, a value-learn­ing ap­proach like CIRL needs to jointly es­tim­ate the hu­man util­ity func­tion and the de­gree to which the hu­man is ra­tional—oth­er­wise, it would take all the mis­takes hu­mans make to be pref­er­ences. Un­for­tu­nately, this leads to a severe prob­lem of iden­ti­fi­ab­il­ity: hu­mans can be as­signed any val­ues what­so­ever if we as­sume the right kind of ir­ra­tion­al­ity, and the usual trick of pre­fer­ring sim­pler hy­po­theses doesn’t seem to help in this case.

I also want to point out that a sim­ilar prob­lem arises even without ir­ra­tion­al­ity. Vladi­mir Nesov ex­plored how prob­ab­il­ity and util­ity can be mixed into each other without chan­ging any de­cisions an agent makes. So, in prin­ciple, we can’t de­term­ine the util­ity or prob­ab­il­ity func­tion of an agent uniquely based on the agent’s be­ha­vior alone (even in­clud­ing hy­po­thet­ical be­ha­vior in coun­ter­fac­tual situ­ations). This fact was dis­covered earlier by Jef­frey and Bolker, and is ana­lyzed in more de­tail in the book The Lo­gic of De­cision. For this reason, I call the trans­form “Jef­frey-Bolker ro­ta­tion”.

To give an il­lus­trat­ive ex­ample: it doesn’t mat­ter whether we as­sign very low prob­ab­il­ity to an event, or care very little about what hap­pens given that event. Sup­pose a love-max­im­iz­ing agent is un­able to as­sign nonzero util­ity to a uni­verse where love isn’t real. The agent may ap­pear to ig­nore evid­ence that love isn’t real. We can in­ter­pret this as not caring what hap­pens con­di­tioned on love not be­ing real; or, equally valid (in terms of the ac­tions which the agent chooses), we can in­ter­pret the agent as hav­ing an ex­tremely low prior prob­ab­il­ity on love not be­ing real.

At MIRI, we sometimes use the term “probutility” to indicate the (probability, utility) pair in a way which reminds us that they can’t be disentangled from one another. Jeffrey-Bolker rotation changes probabilities and utilities, but does not change the overall probutilities.
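As a rough illustration of why choices alone cannot separate the two, here is a minimal Python sketch. It assumes a simplified setting (state probabilities that do not depend on the action, and state-dependent utilities) rather than the full Jeffrey-Bolker framework. All numbers, state names, and the helper `rotate` are hypothetical. Multiplying each state's probability by a weight while dividing its utilities by the same weight rescales every expected utility by one common factor, so the ranking of actions does not change.

```python
def expected_utility(prob, util, action):
    """Sum over states of P(s) * U(action, s)."""
    return sum(prob[s] * util[(action, s)] for s in prob)


def best_action(prob, util, actions):
    """Action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(prob, util, a))


def rotate(prob, util, weight):
    """Move 'caring' out of utility and into probability (hypothetical helper).

    P'(s) is proportional to P(s) * w(s), and U'(a, s) = U(a, s) / w(s).
    """
    z = sum(prob[s] * weight[s] for s in prob)  # normalizing constant
    new_prob = {s: prob[s] * weight[s] / z for s in prob}
    new_util = {(a, s): u / weight[s] for (a, s), u in util.items()}
    return new_prob, new_util


actions = ["seek_love", "hedge"]

# Reading A: love being unreal is quite likely, but almost nothing matters there.
prob_a = {"love_real": 0.5, "love_not_real": 0.5}
util_a = {
    ("seek_love", "love_real"): 10.0,
    ("seek_love", "love_not_real"): 0.0,
    ("hedge", "love_real"): 4.0,
    ("hedge", "love_not_real"): 0.02,
}

# Reading B: love being unreal is very unlikely, but outcomes there matter more.
weight = {"love_real": 1.0, "love_not_real": 0.01}
prob_b, util_b = rotate(prob_a, util_a, weight)

# Both readings recommend the same action.
choice_a = best_action(prob_a, util_a, actions)
choice_b = best_action(prob_b, util_b, actions)
```

An observer who only sees the chosen actions has no way to tell reading A from reading B.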

Given these prob­lems, it would be nice if we did not ac­tu­ally need to learn the hu­man util­ity func­tion. I’ll ad­voc­ate that po­s­i­tion.

My un­der­stand­ing is that Stu­art Arm­strong is op­tim­istic that hu­man val­ues can be in­ferred des­pite these prob­lems, be­cause we have a lot of use­ful prior in­form­a­tion we can take ad­vant­age of.

It is in­tu­it­ive that a CIRL-like agent should learn what is ir­ra­tional and then “throw it out”, IE, de-noise hu­man pref­er­ences by look­ing only at what we really prefer, not at what we mis­takenly do out of short-sighted­ness or other mis­takes. On the other hand, it is not so ob­vi­ous that the prob­ab­il­ity/​util­ity dis­tinc­tion should be handled in the same way. Should an agent dis­en­tangle be­liefs from pref­er­ences just so that it can throw out hu­man be­liefs and op­tim­ize the pref­er­ences alone? I ar­gue against this here.

# Main Claim

Ignor­ing is­sues of ir­ra­tion­al­ity or bounded ra­tion­al­ity, what an agent wants out of a helper agent is that the helper agent does pre­ferred things.

Suppose a robot is trying to help a perfectly rational human. The human has probability function $P_H$ and utility function $U_H$. The robot is in epistemic state $e$. The robot has a set of actions $\{a_1, \ldots, a_n\}$. The proposition “the robot takes the $i$th action when in epistemic state $e$” is written as $a_i^e$. The set of full world-states is $S$. What the human would like the robot to do is given by:

$$\operatorname*{argmax}_{i} \sum_{s \in S} P_H(s \mid a_i^e)\, U_H(s)$$

(Or by the analogous causal counterfactual, if the human thinks that way.)

This no­tion of what the hu­man wants is in­vari­ant to Jef­frey-Bolker ro­ta­tion; the ro­bot doesn’t need to dis­en­tangle prob­ab­il­ity and util­ity! It only needs to learn probutil­it­ies.

The equa­tion writ­ten above can’t be dir­ectly op­tim­ized, since the ro­bot doesn’t have dir­ect ac­cess to hu­man probutil­it­ies. However, I’ll broadly call any at­tempt to ap­prox­im­ate that equa­tion “policy ap­proval”.

Notice that this is closely ana­log­ous to UDT. UDT solves dy­namic in­con­sist­en­cies—situ­ations in which an AI could pre­dict­ably dis­like the de­cisions of its fu­ture self—by op­tim­iz­ing its ac­tions from the per­spect­ive of a fixed prior, IE, its ini­tial self. Policy ap­proval re­solves in­con­sist­en­cies between the AI and the hu­man by op­tim­iz­ing the AI’s ac­tions from the hu­man’s per­spect­ive. The main point of this post is that we can use this ana­logy to pro­duce counter­examples to the typ­ical value-learn­ing ap­proach, in which the AI tries to op­tim­ize hu­man util­ity but not ac­cord­ing to hu­man be­liefs.

I will some­what ig­nore the dis­tinc­tion between UDT1.0 and UDT1.1.

# Examples

These ex­amples serve to il­lus­trate that “op­tim­iz­ing hu­man util­ity ac­cord­ing to AI be­liefs” is not ex­actly the same as “do what the hu­man would want you to do”, even when we sup­pose “the hu­man util­ity func­tion” is per­fectly well-defined and can be learned ex­actly by the AI.

In these examples, I will suppose that the AI has its own probability distribution $P_R$. It reasons updatelessly with respect to evidence $e$ it sees, but with full prior knowledge of the human utility function $U_H$:

$$\operatorname*{argmax}_{i} \sum_{s \in S} P_R(s \mid a_i^e)\, U_H(s)$$

(Here $a_i^e$ is the proposition that the AI takes its $i$th action in epistemic state $e$, as before.)

I use an updateless agent to avoid accusations that of course an updateful agent would fail classic UDT problems. However, it is not really very important for the examples.

I assume prior knowledge of $U_H$ to avoid any tricky issues which might arise by attempting to combine updatelessness with value learning.

## Coun­ter­fac­tual Mugging

It seems reas­on­able to sup­pose that the AI will start out with some math­em­at­ical know­ledge. Ima­gine that the AI has a data­base of the­or­ems in memory when it boots up, in­clud­ing the first mil­lion di­gits of pi. Treat these as part of the agent’s prior.

Sup­pose, on the other hand, that the hu­man which the AI wants to help does not know more than a hun­dred di­gits of pi.

The hu­man and the AI will dis­agree on what to do about coun­ter­fac­tual mug­ging with a lo­gical coin in­volving di­gits of pi which the AI knows and the hu­man does not. If Omega ap­proaches the AI, the AI will re­fuse to par­ti­cip­ate, but the hu­man will wish the AI would. If Omega ap­proaches the hu­man, the AI may try to pre­vent the hu­man from par­ti­cip­at­ing, to the ex­tent that it can do so without vi­ol­at­ing other as­pects of the hu­man util­ity func­tion.
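To make the disagreement concrete, here is a small sketch of the policy comparison. The stakes (a 10,000 reward and a 100 payment) and the function names are assumptions for illustration; the post does not specify them. The sketch also assumes the digit in question turns out to be the one where Omega asks for payment, so the AI's prior, which already contains the digit, puts probability 0 on the rewarding branch.

```python
def expected_value(p_reward_branch, policy_pays, reward=10_000, cost=100):
    """Expected value of a policy under a prior over the logical coin.

    Omega pays `reward` in the reward branch only if the agent is the kind of
    agent that pays `cost` in the other branch. Stakes are hypothetical.
    """
    reward_payoff = reward if policy_pays else 0
    ask_payoff = -cost if policy_pays else 0
    return p_reward_branch * reward_payoff + (1 - p_reward_branch) * ask_payoff


def prefers_paying(p_reward_branch):
    """True if the paying policy beats the refusing policy under this prior."""
    return expected_value(p_reward_branch, True) > expected_value(p_reward_branch, False)


human_prior = 0.5  # the human does not know the digit
ai_prior = 0.0     # the AI's prior already contains the digit (assumed unfavorable)

human_wants_ai_to_pay = prefers_paying(human_prior)
ai_wants_to_pay = prefers_paying(ai_prior)
```

Under these assumptions the same utility function yields opposite policies, purely because the two priors differ.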

## “Too Up­date­less”

Maybe the prob­lem with the coun­ter­fac­tual mug­ging ex­ample is that it doesn’t make sense to pro­gram the AI with a bunch of know­ledge in its prior which the hu­man doesn’t have.

We can go in the opposite extreme, and make $P_R$ a broad prior such as the Solomonoff distribution, with no information about our world in particular.

I believe the observation has been made before that running UDT on such a prior could have weird results. There could be a world with higher prior probability than ours, inhabited by Omegas who ask the AI to optimize alien values in most universes (including Earth) in exchange for the Omegas maximizing $U_H$ in their own world. (This particular scenario doesn’t seem particularly probable, but it does seem quite plausible that some weird universes will have higher probability than our universe in the Solomonoff prior, and may make some such bargain.)

Again, this is something which can happen in the maximization using $P_R$ but not in the one using $P_H$, unless humans themselves would approve of the multiversal bargain.

## “Just Hav­ing a Very Dif­fer­ent Prior”

Maybe $P_R$ is neither strictly more knowledgeable than $P_H$ nor less, but the two are very different on some specific issues. Perhaps there’s a specific plan which, when $P_R$ is conditioned on evidence so far, looks very likely to have many good consequences. $P_H$ considers the plan very likely to have many bad consequences. Also suppose that there aren’t any interesting consequences of this plan in counterfactual branches, so UDT considerations don’t come in.

Also, sup­pose that there isn’t time to test the dif­fer­ing hy­po­theses in­volved which make hu­mans think this is such a bad plan while AIs think it is so good. The AI has to de­cide right now whether to en­act the plan.

The value-learn­ing agent will im­ple­ment this plan, since it seems good on net for hu­man val­ues. The policy-ap­proval agent will not, since hu­mans wouldn’t want it to.
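A toy version of this scenario, with made-up probabilities and outcome labels, shows the two selection rules coming apart. The sketch assumes both agents know $U_H$ exactly and, for simplicity, that the policy-approval agent has direct access to $P_H$ (in practice it would have to estimate it).

```python
def choose(prob, util, actions):
    """argmax over actions of sum_s prob[action][s] * util[s]."""
    return max(
        actions,
        key=lambda a: sum(p * util[s] for s, p in prob[a].items()),
    )


actions = ["enact_plan", "refrain"]

# Shared human utility over (hypothetical) outcomes.
U_H = {"good": 1.0, "neutral": 0.0, "bad": -1.0}

# The AI's beliefs: the plan very likely goes well.
P_R = {
    "enact_plan": {"good": 0.9, "bad": 0.1},
    "refrain": {"neutral": 1.0},
}

# The human's beliefs: the plan very likely goes badly.
P_H = {
    "enact_plan": {"good": 0.1, "bad": 0.9},
    "refrain": {"neutral": 1.0},
}

value_learner_choice = choose(P_R, U_H, actions)    # human utility, AI beliefs
policy_approval_choice = choose(P_H, U_H, actions)  # human utility, human beliefs
```

The only difference between the two calls is which probability function is plugged in, and that alone is enough to flip the decision.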

Ob­vi­ously, one might ques­tion whether it is reas­on­able to as­sume that things got to a point where there was such a large dif­fer­ence of opin­ion between the AI and the hu­mans, and no time to re­solve it. Ar­gu­ably, there should be safe­guards against this scen­ario which the value-learn­ing AI it­self would want to set up, due to facts about hu­man val­ues such as “the hu­mans want to be in­volved in big de­cisions about their fu­ture” or the like.

Non­ethe­less, faced with this situ­ation, it seems like policy-ap­proval agents do the right thing while value-learn­ing agents do not.

# Is­sues/​Objections

## Isn’t it problematic to optimize via human beliefs, since human beliefs are low-quality?

I think this is some­what true and some­what not.

• Partly, this is like saying “isn’t UDT bad because it doesn’t learn?” Actually, UDT acts as if it updates most of the time, so it is wrong to think of it as incapable of learning. Similarly, although the policy-approval agent uses $P_H$, it will mostly act as if it has updated on a lot of information. So, maybe you believe human beliefs aren’t very good, but do you think we’re capable of learning almost anything eventually? If so, this may address a large component of the concern. In particular, if you trust the output of certain machine learning algorithms more than you trust yourself, the AI can run those algorithms and use their output.

• On the other hand, humans probably have incoherent $P_H$, and not just because of logical uncertainty. So, the AI still needs to figure out what is “irrational” and what is “real” in $P_H$, just like value-learning needs to do for $U_H$.

## If hu­mans would want an AI to op­tim­ize via hu­man be­liefs, won’t that be re­flec­ted in the hu­man util­ity func­tion?

Or: If policy-ap­proval were good, wouldn’t a value-learner self modify into policy-ap­proval any­way?

I don’t think this is true, but I’m not sure. Cer­tainly there could be simple agents who value-learners co­oper­ate with without ever de­cid­ing to self-modify into policy-ap­proval agents. Per­haps there is some­thing about hu­man pref­er­ence which de­sires the AI to co­oper­ate with the hu­man even when the AI thinks this is (oth­er­wise) net-neg­at­ive for hu­man val­ues.

## Aren’t I ig­nor­ing the fact that the AI needs its own be­liefs?

In “Just Having a Very Different Prior”, I claimed that if $P_H$ and $P_R$ disagree about the consequences of a plan, value-learning can do something humans strongly don’t want it to do, whereas policy-approval cannot. However, my definition of policy-approval ignores learning. Realistically, the policy-approval agent needs to also have beliefs $P_R$, which it uses to approximate the human approval of its actions. Can’t the same large disagreement emerge from this?

I think the concern is qualitatively less, because the policy-approval agent uses $P_R$ only to estimate $P_H$ and $U_H$. If the AI knows that humans would have a large disagreement with the plan, the policy-approval agent would not implement the plan, while the value-learning agent would.

For policy-approval to go wrong, it needs to have a bad estimate of $P_H$ and $U_H$.

## The policy is too big.

Even if the process of learning $P_H$ is doing the work to turn it into a coherent probability distribution (removing irrationality and making things well-defined), it still may not be able to conceive of important possibilities. The evidence which the AI uses to decide how to act, in the equations given earlier, may be a large data stream with some human-incomprehensible parts.

As a res­ult, it seems like the AI needs to op­tim­ize over com­pact/​ab­stract rep­res­ent­a­tions of its policy, sim­il­arly to how policy se­lec­tion in lo­gical in­duct­ors works.

This isn’t an en­tirely sat­is­fact­ory an­swer, since (1) the rep­res­ent­a­tion of a policy as a com­puter pro­gram could still es­cape hu­man un­der­stand­ing, and (2) it is un­clear what it means to cor­rectly rep­res­ent the policy in a hu­man-un­der­stand­able way.

## Terminology

Aside from is­sues with the ap­proach, my term “policy ap­proval” may be ter­rible. It sounds too much like “ap­proval-dir­ec­ted agent”, which means some­thing dif­fer­ent. I think there are sim­il­ar­it­ies, but they aren’t strong enough to jus­tify re­fer­ring to both as “ap­proval”. Any sug­ges­tions?

(The remaining points are possible advantages of the approach. They are very speculative.)

## Lo­gical Up­date­less­ness?

One of the ma­jor obstacles to pro­gress in de­cision the­ory right now is that we don’t know of a good up­date­less per­spect­ive for lo­gical un­cer­tainty. Maybe a policy-ap­proval agent doesn’t need to solve this prob­lem, since it tries to op­tim­ize from the hu­man per­spect­ive rather than its own. Roughly: lo­gical up­date­less­ness is hard be­cause it tends to fall into the “too up­date­less” is­sue above. So, maybe it can be a non-is­sue in the right for­mu­la­tion of policy ap­proval.

## Cor­ri­gib­il­ity?

Stuart Armstrong is somewhat pessimistic about corrigibility. Perhaps there is something which can be done in policy-approval land which can’t be done otherwise. The “Just Having a Very Different Prior” example points in this direction; it is an example where policy-approval acts in a much more corrigible way.

A value-learning agent can always resist humans if it is highly confident that its plan is a good one which humans are opposing irrationally. A policy-approval agent can think its plan is a good one but also think that humans would prefer it to be corrigible on principle regardless of that.

On the other hand, a policy-ap­proval agent isn’t guar­an­teed to think that. Per­haps policy-ap­proval learn­ing can be spe­cified with some kind of highly cor­ri­gible bias, so that it re­quires a lot of evid­ence to de­cide that hu­mans don’t want it to be­have cor­ri­gibly in a par­tic­u­lar case?

# Conclusion

I’ve left out some spec­u­la­tion about what policy-ap­proval agents should ac­tu­ally look like, for the sake of keep­ing mostly to the point (the dis­cus­sion with Stu­art). I like this idea be­cause it in­volves a change in per­spect­ive of what an agent should be, sim­ilar to the change which UDT it­self made.

• I’m con­fused. If the AI knows a mil­lion di­gits of pi, and it can pre­vent Omega from coun­ter­fac­tu­ally mug­ging me where it knows I will lose money… shouldn’t it try to pre­vent that from hap­pen­ing? That seems like the right be­ha­vior to me. Sim­il­arly, if I knew that the AI knows a mil­lion di­gits of pi, then if it gets coun­ter­fac­tu­ally mugged, it shouldn’t give up the money.

(Per­haps the ar­gu­ment is that as long as Omega was un­cer­tain about the di­git when de­cid­ing what game to pro­pose, then you should pay up as ne­ces­sary, re­gard­less of what you know. But if that’s the ar­gu­ment, then why can’t the AI go through the same reas­on­ing?)

> Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.

If the AI knows the win­ning num­bers for the lot­tery, then it should buy that ticket for me, even though (if I don’t know that the AI knows the win­ning num­bers) I would dis­prefer that ac­tion. Even bet­ter would be if it ex­plained to me what it was do­ing, after which I would prefer the ac­tion, but let’s say that wasn’t pos­sible for some reason (maybe it per­formed a very com­plex sim­u­la­tion of the world to fig­ure out the win­ning num­ber).

It seems like if the AI knows my util­ity func­tion and is op­tim­iz­ing it, that does per­form well. Now for prac­tical reas­ons, we prob­ably want to in­stead build an AI that does what we prefer it to do, but this seems to be be­cause it would be hard to learn the right util­ity func­tion, and er­rors along the way could lead to cata­strophe, not be­cause it would be bad for the AI to op­tim­ize the right util­ity func­tion.

ETA: My straw­man-ML-ver­sion of your ar­gu­ment is that you would prefer im­it­a­tion learn­ing in­stead of in­verse re­in­force­ment learn­ing (which dif­fer when the AI and hu­man know dif­fer­ent things). This seems wrong to me.

• > I’m confused. If the AI knows a million digits of pi, and it can prevent Omega from counterfactually mugging me where it knows I will lose money… shouldn’t it try to prevent that from happening? That seems like the right behavior to me. Similarly, if I knew that the AI knows a million digits of pi, then if it gets counterfactually mugged, it shouldn’t give up the money.

If you don’t think one should pay up in coun­ter­fac­tual mug­ging in gen­eral, then my ar­gu­ment won’t land. Rather than ar­guing that you want to be coun­ter­fac­tu­ally mugged, I’ll try and ar­gue a dif­fer­ent de­cision prob­lem.

Suppose that Omega is running a fairly simple and quick algorithm which is nonetheless able to predict an AI with more processing power, due to using a stronger logic or similar tricks. Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10.

We sup­pose that, since there is a short proof of ex­actly what Omega does, it is already present in the math­em­at­ical data­base in­cluded in the AI’s prior.
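The payoff structure of this box game can be tabulated in a few lines. This is only a sketch of the setup as described; the function names are hypothetical, and it assumes Omega’s prediction is always correct.

```python
def box_contents(predicted_action):
    """Omega puts in $1000 if it predicts 'half', otherwise $10."""
    return 1000 if predicted_action == "half" else 10


def payoff(action, contents):
    """Money received for an action, given what is already in the box."""
    return contents / 2 if action == "half" else contents


def payoff_if_predicted(action):
    """Payoff when Omega correctly predicts the action (assumed here)."""
    return payoff(action, box_contents(action))


# With the contents held fixed, taking everything always pays more...
dominates = all(payoff("all", c) > payoff("half", c) for c in (10, 1000))

# ...but comparing whole policies, the half-taker ends up richer.
half_policy = payoff_if_predicted("half")
all_policy = payoff_if_predicted("all")
```

Which comparison an agent makes depends on whether its prior already fixes the contents of the box.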

If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is; taking less money just has a lower expected utility. So, it will get only $10 from Omega. If the AI is a policy-approval agent, it will think about what would have a higher expectation in the human’s expectation: taking half, or taking it all. It’s quite possible in this case that it takes all the money.

> (Perhaps the argument is that as long as Omega was uncertain about the digit when deciding what game to propose, then you should pay up as necessary, regardless of what you know. But if that’s the argument, then why can’t the AI go through the same reasoning?)

That is part of the argument for paying up in counterfactual mugging, yes. But both us and Omega need to be uncertain about the digit, since if our prior can already predict that Omega is going to ask us for $10 rather than give us any money, there’s no reason for us to pay up. So, it depends on the prior, and can turn out differently if our vs the agent’s prior is used.

> If the AI knows the winning numbers for the lottery, then it should buy that ticket for me, even though (if I don’t know that the AI knows the winning numbers) I would disprefer that action.

If I think that the AI tends to be miscalibrated about lottery-ticket beliefs, there is no reason for me to want it to buy the ticket. If I think it is calibrated about lottery-ticket beliefs, I’ll like the policy of buying lottery tickets in such cases, so the AI will buy.

You could ar­gue that an AI which is try­ing to be help­ful will buy lot­tery tick­ets in such cases no mat­ter how de­luded the hu­mans think it is. But, not only is this not very cor­ri­gible be­ha­vior, but also it doesn’t make any sense from our per­spect­ive to make an AI reason in that way: we don’t want the AI to act in ways which we have good reason to be­lieve are un­re­li­able.

> ETA: My strawman-ML-version of your argument is that you would prefer imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things). This seems wrong to me.

The ana­logy isn’t per­fect, since the AI can still do things to max­im­ize hu­man ap­proval which the hu­man would never have thought of, as well as things which the hu­man could think of but didn’t have the com­pu­ta­tional re­sources to do. It does seem like a fairly good ana­logy, though.

• Okay, I think I mis­un­der­stood what you were claim­ing in this post. Based on the fol­low­ing line:

> I claimed that an agent which learns your utility function (pretending for a moment that “your utility function” really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you.

I thought you were ar­guing, “Sup­pose we knew your true util­ity func­tion ex­actly, with no er­rors. An AI that per­fectly op­tim­izes this true util­ity func­tion is still not aligned with you.” (Yes, hav­ing writ­ten it down I can see that is not what you ac­tu­ally said, but that’s the in­ter­pret­a­tion I ori­gin­ally ended up with.)

I would now re­ph­rase your claim as “Even as­sum­ing we know the true util­ity func­tion, op­tim­iz­ing it is hard.”

Examples:

> You could argue that an AI which is trying to be helpful will buy lottery tickets in such cases no matter how deluded the humans think it is. But, not only is this not very corrigible behavior, but also it doesn’t make any sense from our perspective to make an AI reason in that way: we don’t want the AI to act in ways which we have good reason to believe are unreliable.

Yeah, an AI that op­tim­izes the true util­ity func­tion prob­ably won’t be cor­ri­gible. From a the­or­et­ical stand­point, that seems fine—cor­ri­gib­il­ity seems like an easier tar­get to shoot for, not a ne­ces­sary as­pect of an aligned AI. The reason we don’t want the scen­ario above is “we have good reason to be­lieve [the AI is] un­re­li­able”, which sounds like the AI is fail­ing to op­tim­ize the util­ity func­tion cor­rectly.

> If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is; taking less money just has a lower expected utility. So, it will get only $10 from Omega. If the AI is a policy-approval agent, it will think about what would have a higher expectation in the human’s expectation: taking half, or taking it all. It’s quite possible in this case that it takes all the money.

This also sounds like the value-learning agent is simply bad at correctly optimizing the true utility function. (It seems to me that all of decision theory is about how to properly optimize a utility function in theory.)

> We can go in the opposite extreme, and make $P_R$ a broad prior such as the Solomonoff distribution, with no information about our world in particular.
>
> I believe the observation has been made before that running UDT on such a prior could have weird results.

Again, seems like this proposal for making an aligned AI is just bad at optimizing the true utility function.

So I guess the way I would summarize this post:

  • Value learning is hard.

  • Even if you know the correct utility function, optimizing it is hard.

  • Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.

Is this right?

• > I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)

I would correct it to “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this in expectation according to some prior is still not aligned with you.”

> I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”

This part is tricky for me to interpret.

On the one hand, yes: specifically, even if you have all the processing power you need, you still need to optimize via a particular prior (AIXI optimizes via Solomonoff induction) since you can’t directly see what the consequences of your actions will be. So, I’m specifically pointing at an aspect of “optimizing it is hard” which is about having a good prior. You could say that “utility” is the true target, and “expected utility” is the proxy which you have to use in decision theory.

On the other hand, this might be a misleading way of framing the problem. It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.

  • If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.

  • The Jeffrey-Bolker rotation (mentioned in the post) gives me some reason to think of the prior and the utility function as one object, so that it doesn’t make sense to think about “the true human utility function” in isolation. None of my choice behavior (be it revealed preferences or verbally claimed preferences etc) can differentiate between me assigning small probability to a set of possibilities (but caring moderately about what happens in those possibilities) and assigning a moderate probability (but caring very little what happens one way or another in those worlds). So, I’m not even sure it is sensible to think of $U_H$ alone as capturing human preferences; maybe $U_H$ doesn’t really make sense apart from $P_H$.

So, to summarize,

1. I agree that “even assuming we know the true utility function, optimizing it is hard”, but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.

2. Even under the idealized assumption that humans are perfectly coherent decision-theoretic agents, I’m not sure it makes sense to say there’s a “true human utility function”; the VNM theorem only gets a utility function which is unique up to such-and-such by assuming a fixed notion of probability. The Jeffrey-Bolker representation theorem, which justifies rational agents having probability and utility functions in one theorem rather than justifying the two independently, shows that we can do this “rotation” which shifts which part of the preferences are represented in the probability vs in the utility, without changing the underlying preferences.

3. If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.

• > It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
>
> If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.

Sure. I was claiming that it is also a reasonable notion of alignment. My reason for not using that notion of alignment is that it doesn’t seem practically realizable.

However, if we could magically give the AI the “true universe” prior with the “true utility function”, I would be happy and say we were done, even if it wasn’t justifiable and couldn’t explain it to humans. I agree it would not be aligned in the sense of the post.

> So, I’m not even sure it is sensible to think of $U_H$ alone as capturing human preferences; maybe $U_H$ doesn’t really make sense apart from $P_H$.

This seems to argue that if my AI knew the winning lottery numbers, but didn’t have a chance to tell me how it knows this, then it shouldn’t buy the winning lottery ticket. I agree the Jeffrey-Bolker rotation seems to indicate that we should think of probutilities instead of probabilities and utilities separately, but it seems like there really are some very clear actual differences in the real world, and we should account for it somehow. Perhaps one difference is that probabilities change in response to new information, whereas (idealized) utility functions don’t. (Obviously humans don’t have idealized utility functions, but this is all a theoretical exercise anyway.)

> I agree that “even assuming we know the true utility function, optimizing it is hard”, but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.

Thanks for clarifying, that’s clearer to me now.

> If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.

I generally agree with the objective you propose (for practical reasons). The obvious way to do this is to do imitation learning, where (to a first approximation) you just copy the human’s policy. (Or alternatively, have the policy that a human would approve of you having.) This won’t let you exceed human intelligence, which seems like a pretty big problem. Do you expect an AI using policy alignment to do better than humans at tasks? If so, how is it doing better? My normal answer to this in the EV framework is “it has better estimates of probabilities of future states”, but we can’t do that any more. Perhaps you’re hoping that the AI can explain its plan to a human, and the human will then approve of it even though they wouldn’t have before the explanation. In that case, the human’s probutilities have changed, which means that policy alignment is now “alignment to a thing that I can manipulate”, which seems bad.

Fwiw I am generally in favor of approaches along the lines of policy alignment, I’m more confused about the theory behind it here.

• I’m not even sure whether you are closer or further from understanding what I meant, now. I think you are probably closer, but stating it in a way I wouldn’t. I see that I need to do some careful disambiguation of background assumptions and language.

> Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.

This part, at least, is getting at the same intuition I’m coming from. However, I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies. (I am thinking I’ll write another post to make that connection clearer.)

I will have to think harder about the difference between how you’re framing things and how I would frame things, to try to clarify more.

• > I’m not even sure whether you are closer or further from understanding what I meant, now.

:(

> I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies.

My assumption was that you were arguing for why learning policies directly (assuming we could do it) has advantages over the default approach of value learning + optimization. That framing seems to explain most of the post.

• > It’s quite possible in this case that it takes all the money.

Did you mean to say that it’s quite possible that it takes half the money?

• Separately, I still don’t understand the counterfactual mugging case. (Disclaimer, I haven’t gone through any math around counterfactual mugging.) It seems really strange that if the human was certain about the digit, they wouldn’t pay up, but if the human is uncertain about the digit but is certain that the AI knows the digit, then the human would not want the AI to intervene. But possibly it’s not worth getting into this detail.

> Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10.
>
> We suppose that, since there is a short proof of exactly what Omega does, it is already present in the mathematical database included in the AI’s prior.
>
> If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is; taking less money just has a lower expected utility. So, it will get only $10 from Omega.
If the AI is a policy-ap­proval agent, it will think about what would have a higher ex­pect­a­tion in the hu­man’s ex­pect­a­tion: tak­ing half, or tak­ing it all. It’s quite pos­sible in this case that it takes all the money.

I think as­sum­ing that you have ac­cess to the proof of what Omega does means that you have already de­term­ined your own be­ha­vior. Pre­sum­ably, “what Omega does” de­pends on your own policy, so if you have a proof about what Omega does, that proof also de­term­ines your ac­tion, and there is noth­ing left for the agent to con­sider.

To be clear, I think it’s reasonable to consider AIs that try to figure out proofs of “what Omega does”, but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does. And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.

• I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior.

You may not recognize it as such, especially if Omega is using a different axiom system than you. So, you can still be ignorant of what you’ll do while knowing what Omega’s prediction of you is. This makes it impossible for your probability distribution to treat the two as correlated anymore.

but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does

Yeah, that’s the problem here.

And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.

Only if the agent takes that one proof out of the prior, but still has enough struc­ture in the prior to see how the de­cision prob­lem plays out. This is the prob­lem of con­struct­ing a thin prior. You can (more or less) solve any de­cision prob­lem by mak­ing the agent suf­fi­ciently up­date­less, but you run up against the prob­lem of mak­ing it too up­date­less, at which point it be­haves in ab­surd ways (lack­ing enough struc­ture to even un­der­stand the con­sequences of policies cor­rectly).

Hence the in­tu­ition that the cor­rect prior to be up­date­less with re­spect to is the hu­man one (which is, es­sen­tially, the main point of the post).
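A minimal sketch of the box example from this exchange, assuming Omega predicts the AI's policy perfectly. The function names are hypothetical, and `policy_level_choice` only stands in for an agent whose prior does not already contain the proof of what Omega did; it says nothing about how to construct a suitably thin prior.

```python
def omega_fill(policy: str) -> int:
    """Omega's rule from the example: $1000 if it predicts 'half', else $10."""
    return 1000 if policy == "half" else 10


def payout(policy: str, box_amount: int) -> float:
    """Money the AI walks away with for a given policy and box contents."""
    return box_amount / 2 if policy == "half" else float(box_amount)


def updated_choice(known_amount: int) -> str:
    """Agent that already knows the box contents and treats them as fixed."""
    return max(("half", "all"), key=lambda p: payout(p, known_amount))


def policy_level_choice() -> str:
    """Agent that scores each policy together with Omega's response to it."""
    return max(("half", "all"), key=lambda p: payout(p, omega_fill(p)))


if __name__ == "__main__":
    # Whatever amount is already known, taking everything looks better.
    for amount in (10, 1000):
        print(amount, updated_choice(amount))
    # Scoring whole policies instead favors taking half.
    best = policy_level_choice()
    print(best, payout(best, omega_fill(best)))
```

Once the amount is treated as a known fact, "all" dominates for either amount, which is why the agent with the proof in its prior ends up with $10. The $500 outcome only appears when the policy is scored before that fact is fixed.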

• Hey there!

A use­ful thing would be an ex­ample of when a policy ap­proval agent would do some­thing that a hu­man wouldn’t, and what gains in ef­fi­ciency the policy ap­proval agent has over a nor­mal hu­man act­ing.

I feel that the for­mu­la­tion “the hu­mans have a util­ity func­tion” may ob­scure part of what’s go­ing on. Part of the ad­vant­ages of ap­proval agents is that they al­low hu­mans to ex­press their some­times in­co­her­ent meta-pref­er­ences as well (“yeah, I want to do X, but don’t force me to do it”). As­sum­ing the hu­man pref­er­ences are already co­her­ent re­duces the at­trac­tion of the ap­proach.

• Ah, I agree that this pro­posal may have bet­ter ways to re­lax the as­sump­tion that the hu­man has a util­ity func­tion than value-learn­ing does. I wanted to fo­cus on the sim­pler case here. Per­haps I’ll write a fol­low-up post con­sid­er­ing the gen­er­al­iz­a­tion.

Maybe I’ll try to in­sert an ex­ample where the policy ap­proval agent does some­thing the hu­man wouldn’t into this post, though.

Here’s a first stab: suppose that the AI has a subroutine which solves complex planning problems. Furthermore, the human trusts the subroutine (does not expect it to be cleverly choosing plans which solve the problems as stated but cause other problems). The human is smart enough to formulate day-to-day management problems which arise at work as formally-specified planning problems, and would like to be told what the answers to those problems are. In this case, the AI will tell the human those answers.

This also il­lus­trates a lim­ited way the policy-ap­proval agent can avoid over-op­tim­iz­ing sim­pli­fied prob­lem state­ments: if the hu­man does not trust the plan­ning sub­routine (ex­pects it to good­hart or such), then the AI will not use such a sub­routine.

(This isn’t max­im­ally sat­is­fact­ory, since the hu­man may eas­ily be mis­taken about what sub­routines to trust. I think the AI can do a little bet­ter than this, but maybe not in a way which ad­dresses the fun­da­mental is­sue.)

• Iter­ated dis­til­la­tion and amp­li­fic­a­tion seems like an ex­ample of a thing that is like policy ap­proval, and it could do lots of things that a hu­man is un­able to, such as be­com­ing really good at chess or Go. (You can ima­gine re­mov­ing the dis­til­la­tion steps if those seem too dif­fer­ent from policy ap­proval, and the point still ap­plies.)

• I think there are in­ter­est­ing con­nec­tions between HCH/​IDA and policy ap­proval, which I hope to write more about some time.

• What about call­ing it “policy align­ment” in ana­logy with “value align­ment”?

So, the AI still needs to figure out what is “irrational” and what is “real” in P_H, just like value-learning needs to do for U_H.

Since I’m very confused about what my P_H should be (I may be happy to change it in any number of ways if someone gave me the correct solutions to a bunch of philosophical problems), there may not be anything “real” in my P_H that I’d want an AI to learn and use in an uncritical way. It seems like this mostly comes down to what probabilities really are: if probabilities are something objective like “how real” or “how much existence” each possible world is/has, then I’d want an AI to use its greater intellect to figure out what is the correct prior and use that, but if probabilities are something subjective like how much I care about each possible world, then maybe I’d want the AI to learn and use my P_H. I’m kind of confused that you give a bunch of what seem to me to be less important considerations on whether the AI should use my probability function or its own to make decisions, and don’t mention this one.

• “Policy align­ment” seems like an im­prove­ment, es­pe­cially since “policy ap­proval” in­vokes gov­ern­ment policy.

With re­spect to the rest:

On the one hand, I’m tempted to say that to the extent you recognize how confused you are about what probabilities are, and that this confusion has to do with how you reason in the real world, your P_H is going to change a lot when updated on certain philosophical arguments. As a result, optimizing a strategy updatelessly via P_H is going to take that into account, shifting behavior significantly in contingencies in which various philosophical arguments emerge, and potentially putting a significant amount of processing power toward searching for such arguments.

On the other hand, I buy my “policy alignment” proposal only to the extent that I buy UDT, which is not entirely. I don’t know how to think about UDT together with the shifting probabilities which come from logical induction. The problem is similar to the one you outline: just as it is unclear that a human should think its own P_H has any useful content which should be locked in forever in an updateless reasoner, it is similarly unclear that a fixed logical inductor state (after running for a finite amount of time) has any useful content which one would want to lock in forever.

I don’t yet know how to think about this problem. I suspect there’s something non-obvious to be said about the extent to which P_H trusts other belief distributions (IE, something at least a bit more compelling than the answer I gave first, but not entirely different in form).

• I was really surprised that the “background problem” is almost the same problem as in value learning in some formulations of bounded rationality. In the information-theoretic bounded rationality formalism, the bounded agent acts based on a combination of a prior (representing previous knowledge) and utilities (what the agent wants). (It seems that in some cases where humans update, it is possible to disentangle the two.)

While the “counter­examples” to “op­tim­iz­ing hu­man util­ity ac­cord­ing to AI be­lief” show how this fails in some­what tricky cases, it seems to me it will be easy to find “counter­examples” where “policy-ap­proval agent” would fail (as com­pared to what is in­tu­it­ively good)

From an “en­gin­eer­ing per­spect­ive”, if I was forced to choose some­thing right now, it would be an AI “op­tim­iz­ing hu­man util­ity ac­cord­ing to AI be­liefs” but ask­ing for cla­ri­fic­a­tion when such choice di­verges too much from the “policy-ap­proval”.

• While the “counter­examples” to “op­tim­iz­ing hu­man util­ity ac­cord­ing to AI be­lief” show how this fails in some­what tricky cases, it seems to me it will be easy to find “counter­examples” where “policy-ap­proval agent” would fail (as com­pared to what is in­tu­it­ively good)

I agree that it’ll be easy to find counter­examples to policy-ap­proval, but I think it’ll be harder than for value-align­ment agents. We have the ad­vant­age that (in the lim­ited sense provided by the as­sump­tion that the hu­man has a co­her­ent prob­ab­il­ity and util­ity) we can prove that we “do what the hu­man would want” (in a more com­pre­hens­ive sense than we can for value align­ment).

• To me this sort of approach feels like a non-starter because you’re ignoring the thing that generates the policy in favor of the policy itself, which would seem to expose you to Goodharting that would be even worse than the Goodharting we expect in terms of values, since policy is a grosser instrument. Is there some way in which you think this is not the case, namely that focusing on policy alignment would help us better avoid Goodharting than is possible with value alignment?

• Even if the pro­cess of learn­ing P_H is do­ing the work to turn it into a co­her­ent prob­ab­il­ity dis­tri­bu­tion (re­mov­ing ir­ra­tion­al­ity and mak­ing things well-defined), the end res­ult may find situ­ations which the AI finds it­self in too com­plex to be con­ceived.

I had trouble pars­ing the end of this sen­tence. Is the idea that the AI might get into situ­ations that are too com­plex for the hu­mans to un­der­stand?

• Yeah. I’ve ed­ited it a bit for clar­ity.

• “Ignor­ing is­sues of ir­ra­tion­al­ity or bounded ra­tion­al­ity, what an agent wants out of a helper agent is that the helper agent does pre­ferred things.”

I don’t want a “helper agent” to do what I think I’d prefer it to do. I mean, I REALLY don’t want that or any­thing like that.

If I wanted that, I could just set it up to fol­low or­ders to the best of its un­der­stand­ing, and then or­der it around. The whole point is to make use of the fact that it’s smarter than I am and can achieve out­comes I can’t fore­see in ways I can’t think up.

What I in­tu­it­ively want it to do is what makes me hap­pi­est with the state of the world after it’s done it. That par­tic­u­lar for­mu­la­tion may get hairy with cases where its ac­tions al­ter my pref­er­ences, but just abandon­ing every pos­sible im­prove­ment in fa­vor of my pre-ex­ist­ing guesses about de­sir­able ac­tions isn’t a sat­is­fact­ory an­swer.

• If I wanted that, I could just set it up to fol­low or­ders to the best of its un­der­stand­ing, and then or­der it around. The whole point is to make use of the fact that it’s smarter than I am and can achieve out­comes I can’t fore­see in ways I can’t think up.

The AI here can do things which you wouldn’t think up.

For ex­ample, it could have more com­pu­ta­tional power than you to search for plans which max­im­ize ex­pec­ted util­ity ac­cord­ing to your prob­ab­il­ity and util­ity func­tions. Then, it could tell you the an­swer, if you’re the kind of per­son who likes to be told those kinds of an­swers (IE, if this doesn’t vi­ol­ate your sense of autonomy/​self-de­term­in­a­tion).

Or, if there is any algorithm whose beliefs you trust more than your own, or would trust more than your own if some conditions held (which the AI can itself check), then the AI can optimize your utility function in expectation under that algorithm’s beliefs rather than under your own beliefs, since you would prefer that.
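A toy sketch of that last point, under loudly stated assumptions: two equally likely worlds, a binary report from the trusted algorithm, and a human prior that gives the report a 0.9 chance of matching the world. The names `ACCURACY`, `human_prior`, and `POLICIES` are hypothetical and not from the post. The sketch scores whole policies by expected utility under the human's own joint beliefs and shows that those beliefs themselves rank "follow the report" highest.

```python
from itertools import product

# Assumed for illustration: the human's credence that the trusted
# algorithm's report matches the true world.
ACCURACY = 0.9


def human_prior():
    """Joint human beliefs over (world, report); both worlds equally likely."""
    joint = {}
    for world, report in product((0, 1), repeat=2):
        p_report = ACCURACY if report == world else 1 - ACCURACY
        joint[(world, report)] = 0.5 * p_report
    return joint


def utility(action, world):
    """The human's utility: 1 if the action matches the world, else 0."""
    return 1.0 if action == world else 0.0


def expected_utility(policy, joint):
    """Score a policy (a map from report to action) under the human's beliefs."""
    return sum(p * utility(policy(report), world)
               for (world, report), p in joint.items())


POLICIES = {
    "ignore_report_act_0": lambda report: 0,
    "ignore_report_act_1": lambda report: 1,
    "follow_report": lambda report: report,
    "invert_report": lambda report: 1 - report,
}


def approved_policy(joint):
    """Name of the policy with the highest expected utility under the prior."""
    return max(POLICIES, key=lambda name: expected_utility(POLICIES[name], joint))


if __name__ == "__main__":
    joint = human_prior()
    for name, policy in POLICIES.items():
        print(name, round(expected_utility(policy, joint), 3))
    print("approved:", approved_policy(joint))
```

Nothing here replaces the human's beliefs with the algorithm's. Deferral wins because the human prior already treats the report as informative, and it would stop winning if `ACCURACY` dropped to 0.5 or below.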

• For ex­ample, it could have more com­pu­ta­tional power than you to search for plans which max­im­ize ex­pec­ted util­ity ac­cord­ing to your prob­ab­il­ity and util­ity func­tions. Then, it could tell you the answer

Would it, though? It’s not eval­u­at­ing ac­tions on my fu­ture probutil­ity, oth­er­wise it would wire­head me. It’s eval­u­at­ing ac­tions on my present probutil­ity. So now the an­swer seems to de­pend on whether we al­low “tell me the right an­swer” as a prim­it­ive ac­tion, or if it is eval­u­ated as “tell me [String],” which has low probutil­ity.

But of course, if “tell me the right answer” is primitive, how do we stop “do the right thing” from being primitive, which lands us right back in the hot water of strong optimization of ‘utility’ this proposal was supposed to prevent? So I think it should evaluate the specific output, which has low probability(human), and therefore not tell you.

• I’ll try and write up a proof that it can do what I think it can.