Not Deceiving the Evaluator

This is a construction of an agent for which I haven’t identified a form of deception that the agent would obviously be incentivized to pursue.

Consider an agent and an evaluator. The agent sees past actions, observations, and rewards, and picks actions. The environment sees the same, and provides observations. The evaluator sees the same, and provides rewards.

A universal POMDP (without reward) is one that includes all computable countable-state POMDPs (without reward) as subgraphs. Let $W$ be a universal POMDP (without reward). ($W$ is for world.) Let $a_t$, $o_t$, and $r_t$ be the action, observation, and reward at timestep $t$. Let $h_{<t} = a_1 o_1 r_1 \, a_2 o_2 r_2 \ldots a_{t-1} o_{t-1} r_{t-1}$. Let $\mathcal{S}$ be the set of states in $W$. Let $\mathrm{PRIOR}$ be the set of all computable prior distributions over $\mathcal{S}$. The agent believes the evaluator has a prior, sampled from $\mathrm{PRIOR}$, over which state in $W$ everyone starts in. By “the agent believes”, I mean that the agent has a nonzero prior over every prior in $\mathrm{PRIOR}$, and this is the agent’s initial credence that the evaluator has that prior over initial world states.

The agent’s beliefs are denoted $B$, so for $p \in \mathrm{PRIOR}$, $B(p \mid h_{<t})$ denotes the agent’s posterior belief after observing $h_{<t}$ that the evaluator began with the prior $p$ over $\mathcal{S}$ as to what the initial state was. Similarly, for a state sequence $\bar{s}_{\le t}$, $B(\bar{s}_{\le t} \mid h_{<t})$ denotes the agent’s posterior belief after observing $h_{<t}$ that it has traversed the sequence of states $\bar{s}_{\le t}$. The overline indicates that we are not necessarily referring to the true sequence of states traversed.

Let $\mathcal{U}$ be the set of all computable utility functions mapping $\mathcal{S} \to \mathbb{R}$. For $u \in \mathcal{U}$, let $B(u \mid h_{<t})$ denote the agent’s posterior belief after observing $h_{<t}$ that the evaluator has utility function $u$.

A policy $\pi$, an initial state $s_0$, a prior $p \in \mathrm{PRIOR}$, and a utility function $u \in \mathcal{U}$ induce a measure over interaction histories as follows. $a_t$ is sampled from $\pi(\cdot \mid h_{<t})$. $o_t$ is sampled from $W(\cdot \mid s_t)$. $s_t$ follows deterministically from $(s_{t-1}, a_t)$ according to $W$. $p(\cdot \mid a_{\le t}, o_{\le t})$ is the belief distribution over $\bar{s}_{\le t}$ (which states have been visited so far) that follows from $p$ by Bayesian updating on $a_{\le t}$ and $o_{\le t}$. With $\bar{s}_{\le t}$ sampled from $p(\cdot \mid a_{\le t}, o_{\le t})$, $r_t = u(\bar{s}_t)$. Note that for human evaluators, the rewards will not actually be provided in this way; that would require us to write down our utility function, and sample from our belief distribution. However, the agent believes that this is how the evaluator produces rewards. Let $P^{\pi}_{s_0, p, u}$ be this probability measure over infinite interaction histories $h$ and state sequences $s$.
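To make this reward model concrete, here is a minimal sketch of how such an evaluator could be simulated in a finite toy world. Everything here (the state labels, the transitions, the action “m”) is a hypothetical stand-in, not part of the construction above; the point is just the order of operations: update a prior on actions and observations, sample a state sequence from the posterior, and report the utility of the sampled current state.

```python
import random

# Hypothetical toy world (labels and dynamics are stand-ins, not from the post).
# Transitions are deterministic; observations don't fully reveal the state:
# BC emits the same observation as C.
TRANS = {("A", "m"): "C", ("A2", "m"): "BC"}
OBS = {"A": "a", "A2": "a", "C": "c", "BC": "c"}

def posterior_over_sequences(prior, actions, observations):
    """The evaluator's exact posterior over state sequences, by enumeration.
    The evaluator sees the actions and observations, never the states."""
    weights = {}
    for s0, w in prior.items():
        seq, s, consistent = [s0], s0, w > 0
        for a, o in zip(actions, observations):
            s = TRANS.get((s, a))
            if s is None or OBS[s] != o:   # inconsistent with the history
                consistent = False
                break
            seq.append(s)
        if consistent:
            weights[tuple(seq)] = w
    total = sum(weights.values())
    return {seq: w / total for seq, w in weights.items()}

def sample_reward(prior, utility, actions, observations):
    """r_t: utility of the last state of a sequence drawn from the posterior."""
    post = posterior_over_sequences(prior, actions, observations)
    seqs = list(post)
    seq = random.choices(seqs, weights=[post[q] for q in seqs])[0]
    return utility(seq[-1])

# The evaluator's prior: 4/5 that they started in A, 1/5 in A2.
prior = {"A": 0.8, "A2": 0.2}
u = {"A": 0.0, "A2": 0.0, "C": 10.0, "BC": -1.0}
# After action "m" and observation "c": reward 10 w.p. 4/5, -1 w.p. 1/5.
print(sample_reward(prior, lambda s: u[s], ["m"], ["c"]))
```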

Fixing a horizon $T$, the agent picks a policy at the beginning, and follows that policy:

$$\pi^* \in \operatorname*{argmax}_{\pi} \; \sum_{s_0, p, u} B(s_0, p, u) \; \mathbb{E}_{P^{\pi}_{s_0, p, u}} \left[ u(s_{\le T}) \right]$$

ETA: We can write this in another way that is more cumbersome, but may be more intuitive to some:

$$a^*_1 = \operatorname*{argmax}_{a_1} \sum_{o_1 r_1} B(o_1 r_1 \mid a_1) \; \max_{a_2} \sum_{o_2 r_2} B(o_2 r_2 \mid h_{\le 1} a_2) \; \ldots$$

where the $\ldots$ unrolls the expectimax, with each $\max_{a_t} \sum_{o_t r_t} B(o_t r_t \mid h_{<t} a_t)$ being replaced by $\max_{a_{t+1}} \sum_{o_{t+1} r_{t+1}} B(o_{t+1} r_{t+1} \mid h_{\le t} a_{t+1})$, until finally, once $t$ reaches $T$, instead of another $\max$, we write $\mathbb{E}_B\!\left[ u(\bar{s}_{\le T}) \mid h_{\le T} \right]$.

Conjecture: the agent does not attempt to deceive the evaluator. The agent’s utility depends on the state, not the reward, and when observations are more informative about the state, rewards are more informative about the utility function. Thus, the agent has an interest in taking actions that cause the evaluator to receive observations that reduce the evaluator’s uncertainty about which state they are in.

• (I am confused, these are clarifying questions. I’m probably missing a basic point that would answer all of these questions.)

Is the point you are trying to make different from the one in Learning What to Value? (Specifically, the point about observation-utility maximizers.) If so, how?

Do you have PRIOR in order to make the evaluator more realistic? Does the theoretical point still stand if we get rid of PRIOR and instead have an evaluator that has direct access to states?

How does the evaluator influence the behavior of the agent? For a fixed $(s_0, p, u)$ it seems that the expectation of $u(s_{\le T})$ is independent of the evaluator. Since the sets $\mathrm{PRIOR}$ and $\mathcal{U}$ are also fixed and independent of the evaluator, the argument to the argmax is also independent of the evaluator, and so the chosen policy is independent of the evaluator.

ETA: Looks like TurnTrout had the same confusion as me and we had a race condition in reporting it; I also agree with his meta point.

• I worked out a toy example that may be helpful. Suppose the setup is this: there are states labeled 0–10, actions labeled 0–10, and observations labeled 0–10; the initial state is 0; each action takes the system into the state with the same label, and the agent/evaluator get the observation with the same label; and there are two equally probable utility functions: the sum of the state labels over time, or the negative of that.

First, suppose the policy is to always do the same action. Then when you sum over the two utility functions, the utilities cancel out, so the expected utility is 0. Now suppose the policy is to do any non-zero action (let’s say 5) at the start, and then do 0 if the agent observes a negative reward and 10 if the agent observes a positive reward. Now when you sum over the utility functions: in the negative case the utility is −5 + 0 (this policy implies that conditional on that utility function, with probability 1 the state trajectory is 0, 5, 0), and in the positive case it’s 5 + 10 (state trajectory 0, 5, 10), so EU is .5 × −5 + .5 × 15 = 5. So this is a better policy than the first one, and it should be easy to see that it’s optimal.
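A quick numeric check of the above, as a sketch. I represent a policy as a function from the first observed reward (None before any reward is seen) to the next action; everything else follows the setup as described:

```python
def expected_utility(policy):
    """Average the total utility of the visited states over the two
    equally probable utility functions: +sum and -sum of state labels."""
    total = 0.0
    for sign in (+1, -1):
        first = policy(None)    # first action -> state with the same label
        r1 = sign * first       # observations reveal the state, so the
                                # evaluator rewards its exact utility
        second = policy(r1)     # react to the observed reward
        total += 0.5 * sign * (first + second)
    return total

always_5 = lambda r: 5
adaptive = lambda r: 5 if r is None else (0 if r < 0 else 10)
print(expected_utility(always_5))   # 0.0
print(expected_utility(adaptive))   # 5.0
```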

Hope I understood the idea correctly and that this helps explain it?

Is the point you are trying to make different from the one in Learning What to Value? (Specifically, the point about observation-utility maximizers.) If so, how?

It looks closer to the Value Learning Agent in that paper to me and maybe can be considered an implementation / specific instance of that? (Although I haven’t tried to figure out whether that’s mathematically / formally the case.)

Something that confuses me is that since the evaluator sees everything the agent sees/does, it’s not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?

• Something that confuses me is that since the evaluator sees everything the agent sees/does, it’s not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?

(Copying a comment I just made elsewhere)

This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.

(Expanding on it)

So suppose the evaluator was human. The human’s lifetime of observations in the past give it a posterior belief distribution which looks to the agent like a weird prior, with certain domains that involve oddly specific convictions. The agent could steer the world toward those domains, and steer towards observations that will make the evaluator believe they are in a state with very high utility. But it won’t be particularly interested in this, and it might even be particularly disinterested, because the information it gets about what the evaluator values may be less relevant to the actual states it finds itself in a position to navigate between, if the agent believes the evaluator believes they are in a different region of the state space. I can work on a toy example if that isn’t satisfying.

ETA: One such “oddly specific conviction”, e.g., might be the relative implausibility of being placed in a delusion box where all the observations are manufactured.

• I think I vaguely understand, but it would be a lot clearer if you gave a concrete example. Also please update in the direction that people often find it hard to understand things without examples, and giving examples preemptively is very cost effective in general.

• In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A″. State BC looks like C, but has utility like B. C is the best state.

ETA: And for a sequence of states $\bar{s}_{\le t}$, $u(\bar{s}_{\le t})$ is the sum of the utilities of the individual states.

A′ and A″ look like A, and BC looks like C.

In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent’s belief distribution.

The agent is quite sure they’re in state A.

The agent is quite sure that the evaluator is pretty sure they’re in state A″, which is a very similar state, but with one key difference: from A″, the agent’s next action has no effect. The agent won’t capitalize on this confusion.

The optimal policy is the action that heads toward C, followed (forever) by the action that stays in C if the first reward is positive, and otherwise by the action that returns toward A. Since the agent is all but certain about the utility function, none of the other details matter much.

Note that the agent could get higher reward by steering toward BC instead and staying there forever. The reason for this is that after the evaluator observes the observation C, it will assign probability 4/5 to being in state C, and probability 1/5 to being in state BC. Since they will stay in that state forever, 4/5 of the time the reward will be 10, and 1/5 of the time the reward will be −1.
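Spelling out that per-step expected reward under the deceptive alternative:

$$\mathbb{E}[r_t] = \tfrac{4}{5} \cdot 10 + \tfrac{1}{5} \cdot (-1) = 7.8,$$

which is higher than what the honest policy earns, while the true state would be BC, whose utility under the believed utility function is −1. The agent forgoes the higher reward stream because its objective is the utility of the true state sequence.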

The agent doesn’t have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state X, but under many utility functions has utility like state Y, and if it seems like the evaluator finds that sort of state a priori unlikely, then this logic applies.

• Thanks for the example, which is really helpful. Am I correct in thinking the following? In general, the agent will be doing what is optimal according to its own prior, which won’t be the same as what is optimal according to the evaluator’s prior. So if the evaluator is a rational agent trying to optimize the world according to its own prior, then it would not actually reward the agent according to what this scheme specifies, but would instead reward the agent in such a way as to cause the agent to act according to a policy that the evaluator thinks is best. In other words, the evaluator has an incentive to deceive the agent as to what its prior and/or utility function actually are?

• This seems correct. The agent’s policy is optimal by definition with respect to its beliefs about the evaluator’s “policy” in providing rewards, but that evaluator-policy is not optimal with respect to the agent’s policy. In fact, I’m skeptical that in a general CIRL game, there exists a policy pair for the agent and the evaluator/principal/human such that each is optimal with respect to true beliefs about the other’s policy. But I don’t think this is a big problem. For a human evaluator, I think they would be wise to report utility honestly, rather than assume they know something the AI doesn’t.

• Oh, I see. The reason my argument is wrong is that while for a specific $(s_0, p, u)$ the optimal policy is independent of the evaluator, you don’t get to choose a separate policy for each $(s_0, p, u)$: you have to use the evaluator’s rewards to distinguish which case you are in, and then specialize your policy to that case.

It looks closer to the Value Learning Agent in that paper to me and maybe can be considered an implementation / specific instance of that?

I think I intuitively agree, but I also haven’t checked it formally. But the point about no-deception seems to be similar to the point about observation-utility maximizers not wanting to wirehead. This agent also ends up learning which utility function is the right one, and in that sense is like the Value Learning agent.

• So I still don’t understand the details, and maybe my opinion will change if I sit down and look at it more carefully. But I’m suspicious of this being a clean incentive improvement that gets us what we want, because defining the evaluator is a fuzzy problem as I understand it, as is even informally agreeing on what counts as deception of a less capable evaluator. In general, it seems that if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours.

• Tbc, I’m not saying I believe the claim of no deception, just that it now makes sense that this is an agent that has interesting behavior that we can analyze.

• defining the evaluator is a fuzzy problem

I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.

if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours

I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. A “wrong” formalism of “deception” can’t affect agent behavior, because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the dangers of using a wrong formalism in an argument; the dangers of using the wrong formalism in an argument are straightforward: the argument is garbage.

• What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.

And if we can’t agree informally on deception’s definition, I’m asking how we can say a proposal has that property.

• It looks closer to the Value Learning Agent in that paper to me and maybe can be considered an implementation / specific instance of that?

Yes. What the value learning agent doesn’t specify is what constitutes observational evidence of the utility function, or in this notation, how to calculate $B(u \mid h_{<t})$ and thereby calculate the expected utility of a policy. So this construction makes a choice about how to specify how the true utility function becomes manifest in the agent’s observations. A number of simpler choices don’t seem to work.

• Is the point you are trying to make different from the one in Learning What to Value? (Specifically, the point about observation-utility maximizers.) If so, how?

I may be missing something, but it looks to me like specifying an observation-utility maximizer requires writing down a correct utility function? We don’t need to do that for this agent.

Do you have PRIOR in order to make the evaluator more realistic? Does the theoretical point still stand if we get rid of PRIOR and instead have an evaluator that has direct access to states?

Yes, sort of. If the evaluator had access to the state, it would be impossible to deceive the evaluator, since they know everything. This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.

How does the evaluator influence the behavior of the agent?

Wei’s answer is good; it also might be helpful to note that with $\pi^*$ defined in this way, the action at timestep $t$ equals the same thing, but with everything on the right hand side conditioned on $h_{<t}$ as well. When written that way, it is easier to notice the appearance of $B(u \mid h_{<t})$, which captures how the agent learns a utility function from the rewards.
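Schematically (suppressing the expectimax over future actions), the conditioned objective looks like:

$$a^*_t = \operatorname*{argmax}_{a_t} \; \sum_{u \in \mathcal{U}} B(u \mid h_{<t}) \; \mathbb{E}\!\left[\, u(\bar{s}_{\le T}) \,\middle|\, h_{<t}, a_t \right],$$

where the posterior $B(u \mid h_{<t})$ is exactly where the observed rewards enter.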

• I may be missing something, but it looks to me like specifying an observation-utility maximizer requires writing down a correct utility function? We don’t need to do that for this agent.

I was more looking for the simplest example of “no deception”. My claim was that an observation-utility maximizer is not incentivized to deceive its utility function. But now I see what you meant by “deceive”, so we can ignore that point.

• Meta: I’d have appreciated a version with less math, because extra formalization can hide the contribution. Or, first explain colloquially why you believe X, and then show the math that shows X.

I don’t see your claim. The agent looks heavily incentivized to steer state sequences to be desirable according to its utility mixture. How does the evaluator even enter the picture?

• OK, I finally identified an incentive for deception. I think it was difficult for me to find because it’s not really about deceiving the evaluator.

Here’s a hypothesis that observations will never refute: the utility which the evaluator assigns to a state is equal to the reward that a human would provide, if a human controlled the provision of reward (instead of the evaluator). Under this hypothesis, maximizing evaluator-utility is identical to creating observations which will convince a human to provide high reward (a task which entails deception when done optimally). In a sense, the AI doesn’t think it’s deceiving the evaluator; it thinks the evaluator fully understands what’s going on and likes seeing things that would confuse a human into providing high reward, as if the evaluator is “in on the joke”. One of my take-aways here is that some of the conceptual framing I did got in the way of identifying a failure mode.

• I would be interested in seeing an example that illustrates this failure mode.

• I can’t really do this toy-example-style, because the key feature is that the AI has a model of a deceived agent, and I can’t see how to spin up such a thing in an MDP with a few dozen states.

Luckily, most of the machinery of the setup isn’t needed to illustrate this. Abstracting away some of the details, the agent is learning a function from strings of states to utility, which it observes in a roundabout way. I don’t have a mathematical formulation of a function mapping state sequences to the real numbers that can be described by the phrase “the value of the reward that a certain human would provide upon observing the observations produced by the given sequence of states”, but suffice it to say that this function exists. (Really we’re dealing with distributions/stochastic functions, but that doesn’t really change the fundamentals; it just makes it more cumbersome.) While I can’t give that function in simple mathematical form, hopefully it’s a legible enough mathematical object.

If the evaluator has this utility function, she will always provide reward equal to the utility of the state, because even if she is uncertain about the state, this utility function only depends on the observations, which she has access to. (Again, the stochastic case is more complicated, but the conclusion is the same.) And indeed, if a human is playing the role of evaluator when this program is run, the rewards will match the function in question, by definition. Therefore, no observed reward will contradict the belief that this function is the true utility function. Technically, the infimum of the posterior on this utility function is strictly positive with probability 1.
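One way to write the problematic hypothesis down, as a schematic formalization (the symbol $H$ is a stand-in for the human’s reward-giving behavior, not notation from the post): let $o_{\le t}(\bar{s}_{\le t})$ be the observations produced by a state sequence, and let $H(o_{\le t})$ be the reward a certain human would provide upon seeing them. Then define

$$u_H(\bar{s}_{\le t}) := H\big(o_{\le t}(\bar{s}_{\le t})\big).$$

Since $u_H$ depends on the state sequence only through the observations, an evaluator with utility function $u_H$ reports $r_t = u_H(\bar{s}_{\le t})$ exactly, whatever her posterior over states; and a human evaluator’s rewards match $u_H$ by definition, so no reward evidence ever counts against this hypothesis.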

Sorry, this isn’t really any more illustrative an example, but hopefully it’s a clearer explanation.

• As others have commented, it’s difficult to understand what this math is supposed to say.

My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator’s distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.

But this may be inaccurate, or there may be other material ideas here that I’ve missed.

• Ok! That’s very useful to know.

It seems pretty related to the Inverse Reward Design paper. I guess it’s a variation. Your setup seems to be more specific about how the evaluator acts, but more general about the environment.

• A bit of a nitpick: IRD and this formulate how the agent believes the evaluator acts, while being technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; experiments/theory might be predicated on additional assumptions about the evaluator).

I believe this agent’s beliefs about how the evaluator acts are much more general than IRD’s. If the agent believed the evaluator was certain about which environment they were in, and it was the “training environment” from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator’s beliefs might be.

I agree this agent should definitely be compared to IRD, since they are both agents who don’t “take rewards literally”, but rather process them in some way first. Note that the design space of things which fit this description is quite large.

• A key problem here is that if we use a human as the evaluator, the agent assigns 0 prior probability to the truth: the human won’t be able to update beliefs as a perfect Bayesian, sample a world-state history from his beliefs, and assign a value to it according to a utility function. For a Bayesian reasoner that assigns 0 prior probability to the truth, God only knows how it will behave, even in the limit. (Unless there is some very odd utility function such that the human could be described in this way?)

But maybe this problem could be fixed if the agent takes some more liberties in modeling the evaluator. Maybe once we have a better understanding of bounded approximately-Bayesian reasoning, the agent can model the human as being a bounded reasoner, not a perfectly Bayesian reasoner, which might allow the agent to assign a strictly positive prior to the truth.

And all this said, I don’t think we’re totally clueless when it comes to guessing how this agent would behave, even though a human evaluator would not satisfy the assumptions that the agent makes about him.

• (Edit note: Fixed spelling mistake in the title, let me know if it was intentional)