Not Deceiving the Evaluator

This is a construction of an agent for which I haven’t identified a form of deception that the agent would obviously be incentivized to pursue.

Consider an agent and an evaluator. The agent sees past actions, observations, and rewards, and picks actions. The environment sees the same, and provides observations. The evaluator sees the same, and provides rewards.

A universal POMDP (without reward) is one that includes all computable countable-state POMDPs (without reward) as subgraphs. Let $W$ be a universal POMDP (without reward). (W is for world.) Let $a_t$, $o_t$, and $r_t$ be the action, observation, and reward at timestep $t$. Let $h_t = (a_t, o_t, r_t)$, and let $h_{<t} = (h_1, \ldots, h_{t-1})$ be the interaction history before timestep $t$. Let $\mathcal{S}$ be the set of states in $W$. Let $\mathcal{P}$ be the set of all computable prior distributions over $\mathcal{S}$. The agent believes the evaluator has a prior sampled from $\mathcal{P}$ over which state in $W$ everyone starts in. By “the agent believes”, I mean that the agent has a nonzero prior over every prior in $\mathcal{P}$, and this is the agent’s initial credence that the evaluator has that prior over initial world states.
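To make these objects concrete, here is a minimal Python sketch of a stand-in world and a stand-in for $\mathcal{P}$. Everything here is illustrative and mine, not part of the construction itself: a genuine universal POMDP has countably many states and $\mathcal{P}$ is infinite, so the finite `World` class, the example transition table, and the short list `EVALUATOR_PRIORS` are only toy approximations.

```python
import random

class World:
    """A toy finite POMDP without reward: stochastic transitions and
    observations that are a deterministic function of the state.  A genuine
    universal POMDP has countably many states; this finite stand-in is only
    for illustration."""

    def __init__(self, states, transition, observe):
        self.states = states          # the state set S
        self.transition = transition  # (s, a) -> {s_next: probability}
        self.observe = observe        # s -> observation (deterministic)

    def step(self, s, a, rng=random):
        nxt = self.transition[(s, a)]
        return rng.choices(list(nxt.keys()), weights=list(nxt.values()))[0]

# A two-state, two-action example world.
W = World(
    states=["s0", "s1"],
    transition={
        ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
        ("s0", "right"): {"s0": 0.2, "s1": 0.8},
        ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
        ("s1", "right"): {"s0": 0.1, "s1": 0.9},
    },
    observe={"s0": "dark", "s1": "light"},
)

# A finite stand-in for the set P of evaluator priors over the initial state.
EVALUATOR_PRIORS = [
    {"s0": 0.5, "s1": 0.5},
    {"s0": 0.9, "s1": 0.1},
]
```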

The agent’s beliefs are denoted $\mathrm{Bel}$, so for $P \in \mathcal{P}$, $\mathrm{Bel}(P \mid h_{<t})$ denotes the agent’s posterior belief, after observing $h_{<t}$, that the evaluator began with the prior $P$ over $\mathcal{S}$ as to what the initial state was. Similarly, for $\bar{s} \in \mathcal{S}^t$, $\mathrm{Bel}(\bar{s} \mid h_{<t})$ denotes the agent’s posterior belief, after observing $h_{<t}$, that it has traversed the sequence of states $\bar{s}$. The overline indicates that we are not necessarily referring to the true sequence of states traversed.

Let $\mathcal{U}$ be the set of all computable utility functions mapping $\mathcal{S} \to [0, 1]$. For $u \in \mathcal{U}$, let $\mathrm{Bel}(u \mid h_{<t})$ denote the agent’s posterior belief, after observing $h_{<t}$, that the evaluator has utility function $u$.

A policy $\pi$, an initial state $s_0$, a prior $P \in \mathcal{P}$, and a utility function $u \in \mathcal{U}$ induce a measure over interaction histories as follows. $a_t$ is sampled from $\pi(\cdot \mid h_{<t})$. $s_t$ is sampled from $W(\cdot \mid s_{t-1}, a_t)$. $o_t$ follows deterministically from $s_t$ according to $W$. $B_t$ is the belief distribution over $\mathcal{S}^{t+1}$ (which states have been visited so far) that follows from $P$ by Bayesian updating on $a_{\leq t}$ and $o_{\leq t}$. With $\bar{s}$ sampled from $B_t$, $r_t = u(\bar{s}_t)$. Note that for human evaluators, the rewards will not actually be provided in this way; that would require us to write down our utility function, and sample from our belief distribution. However, the agent believes that this is how the evaluator produces rewards. Let $\mu^{\pi}_{s_0, P, u}$ be this probability measure over infinite interaction histories and state sequences.
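Below is a hedged sketch of that believed reward-generation process, reusing the toy `World` from the sketch above. The function names, the representation of a utility function as a dict from states to numbers in $[0, 1]$, and the brute-force enumeration of state sequences are all mine; the sketch also assumes the evaluator’s prior gives positive probability to some sequence consistent with the observations (true whenever the prior has full support).

```python
import random

def evaluator_posterior(world, prior, actions, observations):
    """The evaluator's belief over which states have been visited so far:
    start from the prior over the initial state, then Bayes-update on the
    actions taken and the observations received.  Brute-force enumeration
    of state sequences -- fine for a toy world, hopeless in general."""
    t = len(actions)
    posterior = {}

    def extend(seq, weight):
        k = len(seq) - 1
        if k == t:
            posterior[tuple(seq)] = posterior.get(tuple(seq), 0.0) + weight
            return
        for s_next, p in world.transition[(seq[-1], actions[k])].items():
            # observations are deterministic given the state, so a candidate
            # successor is ruled out if it would have produced a different one
            if world.observe[s_next] == observations[k]:
                extend(seq + [s_next], weight * p)

    for s0, p0 in prior.items():
        if p0 > 0:
            extend([s0], p0)
    total = sum(posterior.values())  # assumed > 0 (prior with full support)
    return {seq: w / total for seq, w in posterior.items()}

def sample_history(world, policy, s0, prior, utility, horizon, rng=random):
    """Sample one interaction history (and the true state sequence) from the
    measure induced by (policy, s0, prior, utility) -- i.e. the way the agent
    believes rewards are generated."""
    history = []                     # list of (action, observation, reward)
    true_states = [s0]
    actions, observations = [], []
    for _ in range(horizon):
        a = policy(history)                            # a_t from the policy
        s_next = world.step(true_states[-1], a, rng)   # s_t from W
        o = world.observe[s_next]                      # o_t deterministic in s_t
        actions.append(a)
        observations.append(o)
        true_states.append(s_next)
        # the evaluator Bayes-updates over visited-state sequences, samples
        # one, and rewards the utility of the current state in that sample
        belief = evaluator_posterior(world, prior, actions, observations)
        seqs = list(belief.keys())
        sampled = rng.choices(seqs, weights=[belief[s] for s in seqs])[0]
        r = utility[sampled[-1]]
        history.append((a, o, r))
    return history, true_states
```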

Fixing a horizon $m$, the agent picks a policy at the beginning, and follows that policy:

$$\pi^* \in \operatorname*{argmax}_{\pi} \; \mathbb{E}_{(s_0, P, u) \sim \mathrm{Bel}} \; \mathbb{E}_{\mu^{\pi}_{s_0, P, u}} \left[ \sum_{t=1}^{m} u(s_t) \right]$$

ETA: We can write this in another way that is more cumbersome, but may be more intuitive to some: unroll the objective as an expectimax, with each action choice of the policy replaced by a maximization over that action (and an expectation over the resulting observation and reward), until finally, once the timestep reaches the horizon $m$, instead of recursing further, we write the expected utility of the states traversed under the agent’s posterior.
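And a hedged sketch of the policy choice itself, reusing `W` and `sample_history` from the sketches above. It assumes the agent’s beliefs $\mathrm{Bel}$ are represented as a finite weighted list of $(s_0, P, u)$ hypotheses and that the argmax is taken over a small finite set of candidate policies; both simplifications are mine, since the actual argmax over all policies (and the expectation over all computable hypotheses) is not something we could run.

```python
import random

def expected_utility(world, policy, hypotheses, horizon, samples=200, rng=random):
    """Monte Carlo estimate of E_{(s0,P,u)~Bel} E_mu [ sum_t u(s_t) ].
    Note that the estimate scores the utility of the states actually
    traversed, not the rewards received along the way."""
    total = 0.0
    for weight, s0, prior, utility in hypotheses:
        for _ in range(samples):
            _, states = sample_history(world, policy, s0, prior, utility, horizon, rng)
            total += weight * sum(utility[s] for s in states[1:]) / samples
    return total

def choose_policy(world, candidate_policies, hypotheses, horizon):
    """Pick the policy with the highest estimated expected utility; the agent
    then follows that policy for the whole horizon."""
    return max(candidate_policies,
               key=lambda pi: expected_utility(world, pi, hypotheses, horizon))

# Example usage with the toy world above: one hypothesis about
# (initial state, evaluator prior, utility), and two memoryless policies.
hypotheses = [
    # (credence, initial state, evaluator prior, utility function)
    (1.0, "s0", {"s0": 0.5, "s1": 0.5}, {"s0": 0.0, "s1": 1.0}),
]
always_left = lambda history: "left"
always_right = lambda history: "right"
best = choose_policy(W, [always_left, always_right], hypotheses, horizon=5)
print("chosen:", "always_right" if best is always_right else "always_left")
```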


Conjecture: the agent does not attempt to deceive the evaluator. The agent’s utility depends on the state, not the reward, and when observations are more informative about the state, rewards are more informative about the utility function. Thus, the agent has an interest in taking actions that cause the evaluator to receive observations that reduce the evaluator’s uncertainty about which state they are in.