# Problems integrating decision theory and inverse reinforcement learning

In this post I consider a single hypothetical that potentially has far-reaching implications for the future of AI development and deployment. It concerns a complex interaction between the assumption about which decision theory humans use and the method used to infer their values, such as an inverse reinforcement learning algorithm.

Consider Newcomb’s problem. We have two boxes, Box A and Box B. Box B always contains \$1,000. Box A contains \$1,000,000 if and only if a near-perfect predictor, Omega, predicts that the agent picks only Box A. We have two agents: agent1, who one-boxes (it’s an FDT agent), and agent2, who two-boxes (a CDT agent). In addition to Omega, there is an inverse reinforcement learner (abbreviated below as IRL) trying to infer the agent’s “values” from its behavior.

What kinds of reward signals does the IRL assume that agent1 and agent2 have? I claim that in the simplistic case of looking at just two possible actions, it will likely conclude that agent1 values the lack of money, because agent1 declines to take Box B. It will correctly deduce that agent2 values money.

In effect, a naïve IRL learner assumes CDT as the agent’s decision theory, and it will fail to adjust when learning about more sophisticated agents (including humans).

This depends a little on the setup of the IRL agent and the exact nature of the states fed into it. I am generally looking at the following setup: since we have a finite state and action space, the IRL learner simply tries to pick a hypothesis set of reward functions that place the highest value on the action taken by the agent compared to the other actions.

This also depends on the exact definition of which “actions” we are considering. If the potential actions are “pick one box” or “pick two boxes,” the IRL agent would think that agent1’s preferences are the reverse of its actual preferences.

This is bad, extremely bad, since even the opposite of the utility function is now in the hypothesis set.

If, for example, we have three actions, “pick one box,” “pick two boxes,” or “do nothing,” then the preference for “pick one box” over “do nothing” removes the reverse of agent1’s reward function from the hypothesis set. It does not, however, put the reward function “maximize money” into the hypothesis set.
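The two-action and three-action cases can be checked with a small sketch. It is only an illustration under simplifying assumptions of mine: reward functions are reduced to strict preference orderings over actions, the naïve learner keeps every ordering whose top action is the observed one, and “maximize money” is read causally as two boxes over one box over nothing. The names `hypothesis_set`, `money_ordering`, and `report` are hypothetical, not part of any IRL library.

```python
from itertools import permutations

# Hypothetical ranking by dollars in hand, read causally: taking both boxes
# always yields 1000 more than taking one, and doing nothing yields zero.
CAUSAL_MONEY_RANK = ("two_box", "one_box", "do_nothing")


def hypothesis_set(actions, observed):
    """Return every strict ordering of `actions` (best first) topped by `observed`."""
    return [order for order in permutations(actions) if order[0] == observed]


def money_ordering(actions):
    """The "maximize money" ordering restricted to the available actions."""
    return tuple(a for a in CAUSAL_MONEY_RANK if a in actions)


def report(actions, observed):
    """Check whether the money ordering or its exact reverse survives."""
    hyps = hypothesis_set(actions, observed)
    money = money_ordering(actions)
    return {
        "contains_money": money in hyps,
        "contains_reverse_of_money": money[::-1] in hyps,
    }


two_actions = ("one_box", "two_box")
three_actions = ("one_box", "two_box", "do_nothing")

agent1_two = report(two_actions, "one_box")      # agent1 one-boxes
agent2_two = report(two_actions, "two_box")      # agent2 two-boxes
agent1_three = report(three_actions, "one_box")  # agent1 with a default action

print("agent1, two actions:  ", agent1_two)
print("agent2, two actions:  ", agent2_two)
print("agent1, three actions:", agent1_three)
```

Under these assumptions, observing agent1 with two actions leaves only the reverse of the money ordering, while adding “do nothing” removes that exact reverse without bringing the money ordering back in.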

I can think of several objections to, or fixes for, this problem, but I don’t think they make it easily resolvable.

a) Agent1 actually makes more money in the long run.

This is true, but we are not comparing the decision of some agent0 to become agent1 or agent2; rather, we are comparing the decision of agent1 to take or not take the second box. Also, if agents face these kinds of decisions, it’s hard for them to appear consistent from the IRL’s point of view.

b) IRL could incorporate the decision theory of the agent into its estimates of the agent’s reward function.

This might happen, but it is a hard problem that has not been called out, since it requires learning the decision theory of an agent. If you instead assume that every agent is using FDT, this still gives the wrong answer for agents that are not.

c) IRL could try to use only simple hypotheticals. It’s unclear whether these are enough to infer the complexity of value.

d) IRL could use the agent’s internal evaluations to better infer its true reward function. This becomes too circular, since we are trying to learn true utility, which is highly related to its reward function.

e) IRL could average over more situations. This would work only if Newcomb-like problems don’t dominate the decision landscape.

f) IRL could simulate “the decision to become the type of agent that one-boxes” instead of “the decision to one-box.” Once again, this feels like assuming FDT.
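As a worked version of the claim in (a), suppose Omega predicts correctly with probability $p$. The symbol $p$ is an assumption of mine; the setup above only says “near perfect.” The expected payoffs in dollars are then:

$$
\begin{aligned}
\mathbb{E}[\text{one-box}] &= p \cdot 10^{6} \\
\mathbb{E}[\text{two-box}] &= 10^{3} + (1 - p) \cdot 10^{6}
\end{aligned}
$$

One-boxing has the higher expected payoff whenever $(2p - 1) \cdot 10^{6} > 10^{3}$, that is, whenever $p > 0.5005$. So agent1 does come out ahead for any reasonably accurate predictor, but a learner that only compares the two actions at the moment of choice never sees this.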

So, we have the problem that inverse reinforcement learning depends both on the decision theory of the agent in question and on whether the set of actions includes a default “do nothing” action.

This seems like a non-trivial problem, but there are other complications that arise as well.

There is a major meta-complication with this setup. If agent1 is truly an FDT agent *and* it knows that the IRL exists, it would act differently depending on the exact amount of value that the IRL can allocate to it later. If that amount dwarfs what it would get from Newcomb’s problems, it would start two-boxing. Thus, if there is a naïve IRL agent that is also an AGI, its potential presence could change people’s behavior to be less cooperative with each other.
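A toy version of this incentive can be sketched as follows. Everything beyond the two Newcomb payoffs is an assumption made for illustration: Omega is treated as a perfect predictor, and the hypothetical `irl_allocation` function models a naïve IRL that later hands a fixed `stake` of value only to agents it has inferred to value money, meaning the ones it saw two-boxing.

```python
# Newcomb payoffs, assuming Omega predicts the agent's policy perfectly.
NEWCOMB_PAYOFF = {"one_box": 1_000_000, "two_box": 1_000}


def irl_allocation(action, stake):
    """Hypothetical: the naive IRL gives `stake` only to apparent money-lovers."""
    return stake if action == "two_box" else 0


def best_action(stake):
    """Policy an agent aware of the IRL would pick to maximize total value."""
    totals = {
        action: payoff + irl_allocation(action, stake)
        for action, payoff in NEWCOMB_PAYOFF.items()
    }
    # On an exact tie, max() keeps the first key, which is "one_box".
    return max(totals, key=totals.get)


for stake in (0, 500_000, 999_000, 999_001, 10_000_000):
    print(f"stake={stake:>10,}: {best_action(stake)}")
```

In this sketch the agent switches to two-boxing once the stake exceeds \$999,000, the gap between the two Newcomb payoffs. The exact threshold is an artifact of the assumed allocation rule; the point is only that a large enough stake flips the behavior.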

The other complication, one meta-level up from the previous one, is that the IRL’s learning system has an effect on the agents it is observing. This creates strange incentives for the IRL itself. It might suddenly become aware that its own value learning system is causing problems in the world, according to either people’s judgment or a certain initial utility function. Depending on how closely it is supposed to guard its value system, it could try to remove it. In either case, the AGI’s loop of understanding its own potentially negative impact through simulating humans creates an inherent instability and self-tension.