Comparing reward learning/reward tampering formalisms

Contrasting formalisms

Here I’ll contrast the approach we’re using in Pitfalls of Learning a Reward Online (summarised here), with that used by Tom Everitt and Marcus Hutter in the conceptually similar Reward Tampering Problems and Solutions in Reinforcement Learning. In the following, histories h_t are sequences of actions and observations; thus h_t = a_1 o_1 a_2 o_2 … a_t o_t. The agent’s policy is given by π, the environment is given by μ.
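As a toy illustration (my own sketch, not code from either paper), a history can be represented as an alternating list of actions and observations, generated by interleaving a policy and an environment:

```python
# Toy sketch (my own illustration): a history h_t is an alternating
# sequence of actions and observations, h_t = a_1 o_1 ... a_t o_t.

def pi(history):
    """Hypothetical policy: maps the history so far to the next action."""
    return "a"

def mu(history, action):
    """Hypothetical deterministic environment: maps the history and the
    chosen action to the next observation (here it just records the step)."""
    return "o%d" % (len(history) // 2 + 1)

def rollout(m):
    """Generate a complete history h_m of length m."""
    history = []
    for _ in range(m):
        action = pi(history)
        history += [action, mu(history, action)]
    return history

print(rollout(3))  # ['a', 'o1', 'a', 'o2', 'a', 'o3']
```

In the general setting both π and μ would be stochastic; deterministic functions keep the sketch short.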

Then the causal graph for the “Pitfalls” approach is, in plate notation (which basically means that, for every value of t from 1 to m, the graph inside the rectangle is true):

Here ℛ is the set of reward functions (mapping “complete” histories of length m to real numbers), ρ tells you which reward function is correct, conditional on complete histories, and R is the final reward.

In order to move to the reward tampering formalism, we’ll have to generalise ℛ and ρ, just a bit. We’ll allow a reward function R ∈ ℛ to take partial histories h_t - shorter than h_m - and return a reward. Similarly, we’ll generalise ρ to a conditional distribution on ℛ, conditional on all histories h_t, not just on complete histories h_m.
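A minimal sketch of these generalised objects (the particular reward functions and updating rule are invented for illustration): reward functions that accept partial histories, and a ρ that returns a distribution over them conditional on any history:

```python
# Hypothetical reward functions: each maps a (possibly partial) history
# to a real number.
def R_count_actions(history):
    return float(history.count("a"))

def R_count_o1(history):
    return float(history.count("o1"))

def rho(history):
    """Toy generalised learning process: a distribution over the set of
    reward functions, conditional on the (possibly partial) history h_t."""
    p = min(1.0, len(history) / 6.0)   # arbitrary updating rule
    return {R_count_actions: p, R_count_o1: 1.0 - p}

# Expected reward under rho for a partial history:
h = ["a", "o1", "a", "o2"]
expected = sum(p * R(h) for R, p in rho(h).items())
```

Nothing here depends on h being complete, which is exactly the generalisation described above.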

This leads to the following graph:

This graph is now general enough to include the reward tampering formalism.

States, data, and actions

In the reward tampering formalism, “observations” (o_t) decompose into two pieces: states (s_t) and data (d_t). The idea is that the data informs you about the reward function, while the states get put into the reward function to get the actual reward.

So we can model this with the following causal graph (adapted from graph 10b, page 22; this is a slight generalisation, as I haven’t assumed Markovian conditions):

Inside the rectangle, the histories split into data (d_t), states (s_t), and actions (a_t). The reward function R_t is defined by the data only, while the reward r_t comes from this reward function and from the states only—actions don’t directly affect these (though they can indirectly affect them by deciding what states and data come up, of course). Note that in the reward tampering paper, the authors don’t distinguish explicitly between R_t and r_t, but they seem to do so implicitly.
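This separation of roles can be sketched in code (the function names and selection rule are mine, purely for illustration): the reward function is selected from the data history alone, and the reward applies it to the state history alone:

```python
# Toy illustration: R_t depends only on the data d_{1:t}, and the reward
# r_t = R_t(s_{1:t}) depends only on R_t and the states; actions appear
# in neither argument.

def choose_reward_function(data_history):
    """Select R_t from the data history alone."""
    if "prefer_high" in data_history:
        return max   # reward the best state seen so far
    return min       # reward the worst state seen so far

def reward(data_history, state_history):
    """Compute r_t = R_t(s_{1:t})."""
    R_t = choose_reward_function(data_history)
    return R_t(state_history)

r = reward(["noise", "prefer_high"], [1, 4, 2])   # -> 4 (max of the states)
```

Actions influence r_t only indirectly, by shaping which data and states arrive, mirroring the arrows in the graph.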

Finally, Θ_R is the “user’s reward function”, which the agent is estimating via R_t; this connects to the data only.

Almost all of the probability distributions at each node are “natural” ones that are easy to understand. For example, there are arrows into r_t (the reward) from R_t (the reward function) and s_{1:t} (the states history); the “conditional distribution” of r_t is just “apply R_t to s_{1:t}”. The environment, action, and history naturally provide the next observations (state and data).

Two arrows point to more complicated relations: the arrow from Θ_R to d_t, and that from d_{1:t} to R_t. The two are related; the data is supposed to tell us about the user’s true reward function, while this information informs the choice of R_t.

But the fact that the nodes and the probability distributions have been “designed” this way doesn’t affect the agent. It has a fixed process P_RT for estimating R_t from d_{1:t} (P_RT stands for the probability function of the reward tampering formalism). It has access to d, s, and a (and their histories) as well as its own policy, but has no direct access to Θ_R or μ_RT.

In fact, from the agent’s perspective, Θ_R is essentially part of μ_RT, the environment, though focusing on the data d only.

States and actions in “Pitfalls” formalism

Now, can we put this into the “Pitfalls” formalism? It seems we can, like so:

All conditional probability distributions in this graph are natural.

This graph looks very similar to the “reward tampering” one, with the exception of ℛ and ρ, pointing at ρ and R_t respectively.

In fact, ρ plays the role of Θ_R in that, for P_LP the probability distribution of the learning process,

P_LP(R_t | h_t, ρ) = P_RT(R_t | d_{1:t}, Θ_R).

Note that P_LP in that expression is natural and simple, while P_RT is complex; essentially, ρ carries the same information as Θ_R.

The environment μ of the learning process plays the same role as the combined μ_RT and Θ_R from the reward tampering formalism.

So the isomorphism between the two approaches is, informally speaking:

  1. On reward functions conditional on histories, ρ ≈ (P_RT, Θ_R).

  2. μ ≈ (μ_RT, Θ_R).
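The correspondence above can be sketched as code (a rough illustration under my own naming, not a formal construction from either paper): given the reward-tampering objects, the learning-process objects are obtained by currying Θ_R into them:

```python
# Hypothetical tampering-side objects, for illustration only.
def P_RT(data_history, Theta_R):
    """Distribution over reward-function labels given data and Theta_R."""
    return {("reward_fn_for", Theta_R): 1.0}

def mu_RT(history, action, Theta_R):
    """Next-observation distribution given history, action and Theta_R."""
    return {("obs_about", Theta_R): 1.0}

def make_rho(Theta_R):
    """rho ~ (P_RT, Theta_R): the learning process with Theta_R baked in."""
    return lambda history: P_RT(history, Theta_R)

def make_mu(Theta_R):
    """mu ~ (mu_RT, Theta_R): the environment with Theta_R absorbed."""
    return lambda history, action: mu_RT(history, action, Theta_R)
```

Once Θ_R is fixed, ρ and μ no longer mention it explicitly, which is why the agent can treat Θ_R as just part of its environment.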

Uninfluenceable similarities

If we make the processes uninfluenceable (a concept that exists for both formalisms), the causal graphs look even more similar:

Here the pair (ρ, μ), for the learning process, plays exactly the same role as the pair[1] (Θ_R, μ_RT), for reward tampering: determining reward functions and observations.

  1. There is an equivalence between the pairs, but not between the individual elements; thus ρ carries more information than Θ_R, while μ carries less information than μ_RT. ↩︎