Reinforcement learning with imperceptible rewards

TLDR: We define a var­i­ant of re­in­force­ment learn­ing in which the re­ward is not per­ceived di­rectly, but can be es­ti­mated at any given mo­ment by some (pos­si­bly costly) ex­per­i­ment. The re­ward func­tion is no longer a func­tion of the ob­ser­va­tion his­tory, but a differ­ent ob­ject that we call “in­stru­men­tal re­ward func­tion”. We give two defi­ni­tions of in­stru­men­tal re­ward func­tion and prove their equiv­alence. We also de­rive a re­gret bound for this set­ting.


In “clas­si­cal” re­in­force­ment learn­ing the agent per­ceives the re­ward sig­nal on ev­ery round of its in­ter­ac­tion with the en­vi­ron­ment, whether through a dis­tinct in­put chan­nel or through some given way to com­pute the re­ward from the in­ter­ac­tion his­tory so far. On the other hand, we can rather eas­ily imag­ine agents that op­ti­mize prop­er­ties of their en­vi­ron­ment that they do not di­rectly per­ceive. For ex­am­ple, if Alice, who lives in Do­minica, donates money to the Against Malaria Foun­da­tion in or­der to save some­one in Africa, then the re­sult is usu­ally not visi­ble to Alice at the time it oc­curs, if ever. Similarly, Clippy the pa­per­clip max­i­mizer doesn’t always per­ceive all the pa­per­clips in the uni­verse. More­over, we might want to de­sign agents that, in or­der to es­ti­mate the re­ward, di­rect queries to hu­mans (which is costly and can­not be done con­tin­u­ously non-stop).

Now, it is pos­si­ble to define the per­ceived re­ward as the sub­jec­tive ex­pected value of the “true” im­per­cep­ti­ble re­ward (see the Re­sults sec­tion for de­tails). Although this trans­for­ma­tion pre­serves ex­pected util­ity, it does not pre­serve Bayesian re­gret. In­deed, Bayesian re­gret is the differ­ence be­tween the ex­pected util­ity at­tained by the agent and the ex­pected util­ity at­tained by a “refer­ence” agent that knows the true en­vi­ron­ment from the on­set. How­ever, af­ter the trans­for­ma­tion, the refer­ence agent will be­have as if it knows the ob­serv­able dy­nam­ics of the true en­vi­ron­ment but still pre­tends not to know the true en­vi­ron­ment for the pur­pose of com­put­ing the re­ward. There­fore, re­gret anal­y­sis re­quires us to con­sider the im­per­cep­ti­ble re­ward hon­estly. In fact, as we will see, cer­tain as­sump­tions about the im­per­cep­ti­ble re­ward func­tion are needed even to make the prob­lem learn­able at all. Fi­nally, this trans­for­ma­tion makes the re­ward func­tion more com­plex and hence ap­ply­ing a “generic” re­in­force­ment learn­ing al­gorithm af­ter the trans­for­ma­tion (in­stead of ex­ploit­ing the spe­cial form of the re­sult­ing re­ward func­tion) might carry a sig­nifi­cant com­pu­ta­tional com­plex­ity penalty.

Re­lated Work: De Blanc 2011 stud­ies so called “on­tolog­i­cal crises”. That is, de Blanc ex­am­ines the prob­lem of trans­lat­ing a re­ward func­tion from one on­tol­ogy into an­other. Here, we avoid this prob­lem by con­sid­er­ing re­ward func­tions that are au­to­mat­i­cally defined in all on­tolo­gies. That said, it might still be in­ter­est­ing to think about how to spec­ify our type of re­ward func­tion start­ing from a re­ward func­tion that is only defined in one par­tic­u­lar on­tol­ogy. We will re­turn to this in the Dis­cus­sion sec­tion.

Krueger et al 2016 con­sider a set­ting where query­ing the re­ward is in­stan­ta­neous and has a fixed cost. This is a spe­cial case of our set­ting, but we al­low a much more gen­eral type of query and cost. Also, Krueger et al don’t de­rive any the­o­ret­i­cal re­gret bound (but they do pre­sent some ex­per­i­men­tal re­sults).

Fi­nally, multi-armed ban­dits with par­tial mon­i­tor­ing are closely re­lated, see for ex­am­ple Bar­tok et al 2013. How­ever, ban­dits by defi­ni­tion as­sume a state­less en­vi­ron­ment, and also our ap­proach is rather differ­ent than what is usu­ally stud­ied in par­tial mon­i­tor­ing.

The liter­a­ture study was very cur­sory and I will be glad to know about prior work I missed!


Par­tially Ob­serv­able MDPs with Im­per­cep­ti­ble Rewards

Par­tially Ob­serv­able Markov De­ci­sion Pro­cesses (POMDPs) serve as a nat­u­ral start­ing point for think­ing about im­per­cep­ti­ble re­wards. In­deed, it might seem like all we have to do is con­sider RL in a POMDP en­vi­ron­ment and let the re­ward to be a func­tion of the (im­per­cep­ti­ble) state. How­ever, this set­ting is in gen­eral un­learn­able even given as­sump­tions that rule out traps.

A (finite) POMDP is defined by non-empty finite sets (states), (ac­tions) and (ob­ser­va­tions), the tran­si­tion ker­nel and the re­ward func­tion . As op­posed to the “clas­si­cal” defi­ni­tion of POMDP, we al­low the value of is to be im­per­cep­ti­ble. The per­cep­ti­ble set­ting is a spe­cial case of the im­per­cep­ti­ble set­ting: we can always en­code the re­ward into the ob­ser­va­tion.

To for­mu­late a learn­ing prob­lem, we as­sume , and to be fixed and an ini­tial state to be given, while and are un­known and be­long to some hy­poth­e­sis class :

Such a prob­lem can be un­learn­able even if is known and there are no ir­re­versible events that can hap­pen:

Ex­am­ple 1

Sup­pose that , , , , and where for any

Since , there is no way to gain any in­for­ma­tion about which hy­poth­e­sis is cor­rect. More­over, the op­ti­mal poli­cies for the two hy­pothe­ses are differ­ent. Namely, for we should always take ac­tion and for we should always take ac­tion .

To for­mal­ize and illus­trate the dis­cus­sion in the Back­ground sec­tion, sup­pose is Borel and is the prior. We can then define the “per­ceived effec­tive re­ward” by

It is then easy to see that the op­er­a­tor pre­serves ex­pected util­ity: given any policy and

and there­fore, for any ge­o­met­ric time dis­count pa­ram­e­ter

On the other hand, Bayesian re­gret is not pre­served since, in general

Here is short­hand no­ta­tion for the same prob­a­bil­ity dis­tri­bu­tion as be­fore.

In­deed, in Ex­am­ple 1 the LHS of the above is since , whereas the RHS is .

The pathol­ogy of Ex­am­ple 1 comes about be­cause re­ward is not only im­per­cep­ti­ble but en­tirely un­ob­serv­able. That is, no ex­per­i­ment can pro­duce any in­for­ma­tion about whether the re­ward on a given round was 0 or 1. More speci­fi­cally, the states and are as­signed differ­ent re­wards, but there is no ob­serv­able differ­ence be­tween them. It is as if Alice would as­sign value, not to peo­ple in Africa (whose ex­is­tence and well-be­ing can be mea­sured) but to some Fly­ing Spaghetti Mon­ster s.t. the world be­haves ex­actly the same re­gard­less of its ex­is­tence or con­di­tion.

This ob­ser­va­tion sug­gests that, in­stead of as­sign­ing re­wards to states which are just ab­stract la­bels in a model, we should as­sign re­wards to states that are defined in terms of the ob­serv­able con­se­quences of the in­ter­ac­tion of the agent with the en­vi­ron­ment. This leads us to the no­tion of “in­stru­men­tal state”, which we will now define for­mally.

In­stru­men­tal States and Re­ward Functions

First, we in­tro­duce some tech­ni­cal defi­ni­tions for no­ta­tional con­ve­nience.

Defi­ni­tion 1

is the cat­e­gory whose ob­jects are pairs where is a real vec­tor space and is a con­vex sub­set of , and whose mor­phisms are

We omit de­scribing iden­tity and com­po­si­tion of mor­phisms since they are ob­vi­ous.

It is easy to see that has ar­bi­trary limits. In par­tic­u­lar, has a fi­nal ob­ject that we de­note by (the one point set), prod­ucts () and in­verse limits of se­quences. For any finite set , is an ob­ject in . Also, we will some­times abuse no­ta­tion by re­gard­ing as an ob­ject of in­stead of (i.e. mak­ing im­plicit).

Defi­ni­tion 2

The func­tor is defined by

Defi­ni­tion 3

For any , we define by

Note that is a nat­u­ral trans­for­ma­tion from to the con­stant func­tor with value .

Given and s.t. for , we de­note .

Now, we are ready to define in­stru­men­tal states. We fix the sets and .

Defi­ni­tion 4

For any , we define , the space of time step in­stru­men­tal states, re­cur­sively by

Here, is a map­ping from to . The in­verse image of a con­vex set (in this case ) un­der an af­fine map­ping (in this case ) is also a con­vex set.

The se­man­tics of Defi­ni­tion 4 is as fol­lows. Any can be re­garded as a pair con­sist­ing of some (the image of un­der ) and a map­ping defined by . Given , is the prob­a­bil­ity dis­tri­bu­tion over ob­ser­va­tions re­sult­ing from tak­ing ac­tion in state , whereas is the state re­sult­ing from tak­ing ac­tion a in state con­di­tional on ob­serv­ing . This se­man­tics can be made more for­mal as fol­lows:

Defi­ni­tion 5

Given , we define re­cur­sively as fol­lows:

  1. For all , and : iff , and .

Defi­ni­tion 6

Given , and , and as­sum­ing that , we re­cur­sively define by

  1. For :

  2. For with some , and :

The point of defin­ing in this man­ner is that (i) differ­ent points in cor­re­spond to states that are truly not equiv­a­lent, i.e. can be em­piri­cally dis­t­in­guished (statis­ti­cally) and (ii) con­vex com­bi­na­tions of points in cor­re­spond pre­cisely to prob­a­bil­is­tic mix­tures. For­mally, we have:

Defi­ni­tion 7

Given and (a policy), we define (prob­a­bil­ity dis­tri­bu­tion over his­to­ries re­sult­ing from fol­low­ing policy in state ) by

Propo­si­tion 1

Con­sider some and as­sume that for any , . Then, .

Propo­si­tion 2

Con­sider some , and . Then

is a bounded poly­tope but in gen­eral it is *not} a sim­plex: we can­not re­gard it as just prob­a­bil­ity dis­tri­bu­tions over some set. For ex­am­ple, if , then it’s easy to see that is a square (one axis is the prob­a­bil­ity to get a given ob­ser­va­tion when tak­ing one ac­tion, the other axis is the prob­a­bil­ity for the other ac­tion). On the other hand, if then is a sim­plex: it is canon­i­cally iso­mor­phic to .

There are a nat­u­ral mor­phisms whose se­man­tics is for­get­ting about the be­hav­ior of the state at time step :

Defi­ni­tion 8

We define for any re­cur­sively. is the unique mor­phism from to . For any , is given by

We thereby get a se­quence

Defi­ni­tion 9

We define , the space of (in­finite time step) in­stru­men­tal states by

We de­note the canon­i­cal pro­jec­tions by .

Of course, can also be re­garded as the space of all pos­si­ble stochas­tic en­vi­ron­ments. Speci­fi­cally, we have:

Defi­ni­tion 10

For any we define by

Defi­ni­tion 11

Given , and , we define by

Like in the finite time case, we have

Defi­ni­tion 12

Given and , we define by

Propo­si­tion 3

Con­sider some and as­sume that for any , . Then, .

Propo­si­tion 4

Con­sider some , and . Then

For each , is finite-di­men­sional and there­fore has a nat­u­ral topol­ogy. also be­comes a topolog­i­cal space by equip­ping it with the in­verse limit topol­ogy. Since the are closed and bounded they are com­pact, and there­fore is also com­pact by Ty­chonoff’s the­o­rem. In the spe­cial case , and the in­verse limit topol­ogy is the weak topol­ogy for prob­a­bil­ity mea­sures, defined w.r.t. the product topol­ogy on .

We can now give the first defi­ni­tion of an in­stru­men­tal re­ward func­tion: a con­tin­u­ous af­fine func­tion . Why af­fine? A con­vex lin­ear com­bi­na­tion of in­stru­men­tal states is em­piri­cally in­dis­t­in­guish­able from a prob­a­bil­is­tic lot­tery. If we as­sign ex­pected val­ues to prob­a­bil­is­tic lot­ter­ies (as we should by the VNM the­o­rem), then we must also as­sign them to con­vex lin­ear com­bi­na­tions of in­stru­men­tal states: oth­er­wise our re­ward again de­pends on en­tirely un­ob­serv­able pa­ram­e­ters of our model.

An al­ter­na­tive ap­proach is to con­sider the no­tion of “ex­per­i­ment” ex­plic­itly.

We will use the no­ta­tion . Given a log­i­cal con­di­tion , the sym­bol will de­note 1 when is true and 0 when is false.

Defi­ni­tion 13

Given and , we define by

is said to be a ter­minable policy when for any

That is, a ter­minable policy is al­lowed to pro­duce a spe­cial to­ken which ter­mi­nates the “ex­per­i­ment” (in­stead of choos­ing an ac­tion), and we re­quire that, for any en­vi­ron­ment, this to­ken will be even­tu­ally pro­duced with prob­a­bil­ity 1.

Defi­ni­tion 14

Given a ter­minable policy and a bounded func­tion , we define the func­tion by

This gives us a sec­ond defi­ni­tion of in­stru­men­tal re­ward func­tions. In fact, the two defi­ni­tions are equiv­a­lent:

The­o­rem 1

For any ter­minable policy and bounded func­tion , is con­tin­u­ous and af­fine. Con­versely, for any con­tin­u­ous af­fine , there ex­ists a ter­minable policy and s.t. .

Put­ting the sec­ond part of the the­o­rem into words, for any in­stru­men­tal re­ward func­tion (in the sense of the first defi­ni­tion) there is some ex­per­i­ment the agent can do which yields an un­bi­ased es­ti­mate of the re­ward for the in­stru­men­tal state that ex­isted at the be­gin­ning of the ex­per­i­ment.

The range is not op­ti­mal, but for the re­gret bound in the next sub­sec­tion, it’s only im­por­tant that it is bounded by some fixed con­stant.

To illus­trate this con­cept of in­stru­men­tal re­ward func­tion, imag­ine that Clippy has ac­cess to a black box with a col­lec­tion of lev­ers on its out­side. Pul­ling the lev­ers pro­duces some sounds that hint at what hap­pens in­side the box, but are not enough to de­ter­mine it with cer­tainty. The box has a shut­tered glass win­dow, whose shut­ter can be opened from the out­side. Through the win­dow, Clippy can see a jum­ble of scrap metal, me­chan­i­cal ma­nipu­la­tors that are con­trol­led by the lev­ers (and can be used to shape the scrap metal), and also a few mice run­ning around the box and wreak­ing havoc. How­ever, it is not pos­si­ble to con­trol the ma­nipu­la­tors while the shut­ter is open. Worse, while open­ing the shut­ter al­lows see­ing a snap­shot of the shape of the metal, it also causes the ma­nipu­la­tors to move chaot­i­cally, ru­in­ing this shape. So, Clippy can ex­per­i­ment with the lev­ers and oc­ca­sion­ally open the shut­ter to test the re­sult. How­ever, in or­der to pro­duce and main­tain pa­per­clips in­side the box, the shut­ter has to be kept closed (and the pa­per­clips hid­den) most of the time.

It is also pos­si­ble to con­sider re­ward func­tions of the more gen­eral form , re­quired to be con­tin­u­ous and af­fine in the sec­ond ar­gu­ment. Such a re­ward func­tion de­pends both on the cur­rent (un­known) in­stru­men­tal state of the en­vi­ron­ment and the ob­serv­able his­tory so far. By The­o­rem 1, such a re­ward can be equiv­a­lently de­scribed in terms of a fam­ily of ter­minable poli­cies and a fam­ily of bounded func­tions s.t.

This means that, the value of re­ward can be es­ti­mated em­piri­cally, but only if the agent re­mem­bers the en­tire ob­ser­va­tion his­tory. If the his­tory is for­got­ten at some point, it might never be able to es­ti­mate the re­ward again. We will such re­ward func­tions “semi-in­stru­men­tal”.

Although semi-in­stru­men­tal re­ward func­tions are more gen­eral than in­stru­men­tal re­ward func­tions, I think that there is some in­ter­est in study­ing the nar­rower class. Ar­guably, in­stru­men­tal re­ward func­tions are a bet­ter model of what counts as “con­se­quen­tial­ist” or “goal-di­rected” be­hav­ior, since they de­pend only on the state of the en­vi­ron­ment. In­deed, it is easy to con­struct a semi-in­stru­men­tal re­ward func­tion that makes any given policy Bayes-op­ti­mal for any given prior, so semi-in­stru­men­tal re­ward func­tions are in­ca­pable (with­out fur­ther as­sump­tions) to dis­t­in­guish be­tween “in­tel­li­gent” and “un­in­tel­li­gent” be­hav­ior. On the other hand, op­ti­mal­ity w.r.t. some in­stru­men­tal re­ward func­tion seems like a stronger con­di­tion.

In or­der to de­rive a re­gret bound, we will re­strict at­ten­tion to those re­ward func­tions for which the ter­minable policy can be made to ter­mi­nate within time that has some finite and bounded ex­pected value. I don’t know an el­e­gant way to char­ac­ter­ize those re­ward func­tions in gen­eral, but we will de­scribe one class of such func­tions.

Defi­ni­tion 15

Con­sider any . Then, we can define the to­tal vari­a­tion dis­tance by

In gen­eral, is a pseu­do­met­ric. More­over, when is finite-di­men­sional and bounded, it is a metriza­tion of the nat­u­ral topol­ogy. For a finite set and , is just the usual to­tal vari­a­tion met­ric. For a ball of unit di­ame­ter in Eu­clidean space, is the Eu­clidean met­ric.

Defi­ni­tion 16

Con­sider any . We define the met­ric on by

For any , is a metriza­tion of the in­verse limit topol­ogy on .

Propo­si­tion 5

Con­sider any and af­fine and Lip­s­chitz with re­spect to . Then there is a ter­minable policy and a bounded func­tion s.t. and

Note that is a pa­ram­e­ter rem­i­nis­cent of ge­o­met­ric time dis­count that con­straints the shape of the re­ward func­tion. How­ever, in the re­gret anal­y­sis that fol­lows, it is not the same as the ac­tual ge­o­met­ric time dis­count pa­ram­e­ter . In par­tic­u­lar, we con­sider the asymp­totics in which the lat­ter ap­proaches 1, while the re­ward func­tion is as­sumed to be fixed. It might be in­ter­est­ing to study regimes in which both ap­proach 1, but we will not at­tempt it at pre­sent.

A Re­gret Bound for RL with In­stru­men­tal Rewards

Fix and . For any finite set and , there is a unique map­ping s.t.

That is, is just the in­stru­men­tal state cor­re­spond­ing to the POMDP state .

Given con­tin­u­ous af­fine, we get for any and the POMDP re­ward func­tion . Hence, one nat­u­ral set­ting for re­gret anal­y­sis is fix­ing and and con­sid­er­ing a hy­poth­e­sis class of tran­si­tion kernels

How­ever, re­gret anal­y­sis for POMDPs in­volves some tech­ni­cal com­plex­ity, and we pre­fer to start with a sim­pler set­ting. Hope­fully, we will ad­dress the POMDP case in a fol­lowup es­say.

Sup­pose that . Then, given any MPD tran­si­tion ker­nel , we can define the POMDP tran­si­tion ker­nel by

Fix con­tin­u­ous af­fine in the sec­ond ar­gu­ment (a type of semi-in­stru­men­tal re­ward func­tion). For any as above, we get the in­duced re­ward func­tion given by . We can now con­sider the learn­ing prob­lem cor­re­spond­ing to a hy­poth­e­sis class of MDP tran­si­tion kernels

It might seem odd to con­sider a set­ting with fully ob­serv­able states in the con­text of im­per­cep­ti­ble re­wards. How­ever, we can think about it as fol­lows: what we ob­serve is only a la­bel whose phys­i­cal mean­ing is un­known. Only given the (un­known) tran­si­tion ker­nel, such a la­bel can be in­ter­preted as as­signed a re­ward.

For ex­am­ple, imag­ine a robot tasked with sort­ing balls into bins ac­cord­ing to their weight. Some of the balls are solid gold and some of them are solid silver, so it’s in prin­ci­ple pos­si­ble to know their weight just by look­ing at them. How­ever, the robot doesn’t know a pri­ori whether silver or gold is heav­ier. On the other hand, the robot can perform some ex­per­i­ment with a ball to de­ter­mine its weight. In this situ­a­tion, the re­ward is im­per­cep­ti­ble even if the state (e.g. the lo­ca­tions, col­ors and sizes of the balls, the lo­ca­tions and shapes of the bins and the lo­ca­tion of the robot and the state of its ma­nipu­la­tors) is fully ob­serv­able.

Us­ing the con­cepts of MB di­men­sion and RVO di­men­sion we defined in a pre­vi­ous es­say, we can for­mu­late the re­gret bound.

The­o­rem 2

There is some s.t. the fol­low­ing holds.

Con­sider any finite non-empty sets and , func­tion con­tin­u­ous af­fine in the sec­ond ar­gu­ment , closed set and Borel prob­a­bil­ity mea­sure on (prior). Sup­pose is s.t. for any , is ter­minable, and is a bounded func­tion s.t.

Define the max­i­mal ex­pected re­ward es­ti­ma­tion time by

Define the max­i­mal bias span by

Define by

Denote and . Define the Bayesian re­gret by

Then, there is a fam­ily of poli­cies s.t.


More on the Re­gret Bound

It might ap­pear odd that the re­gret bound in The­o­rem 2 de­pends on the di­men­sions of the class of re­ward func­tions, but not on the di­men­sions on the class of tran­si­tion ker­nels, like in the per­cep­ti­ble case. The rea­son for this is that we only give the asymp­totic form. Asymp­tot­i­cally, the lead­ing term is and its co­effi­cient de­pends only on those pa­ram­e­ters that ap­pear in The­o­rem 2. How­ever, ex­am­in­ing the proof re­veals that there is also an term that has the same form as the re­gret bound in the per­cep­ti­ble case. For the sake of brevity and sim­plic­ity we will not at­tempt to write down a more pre­cise re­gret bound that re­flects the role of the di­men­sions of the class of tran­si­tion ker­nels, but in prin­ci­ple it is not difficult. We must keep in mind, how­ever, that in prac­tice the other term might be sig­nifi­cant: a pri­ori the di­men­sions of the class of tran­si­tion ker­nels are only bounded by a polyno­mial in and and the lat­ter might be ex­po­nen­tially big for re­al­is­tic mod­els.

The Death of the Agent and Kamikaze Strategies

There is one po­ten­tial ap­pli­ca­tion of defin­ing re­wards that are not a di­rect func­tion of ob­ser­va­tions which seems not sup­ported by in­stru­men­tal re­wards as we defined them. Namely, one can imag­ine agents that are mo­ti­vated to fol­low a cer­tain strat­egy that is de­signed to de­stroy the agent but pro­duce some de­sir­able re­sults in the ex­ter­nal world (“kamikaze strat­egy”). In other words, al­though sur­vival is a con­ver­gent in­stru­men­tal goal, it seems en­tirely rea­son­able for it to be sac­ri­ficed in cer­tain spe­cific sce­nar­ios. To give an ex­am­ple, imag­ine a body­guard robot whose pri­mary goal is keep­ing a cer­tain per­son al­ive. If as­sas­s­ins shoot at the per­son, and the only wait to stop them is for the robot to block the bul­lets with its body, then it can be the op­ti­mal strat­egy, even if it will de­stroy the robot and pre­vent it from con­tin­u­ing its func­tion as body­guard (as­sum­ing that e.g. it will give the per­son a chance to es­cape or some­one else a chance to stop the as­sas­s­ins).

Our for­mal­ism doesn’t di­rectly ad­dress the pos­si­bil­ity of the agent’s death, be­cause the se­quence of ac­tions and ob­ser­va­tions is as­sumed to be in­finite. One sim­ple way to ac­com­mo­date death is pos­tu­lat­ing a spe­cial ob­ser­va­tion s.t. it is never re­ceived be­fore death and always re­ceived af­ter death. If we do this, then death cor­re­sponds to a spe­cific in­stru­men­tal state and there­fore its re­ward is a fixed con­stant. This seems in­com­pat­i­ble with kamikaze strate­gies where the de­ci­sion of self-sac­ri­fice is con­tin­gent on con­di­tions in the ex­ter­nal world af­ter death.

Another ap­proach is as­sum­ing the agent be­comes a “ghost”: it keeps re­ceiv­ing ob­ser­va­tions which re­flect the ac­tual phys­i­cal world but its ac­tions have no effect. Such ghosts might the­o­ret­i­cally be pro­duced by a sim­plic­ity prior: for ex­am­ple, if the agent is an AI con­nected to a cam­era that mon­i­tors a room, then we can imag­ine a vir­tual video of the same room con­tin­u­ing be­yond the AI’s shut­down. This al­lows for differ­ent in­stru­men­tal states af­ter death and can po­ten­tially jus­tify kamikaze strate­gies, but it seems like a very frag­ile con­struct and is un­likely to guaran­tee rea­son­able be­hav­ior.

The prob­lem with death can be viewed from an­other per­spec­tive: our re­gret anal­y­sis as­sumes a no-traps con­di­tion () whereas death is usu­ally a trap. There­fore, to guaran­tee ra­tio­nal be­hav­ior while ac­count­ing for death, we need to op­er­ate within a frame­work that al­lows for traps.

One such frame­work is re­quiring Bayes-op­ti­mal­ity and giv­ing up on learn­abil­ity. This seems both too weak (be­cause noth­ing is guaran­teed for spe­cific en­vi­ron­ments) and too strong (be­cause it’s com­pu­ta­tion­ally in­tractable). That said, I think this ap­proach can be sal­vaged by re­quiring some­thing some­what weaker than Bayes-op­ti­mal­ity and prov­ing some­thing some­what weaker than learn­abil­ity (hope­fully I will write on that in a fu­ture es­say). In any case, once we give up on learn­abil­ity we can al­low un­ob­serv­able re­wards (the gen­eral POMDP set­ting in the be­gin­ning of the Re­sults sec­tion) which al­low han­dling death and kamikaze strate­gies eas­ily. Speci­fi­cally, we can have a plethora of “dead” states that pro­duce only the ob­ser­va­tion and whose tran­si­tions do no de­pend on the ac­tion, but which have differ­ent re­wards. So, this ap­proach “solves” the prob­lem but at a high cost: no learn­abil­ity or re­gret anal­y­sis.

Another frame­work for deal­ing with traps is Del­ega­tive Re­in­force­ment Learn­ing (DRL). In DRL the agent in­ter­acts with an ex­ter­nal ad­vi­sor, and is thereby able to suc­cess­fully nav­i­gate all traps that the ad­vi­sor knows about. In other words, it con­verges to a policy that is Bayes-op­ti­mal with re­spect to the be­lief state of the ad­vi­sor (while the ad­vi­sor is not able to pro­duce such a policy by it­self). Com­bin­ing DRL with the in­stru­men­tal re­wards for­mal­ism should al­low ac­count­ing for death. Speci­fi­cally, in any given POMDP hy­poth­e­sis, the state space will be de­com­posed as , with the fol­low­ing as­sump­tions:

  • The tran­si­tion ker­nel on states in doesn’t de­pend on the ac­tion.

  • is in­var­i­ant un­der the dy­nam­ics (death is ir­re­versible).

  • The re­ward func­tion on can be ar­bi­trary.

  • The re­ward func­tion on has to fac­tor through (i.e. de­pend only on the in­stru­men­tal state), and more­over the policy for es­ti­mat­ing the re­ward func­tion is known to be safe (al­ter­na­tively, we might let the agent learn a safe es­ti­ma­tion policy from some space of can­di­date poli­cies).

Un­der these as­sump­tions (or some var­i­ant thereof), it seems plau­si­ble that we should be able to prove learn­abil­ity and de­rive a rea­son­able re­gret bound. For­mal­iz­ing this line of thought is a task for fu­ture work.

The DRL ap­proach to kamikaze strate­gies also has im­pli­ca­tions on cor­rigi­bil­ity. In­deed, if the ex­ter­nal ad­vice is com­pat­i­ble with an ad­mis­si­ble re­ward func­tion that recom­mends shut­ting down the agent (i.e. upon del­e­ga­tion, the ad­vi­sor acts to shut down the agent) then a DRL agent will as­sist with its own shut­down. This prop­erty is pre­served by sub­agents, since cre­at­ing a sub­agent is an­other po­ten­tially ir­re­versible ac­tion (and there­fore must be vet­ted by the ad­vi­sor).

Also, it is tempt­ing to spec­u­late about DRL as a model of hu­man self-sac­ri­fice. We have already spec­u­lated be­fore that we can view DRL as model of hu­man learn­ing, where the hu­man’s so­cial en­vi­ron­ment or hu­man­ity as a whole is re­garded as play­ing the role of the ad­vi­sor. Similarly, hu­mans that sac­ri­fice their lives for the sake of their loved ones or the “greater good” can be re­garded as heed­ing that “ad­vise” re­gard­ing the re­wards of states in which the hu­man doesn’t ex­ist. Un­der this model, we can make sense of self-sac­ri­fices by in­di­vi­d­ual hu­mans but not of hy­po­thet­i­cal self-sac­ri­fice by hu­man­ity as a whole (po­ten­tially aug­mented by other minds with which we could com­mu­ni­cate and find com­mon ground), but, the lat­ter re­stric­tion seems com­pat­i­ble with in­tu­ition.

Spec­i­fy­ing In­stru­men­tal Re­ward Functions

It might be use­ful to find nat­u­ral ways to spec­ify in­stru­men­tal re­ward func­tions. In par­tic­u­lar, it is in­ter­est­ing to start with a re­ward func­tion defined on a spe­cific on­tol­ogy (POMDP) and some­how ex­tend it to a full in­stru­men­tal re­ward func­tion (which, like we said be­fore, in­duces a re­ward func­tion on any on­tol­ogy via the map­ping).

De Blanc sug­gests one way to ex­tend re­ward func­tions from one POMDP to an­other, but I don’t know whether this op­er­a­tion leads to an in­stru­men­tal re­ward func­tion, i.e. whether it is com­pat­i­ble with the con­straints im­posed by af­fine de­pen­den­cies be­tween the states (and I don’t see any rea­son why it should).

Essen­tially, what we’re look­ing for is a way to ex­tend an af­fine func­tion from an af­fine sub­space to the en­tire af­fine space (the af­fine sub­space is the af­fine span of the in­stru­men­tal states cor­re­spond­ing to the states of the ini­tial on­tol­ogy; note that, if these in­stru­men­tal states are not af­fine in­de­pen­dent, then we have some con­straint on this ini­tial re­ward func­tion). One nat­u­ral way to do it is look­ing for an af­fine func­tion whose differ­en­tial (the lin­ear func­tion part) has min­i­mal norm, or choos­ing some priv­ileged pro­jec­tion op­er­a­tor from the en­tire space to the af­fine sub­space. How­ever, since the nat­u­ral norms we have here are not in­ner product norms, these choices are usu­ally non-unique, and it’s pos­si­ble we can’t get much out of it.

A differ­ent nat­u­ral ap­proach is us­ing Oc­cam’s ra­zor, i.e. look­ing for an ex­ten­sion of min­i­mal de­scrip­tion length or the ex­pected value of a ran­dom ex­ten­sion sam­pled from a con­di­tional sim­plic­ity prior. This re­quires a nat­u­ral way to de­scribe in­stru­men­tal re­ward func­tions. We do have such a way: as­sum­ing that the and guaran­teed to ex­ist by The­o­rem 1 can be cho­sen to be com­putable, the de­scrip­tion is the pro­gram for those ob­jects with re­spect to a fixed uni­ver­sal Tur­ing ma­chine (we con­sider a sin­gle pro­gram that com­putes both and to ex­ploit pos­si­ble mu­tual in­for­ma­tion be­tween the two).

Th­ese ques­tions are left for fu­ture re­search.


Proof of Propo­si­tion 1

We pro­ceed by in­duc­tion on . For there is noth­ing to prove since so a pri­ori. Now, sup­pose .

We need to show that for any and , . To show it for given and , we con­sider some s.t. . Since , we have

By Defi­ni­tion 7, we get,

By Defi­ni­tion 6, we get,

If this value van­ishes then . Other­wise, con­sider any and define by

Defi­ni­tions 6 and 7 im­ply that

and similarly for . We have and , which im­plies . By the in­duc­tion hy­poth­e­sis, we get . Com­bin­ing this with , we get .

Propo­si­tion A.1

Con­sider any , and . As­sume that


Here, is un­der­stood to mean when and the same for .

Proof of Propo­si­tion A.1

Ob­vi­ous from the defi­ni­tions.

Proof of Propo­si­tion 2

Given and some , to show that it is suffi­cient to show that

  1. For any and ,

  1. For any and , if then for any

We will use this to prove the claim by in­duc­tion on . For , there is noth­ing to prove. For , by Defi­ni­tion 7

By Defi­ni­tion 6

Fur­ther­more, Defi­ni­tions 6 and 7 im­ply that

Here, is defined by .


By Propo­si­tion A.1

We get

By the in­duc­tion hy­poth­e­sis, we get

De­com­pos­ing the right hand side and ap­ply­ing the same rea­son­ing in re­verse,

By Defi­ni­tion 6, can be writ­ten as

Mul­ti­ply­ing the nu­mer­a­tor and de­nom­i­na­tor by , we get

Com­bin­ing this with the pre­vi­ous iden­tity, we get

Given and defined by , we will de­note .

Propo­si­tion A.2

Con­sider any and . Then,

Proof of Propo­si­tion A.2

Ob­vi­ous from the defi­ni­tions.

Propo­si­tion A.3

Con­sider any ,