Delegative Inverse Reinforcement Learning

We introduce a reinforcement-like learning setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point of time, delegate the choice of action to an “advisor”. The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I’m wrong). On the other hand, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.

The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps). We prove that, given certain assumptions about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the slow falling time discount (long-term planning) limit (assuming one of the hypotheses in the prior is true). The assumptions about the advisor are quite strong, but the advisor is not required to be fully optimal: a “soft maximizer” satisfies the conditions. Moreover, we allow for the existence of “corrupt states” in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumptions about the advisor are still unrealistically strong). Finally, we consider advisors that don’t know the environment but have some beliefs about it, and show that in this case the agent converges to Bayes-optimality w.r.t. the advisor’s beliefs, which is arguably the best we can expect.

All the proofs are in the Appendix.


The set of natural numbers is defined to begin from 0. Given , denotes the set . Given a logical formula , denotes its truth value.

Given a set , we denote , the set of finite strings over the alphabet . The unique string of length 0 is denoted . We also denote by the set of infinite strings over alphabet . Given and , is the -th symbol in (i.e. ) and is the prefix of length of (ending in ). Given , is the length of and is the concatenation of and . The latter notation is also applicable when . The notation means that is a prefix of . Given sets , and , we sometimes use the notation . Given , and , is defined by where .

Given sets and , the notation means that is a partial mapping from to , i.e. a mapping from some set to .

Given a topological space , is the space of Borel probability measures on . When no topology is specified, is understood to be discrete: in this case, can also be regarded as a function from to . The space is understood to have the product topology. Given topological spaces , and , is the support of and is the product measure. Given a Markov kernel from to , is the semidirect product of and . When and are discrete, is the Shannon entropy of (in natural base) and is the Kullback-Leibler divergence from to .
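For discrete distributions, the Shannon entropy and Kullback-Leibler divergence just mentioned can be computed directly. A minimal sketch in Python (representing distributions as dicts is an illustrative choice, not part of the formalism):

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (natural base)."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """KL divergence D(p || q); infinite if supp(p) is not contained in supp(q)."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

fair = {"h": 0.5, "t": 0.5}
biased = {"h": 0.75, "t": 0.25}
print(entropy(fair))  # ln 2 ≈ 0.6931
print(kl_divergence(biased, fair) >= 0.0)  # True: KL divergence is nonnegative
```

Note that, matching the convention in the text, both quantities use the natural base.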

Given and , we use the notation

The symbols will refer to usual -notation.


An interface is a pair of finite sets (“actions” and “observations”). An -policy is a function . An -environment is a partial function s.t.


ii. Given and , iff and .

It is easy to see that is always of the form for some . We denote .

Given an -policy and an -environment , we get in the usual way.

An -reward function is a partial function . An -universe is a pair where is an -environment and is an -reward function s.t. . We denote the space of -universes by . Given an -reward function and , we have the associated utility function defined by

Here and throughout, we use geometric time discount; however, this choice is mostly for notational simplicity. More or less all results carry over to other shapes of the time discount function.
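As an illustration of geometric time discount, the following sketch computes a normalized discounted sum of rewards; the normalization factor 1 − γ is an assumption chosen so that rewards in [0, 1] yield utilities in [0, 1]:

```python
def discounted_utility(rewards, gamma):
    """Normalized geometrically discounted sum: (1 - gamma) * sum_n gamma^n * r_n.

    For a constant reward r and a long horizon this approaches r, so
    utilities stay in [0, 1] whenever the rewards do.
    """
    return (1.0 - gamma) * sum(gamma ** n and gamma ** n * r or 0.0 if False else gamma ** n * r
                               for n, r in enumerate(rewards))

# A long constant-reward tail approximates the infinite-horizon value.
print(discounted_utility([1.0] * 10000, 0.99))  # ≈ 1.0
```

As γ → 1 (the “slow falling time discount” limit in the text), more and more of the utility mass comes from the far future, which is why long-term planning dominates in that regime.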

Denote the space of -policies. An -metapolicy is a family , where the parameter is thought of as setting the scale of the time discount. An -meta-universe is a family . This latter concept is useful for analyzing multi-agent systems, where the environment contains other agents and we study the asymptotics when all agents’ time discount scales go to infinity. We won’t focus on the multi-agent case in this essay, but for future reference, it seems useful to make sure the results hold in the greater generality of meta-universes.

Given an -policy , an -universe and , we denote (this is well-defined since is defined on the support of ). We also denote . We will omit when it is obvious from the context.

Definition 1

Fix an interface . Consider a metapolicy and a set of meta-universes. is said to learn when for any

is said to be learnable when there exists that learns .

Our notion of learnability is closely related to the notion of sublinear regret, as defined in Leike 2016, except that we allow the policy to explicitly depend on the time discount scale. This difference is important: for example, given a single universe , it might be impossible to achieve sublinear regret, but is always learnable.

Proposition 1

Fix an interface . Consider a countable learnable set of meta-universes. Consider any s.t. . Consider a -Bayes optimal metapolicy, i.e.

Then, learns .

Proposition 1 can be regarded as a “frequentist” justification for Bayesian agents: if any metapolicy is optimal in a “frequentist” sense for the class (i.e. learns it), then the Bayes optimal metapolicy is such.

Another handy property of learnability is the following.

Proposition 2

Fix an interface . Let be a countable set of meta-universes s.t. any finite is learnable. Then, is learnable.

We now introduce the formalism needed to discuss advisors. Define , and . Here, the factor of is the action taken by the advisor, assumed to be observable by the agent. The environments we will consider are s.t. this action is unless the agent delegated to the advisor at this point of time, which is specified by the agent taking action . It will also be the case that whenever the agent takes action , the advisor cannot take action .

Denote . Given , we define by

Given , we define by .
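As an informal illustration of this interface, the toy loop below shows who selects the action at each round. The stubs and names are hypothetical; writing the delegation action as "⊥" is an assumption made here for readability, and only the delegation bookkeeping follows the text:

```python
DELEGATE = "⊥"  # stand-in symbol for the special action that hands control to the advisor

def step(agent_action, advisor_policy, history):
    """One round of the delegative interface.

    The observation the agent receives includes the advisor's action, which is
    the delegation marker unless the agent delegated on this round.
    """
    if agent_action == DELEGATE:
        advisor_action = advisor_policy(history)
        assert advisor_action != DELEGATE  # the advisor never delegates back
        effective_action = advisor_action
    else:
        advisor_action = DELEGATE  # observable marker: no delegation happened
        effective_action = agent_action
    return effective_action, advisor_action

# Toy advisor that always picks action "a".
adv = lambda history: "a"
print(step("b", adv, []))        # ('b', '⊥'): agent acted on its own
print(step(DELEGATE, adv, []))   # ('a', 'a'): agent delegated
```

The point of this bookkeeping is that the advisor’s choices become part of the agent’s observation history, so the agent can learn from them even on rounds where it did not delegate.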

Definition 2

An -policy is said to be autonomous when for any , .

Consider an -environment and an autonomous -policy , which we think of as the advisor policy. We define the -environment as follows. For any s.t. , and :

It is easy to see that the above is a well-defined -environment with .

Given an -universe , we define the -reward function by and the -universe .

We now introduce the conditions on the advisor policy which will allow us to prove a learnability theorem. First, we specify an advisor that always remains “approximately rational.”

The notation will be used to mean . Given a universe and we define and by

Definition 3

Fix an interface . Consider a universe . Let . A policy is called strictly -rational for when for any and

Now we deal with the possibility of the advisor becoming “corrupt”. In practical implementations where the “advisor” is a human operator, this can correspond to several types of events, e.g. sabotaging the channel that transmits data from the operator to the AI (“wireheading”), manipulation of the operator or replacement of the operator by a different entity.

Definition 4

Fix an interface . Consider a family s.t. for any , if then . We think of as the set of histories in which a certain event occurred. Consider a meta-universe . is said to be a -avoidable event when there is a metapolicy and s.t.




That is, is -avoidable when it is possible to avoid the event for a long time while retaining most of the value. Consider a meta-universe and a -avoidable event. Denote . We define the reward function by

We think of as representing a process wherein, once the event represented by occurs, the agent starts minimizing the utility function. We also use the notation .

Definition 5

Consider a meta-universe and , where we think of the argument of the function as the time discount scale. An autonomous -metapolicy is called -rational for when there exists a -avoidable event (that we think of as the advisor becoming “corrupt”) and an autonomous -metapolicy (representing the advisor policy conditional on non-corruption) s.t.

i. For any s.t. , .

ii. is strictly -rational for .

Our definition of -rationality requires the advisor to be extremely averse to corruption: the advisor behaves as if, once a corruption occurs, the agent policy becomes the worst possible. In general, this seems much too strong: by the time corruption occurs, the agent might have already converged to accurate beliefs about the universe that allow it to detect the corruption and keep operating without the advisor. Even better, the agent can usually outperform the worst possible policy using the prior alone. Moreover, we could allow corruption to depend on unobservable random events, and differentiate between different degrees of corruption, treating them accordingly. We leave these further improvements for future work.

We are now ready to formulate the main result.


Consider a countable family of -meta-universes and s.t. . Let be a family of autonomous -metapolicies s.t. for every , is -rational for . Define . Then, is learnable.

Some remarks:

  • By Proposition 1, is learned by any Bayes optimal metapolicy with prior supported on .

  • To get a feeling for the condition , consider an environment where the reward depends only on the last action and observation. In such an environment, an advisor that performs softmax (with constant parameter) on the next reward has . It is thus “more rational” than the required minimum.

  • It is easy to see that the Theorem can be generalized by introducing an external penalty (negative reward) for each time the agent delegates to the advisor: as it is, using the advisor already carries a penalty due to its suboptimal choice.
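The second remark above mentions an advisor that performs softmax on the next reward. A minimal sketch of such a policy (the reward table and the inverse-temperature parameter beta are hypothetical, chosen only for illustration):

```python
import math

def softmax_policy(q_values, beta):
    """Action distribution with probability proportional to exp(beta * Q(a)).

    Higher beta means a more sharply optimizing (more "rational") advisor;
    beta -> infinity recovers the argmax policy.
    """
    actions = list(q_values)
    weights = [math.exp(beta * q_values[a]) for a in actions]
    total = sum(weights)
    return {a: w / total for a, w in zip(actions, weights)}

# Hypothetical next-step rewards for two actions.
probs = softmax_policy({"safe": 1.0, "trap": 0.0}, beta=3.0)
print(probs["safe"] > probs["trap"])  # True: better actions are exponentially favored
```

The key property for the text’s purposes is that a softmax advisor is not fully optimal, yet still assigns exponentially small probability to sufficiently bad actions.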

The conditions of the Theorem imply that, in some sense, the advisor “knows” the true environment. This is unrealistic: obviously, we expect the human operator to have some (!) uncertainty about the world. However, we clearly cannot do away with this assumption: if the same action triggers a trap in some universes and is necessary for approaching maximal utility in other universes, and there is no observable difference between the universes before the action is taken, then there is no way to guarantee optimality. The prior knowledge you have about the universe caps the utility you can guarantee to obtain. On the other hand, as an AI designer, one can reasonably expect the AI to do at least as well as possible using the designer’s own knowledge. If running the AI is the designer’s best strategy, the AI should be close to Bayes optimal (in some sense that includes computational bounds et cetera: complications that we currently ignore) with respect to the designer’s posterior rather than with respect to some simple prior. In other words, we need a way to transmit the designer’s knowledge to the AI, without hand-crafting an elaborate prior.

The following shows that DIRL achieves this goal (theoretically, given the considerable simplifying assumptions).

Given an environment , we define as follows. For

For , .

Given a family of environments and , we will use the notation to denote the environment given by

Corollary 1

Consider a countable family of -meta-environments, a countable family of reward functions and s.t. given , . We think of as the advisor’s belief about the environment in universe . Let be s.t. and be a family of autonomous -metapolicies s.t. for every , is -rational for . Let be s.t. for any

We think of as the agent’s prior and the equation above as stating the agent’s belief that the advisor’s beliefs are “calibrated”. Consider a -Bayes optimal -metapolicy, i.e.

Then, for every

Corollary 2

In the setting of Corollary 1, fix . Define the set of meta-universes by

Assume is learnable. Then, for every

We also believe that to some extent DIRL is effective against acausal attack. Indeed, the optimality we get from the Theorem + Proposition 1 holds for any prior. However, the speed of convergence to optimality certainly depends on the prior. It is therefore desirable to analyze this dependency and bound the damage an adversary can do by controlling a certain portion of the prior. We leave this for future work.


Proposition A.0

Fix an interface . Consider a countable learnable set of meta-universes. Consider any s.t. . Consider a metapolicy s.t.

Then, learns .

Proof of Proposition A.0

Fix a metapolicy that learns . Consider and let be finite s.t. . For and every we have


Combining, we get

By definition of , this implies

For any , we get

Taking to 0, we get the desired result.

Proof of Proposition 1

Immediate from Proposition A.0.

Proof of Proposition 2

Let . For each , let learn . Choose s.t.




iv. For any and , .

Now define . Clearly, learns .

Proposition A.1

Consider and . Then

Proof of Proposition A.1

Without loss of generality, assume . For any , we have

Denote . Obviously, and therefore and . We get

Since , yielding the desired result.

Proposition A.2

Consider a finite set, , and . Assume that

i. For any , .



Proof of Proposition A.2

Without loss of generality, we can assume that , because otherwise we can rescale by a constant in which will only make larger. It now follows from conditions i+ii that for any , and therefore . We have

Since is a convex function, we get

Proposition A.3

Consider and finite sets, , , and . Assume that

i. For any , .

ii. For any , .

iii. For any and , .

Define by . Then, the mutual information between and in the distribution satisfies

Proof of Proposition A.3

Define by . We have

Applying Pinsker’s inequality
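For reference, the standard form of Pinsker’s inequality relates total variation distance to KL divergence (stated here in the natural-base convention used throughout):

```latex
% Pinsker's inequality: for probability measures P, Q on the same discrete space,
% the L1 (total variation) distance is controlled by the KL divergence:
\[
\left\lVert P - Q \right\rVert_{1}
\;=\; \sum_{x} \lvert P(x) - Q(x) \rvert
\;\le\; \sqrt{2\, \mathrm{D}_{\mathrm{KL}}(P \Vert Q)}
\]
```

This is what lets the proof convert a KL-divergence bound into a bound on expectations of bounded functions.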

By Proposition A.1

By condition iii, and therefore

Applying Proposition A.2, we get

Proposition A.4

Consider a finite set, and and s.t. for any

Then, .

Proof of Proposition A.4

We have

We compute :

Proposition A.5

Consider a universe , a policy and . Then,

Proof of Proposition A.5

For any , it is easy to see that

Taking expected value over , we get

It is easy to see that the second term vanishes, yielding the desired result.

Lemma A

Consider the setting of the Theorem, but assume that for some (i.e. it is finite) and that is strictly -rational for . Denote and . Denote the uniform probability distribution. For any , define recursively as follows

In the above, and are normalization factors chosen to make the probabilities sum to 1. That is, is obtained by starting from the prior , updating on every observation, and setting to 0 the probability of any universe whose probability drops below . When encountering an “impossible” observation we reset to the uniform distribution, but this choice is arbitrary.
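The truncated Bayesian update just described can be sketched as follows; the dict-based likelihood interface and the threshold name `eta` are illustrative, not part of the formal construction:

```python
def update(posterior, likelihoods, eta):
    """One step of the truncated Bayesian update from Lemma A.

    posterior:   dict hypothesis -> current probability
    likelihoods: dict hypothesis -> probability that hypothesis assigns to the
                 observed symbol
    eta:         threshold below which a hypothesis is discarded (set to 0)
    """
    n = len(posterior)
    unnorm = {k: posterior[k] * likelihoods[k] for k in posterior}
    total = sum(unnorm.values())
    if total == 0:
        # "Impossible" observation: reset to the uniform distribution
        # (an arbitrary choice, as noted above).
        return {k: 1.0 / n for k in posterior}
    post = {k: v / total for k, v in unnorm.items()}
    kept = {k: v for k, v in post.items() if v >= eta}
    z = sum(kept.values())  # renormalize after zeroing low-probability universes
    return {k: kept.get(k, 0.0) / z for k in posterior}

prior = {"A": 0.5, "B": 0.5}
post = update(prior, {"A": 0.9, "B": 0.01}, eta=0.05)
print(post)  # hypothesis B falls below the threshold and is zeroed
```

Zeroing sub-threshold hypotheses keeps the support finite and prevents vanishingly unlikely universes from influencing the delegation decisions.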

Define the “loss function” by

Denote . Define the following -metapolicy :

(Technically, we only defined for , but it doesn’t matter.) Then, learns .

Proof of Lemma A

For every , we define and recursively as follows