The reward engineering problem

Today we usually train reinforcement learning agents to perform narrow tasks with simple goals. We may eventually want to train RL agents to behave “well” in open-ended environments where there is no simple goal.

Suppose that we are trying to train an RL agent A. In each episode, A interacts with an environment, producing a transcript τ. We then evaluate that transcript, producing a reward r ∈ [0, 1]. A is trained to maximize its reward.
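This loop can be sketched concretely. Everything below is a hypothetical stand-in chosen only to make the episode → transcript → reward structure runnable: a trivial environment, a made-up evaluation criterion, and crude hill-climbing in place of a real RL algorithm.

```python
import random

def run_episode(policy, steps=3):
    """Roll out a policy; the transcript τ is the sequence of (state, action) pairs."""
    transcript, state = [], 0
    for _ in range(steps):
        action = policy(state)
        transcript.append((state, action))
        state += action
    return transcript

def evaluate(transcript):
    """Stand-in for the (possibly very expensive) evaluation: a reward in [0, 1]."""
    # Hypothetical criterion: the fraction of steps on which A took action 1.
    return sum(a for _, a in transcript) / len(transcript)

def avg_reward(p, rng, n=20):
    """Average reward of the stochastic policy 'take action 1 with probability p'."""
    policy = lambda s: 1 if rng.random() < p else 0
    return sum(evaluate(run_episode(policy)) for _ in range(n)) / n

def train(iters=30, seed=0):
    """Crude hill-climbing on p, standing in for 'A is trained to maximize reward'."""
    rng = random.Random(seed)
    p = 0.1
    for _ in range(iters):
        up, down = min(1.0, p + 0.1), max(0.0, p - 0.1)
        p = up if avg_reward(up, rng) >= avg_reward(down, rng) else down
    return p
```

The rest of the post is about the part this sketch hides inside `evaluate`: how the transcript is actually scored.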

We would like to set up the rewards so that A will learn to behave well — that is, such that if A learns to receive a high reward, then we will be happy with A’s behavior.

To make the problem feasible, we assume that we have access to another agent H which

  1. is “smarter” than A, and

  2. makes “good” decisions.

In order to evaluate transcript τ, we allow ourselves to make any number of calls to H, and to use any other tools that are available. The question is: how do we carry out the evaluation, so that the optimal strategy for A is to also make “good” decisions?

Following Daniel Dewey, I’ll call this the reward engineering problem.

Note that our evaluation process may be quite expensive, and actually implementing it may be infeasible. To build a working system, we would need to combine this evaluation with semi-supervised RL and learning with catastrophes.

Possible approaches and remaining problems

I know of three basic approaches to reward engineering:

  1. Direct supervision. Use H to evaluate A’s behavior, and train A to maximize H’s evaluations. In some contexts we could compare two behaviors instead of evaluating one in isolation.

  2. Imitation learning. Use H to generate a bunch of transcripts, and train A to produce similar-looking transcripts. For example, we could train a model to distinguish A’s behavior from H’s behavior, and reward A when it fools the distinguisher.

  3. Inverse reinforcement learning. Use H to generate a bunch of transcripts, and then infer a reward function which is being approximately optimized by H. Use this reward function to evaluate A’s behavior.

All of these approaches are promising but face significant challenges. I’ll describe some of these problems in the next three sections.

1. Direct supervision

In direct supervision, H looks at a transcript of A’s behavior, and estimates how good that transcript is.
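The comparison-based variant (judging two transcripts against each other rather than scoring one in isolation) can be distilled into a scalar evaluator. The sketch below is entirely hypothetical: transcripts are lists of numbers, scored through a single hand-picked feature, and H's pairwise judgments are fit with a minimal Bradley-Terry logistic model.

```python
import math

def feature(transcript):
    # Hypothetical one-dimensional summary of a transcript.
    return sum(transcript) / len(transcript)

def fit_evaluator(pairs, epochs=200, lr=0.5):
    """Fit a Bradley-Terry model: score(t) = w * feature(t), where
    P(H prefers t1 over t2) = sigmoid(score(t1) - score(t2)).

    `pairs` is a list of (winner, loser) transcripts as judged by H.
    """
    w = 0.0
    for _ in range(epochs):
        for win, lose in pairs:
            d = feature(win) - feature(lose)
            p = 1.0 / (1.0 + math.exp(-w * d))  # model's P(H prefers `win`)
            w += lr * (1.0 - p) * d             # ascend the log-likelihood
    return w

# Suppose H consistently prefers transcripts with higher feature values:
pairs = [([1, 1, 1], [0, 0, 1]), ([1, 0, 1], [0, 0, 0])]
w = fit_evaluator(pairs)
```

The fitted score can then serve as the reward signal for A; the problems below apply to it just as much as to direct per-transcript scoring.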

To see the problem with this scheme, suppose that A has been asked to draw a picture, and A does it by copying an existing picture with some modifications. If originality is especially important, then this may be a very “bad” policy. But even if H is much smarter than A, it may be hard to tell that the picture is not original — creating a derivative work only requires looking at a single existing picture, while checking if a work is derivative requires considering every picture.

More formally: in order for direct supervision to be effective, H needs to be better-informed than A about what is “good.” If this condition is satisfied, then from A’s perspective, estimating H’s estimate of goodness is equivalent to estimating actual goodness. This condition is superficially plausible — after all, we did assume that H is smarter than A.

The problem is that when A picks an action, A is especially well-informed about that action — the computation which produced the action provides evidence about it, and H may not have access to that evidence.


One response is to let H see how A computed its action. If H can understand that process, then H may be able to effectively evaluate the action.

Sometimes this is straightforward: for example, if A uses an attention mechanism to look at a particular painting and copy it, we can simply tell H what A looked at.

In other contexts it might be much more challenging. For example, A may copy an image by “memorizing” the image in its weights, rather than by “looking at” the image at test time.

One approach to these challenging cases is to train A to produce actions and to produce explanations that are maximally helpful for evaluating those actions. For example, we may train A to produce images and to point out similarities between its output and training images.

This is a challenging problem for several reasons. One issue is that producing these explanations involves a huge “action space” and a very complex objective. A more subtle problem is that there are two conflicting objectives: A wants to produce actions that H evaluates as “good,” but providing useful information will sometimes lead H to produce a lower evaluation. Training A to do both tasks requires a new approach.
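To make the conflict concrete, here is a toy reward decomposition in which every quantity and weight is made up: H's evaluation drops when an explanation reveals the work is derivative, and a separate bonus pays A for disclosing. Even with the bonus, concealment wins under these weights, which illustrates why simply summing the two objectives is not a solution.

```python
def combined_reward(quality, is_copy, discloses, disclosure_bonus=0.3, copy_penalty=0.5):
    """Toy combination of the two objectives (all weights hypothetical):
    H's evaluation of the action, plus a bonus meant to reward helpful
    explanations."""
    evaluation = quality - (copy_penalty if (is_copy and discloses) else 0.0)
    bonus = disclosure_bonus if discloses else 0.0
    return evaluation + bonus

# A copied an existing image; H would rate it 0.9 absent the disclosure.
honest = combined_reward(quality=0.9, is_copy=True, discloses=True)   # 0.9 - 0.5 + 0.3
silent = combined_reward(quality=0.9, is_copy=True, discloses=False)  # 0.9
```

Staying silent earns 0.9 against 0.7 for disclosing, so under this naive objective A is trained to withhold exactly the information H most needs.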

Other problems

We can imagine other failure modes of direct supervision. For example, A may find an action that exploits one of H’s biases or blind spots in order to receive a high rating.

We hope that these “attacks” can only succeed if H is ignorant about the process that produced a given action, and so can be resolved by whatever form of transparency allows H to accurately evaluate A’s actions in general.

That is, if A carefully explains to H how an action was chosen to exploit H’s biases, then H can hopefully avoid being exploited. This seems especially plausible given that H is smarter than A.

2. Imitation learning

Imitation learning has two conceptual problems:

  • If H is more competent than A, then A will generally be unable to imitate H’s behavior.

  • We don’t have a totally satisfactory framework for reducing imitation learning to an optimization problem.

What if A can’t imitate H?

Suppose that A has been asked to build a block tower. H can quickly stack the blocks, and 99% of the time the tower stays standing; 1% of the time H messes up and the tower falls down. A is not as capable as H, and so if it tries to stack the blocks quickly the tower falls down 100% of the time.

The “best” behavior for A may be to stack the blocks more slowly, so that the tower can stay standing. But this behavior is hard to induce with imitation learning, because H never stacks the blocks slowly. Instead, an imitation learner is more likely to try to stack the blocks quickly and fail (since at least H does this 1% of the time).
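A maximum-likelihood imitator makes this failure easy to see. In the sketch below (with a made-up discrete set of strategies), A simply copies the empirical distribution of H's demonstrations, so it almost always attempts the fast stack it cannot execute, and assigns zero probability to the slow strategy that would actually work for it.

```python
from collections import Counter

def ml_imitator(demos):
    """Maximum-likelihood imitation over a discrete set of strategies:
    reproduce the empirical distribution of H's demonstrations."""
    counts = Counter(demos)
    total = sum(counts.values())
    return {strategy: c / total for strategy, c in counts.items()}

# H stacks fast; 99 of 100 demos succeed, one is a fumble. H never stacks slowly.
demos = ["stack_fast"] * 99 + ["stack_fast_fumble"]
policy = ml_imitator(demos)
```

The learned policy puts probability 0.99 on the fast stack (which A fails at every time) and probability 0 on stacking slowly, since that strategy never appears in the data.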

One response to this problem is to have H “dumb down” its behavior so that it can be copied by A.

However, this process may be challenging for H. Finding a way to do a task which is within A’s abilities may be much harder than simply doing the task — for example, it may require a deep understanding of A’s limitations and capabilities.

I’ve proposed a procedure, “meeting halfway,” for addressing this problem. The idea is that we train a discriminator to distinguish H’s behavior from A’s behavior, and use the discriminator’s output to help H behave in an “A-like” way. This proposal faces many challenges, and it’s not at all clear if it can work.
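A crude caricature of this idea (not the actual proposal; every ingredient below is made up): given a discriminator's estimate of how distinguishable-from-A each behavior is, H restricts attention to candidate behaviors that are still good enough and picks the least distinguishable one.

```python
def meet_halfway(h_candidates, quality, discriminability, min_quality=0.5):
    """From H's candidate behaviors, pick the most A-like one that is still 'good'.

    discriminability(x): the discriminator's confidence that x came from H
    rather than A (lower = harder to tell apart, i.e. more imitable by A).
    """
    acceptable = [x for x in h_candidates if quality(x) >= min_quality]
    return min(acceptable, key=discriminability)

# Hypothetical 1-D behaviors: larger = more impressive but also more H-like.
choice = meet_halfway([0.5, 1.0, 2.0],
                      quality=lambda x: x / 2,
                      discriminability=lambda x: x / 2)
```

Here H forgoes its best behavior (2.0) in favor of the more A-like 1.0 that still clears the quality bar; the hard open questions are whether H can generate such candidates and whether the discriminator signal is reliable.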

How do you train an imitator?

The plagiarism example from the last section is also a challenge for imitation learning. Suppose that A has been asked to draw a picture. H would draw a completely original picture. How can we train A to draw an original picture?

The most plausible existing approach is probably generative adversarial networks. In this approach, a discriminator is trained to distinguish A’s behavior from H’s behavior, and A is trained to fool the discriminator.
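A one-dimensional caricature of this framework (all numbers hypothetical): A's behavior is summarized by a single parameter g and H's by a fixed mean. At each step we plug in the optimal discriminator between two Gaussians with that separation, and A takes a non-saturating "fool the discriminator" step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_imitation(h_mean=2.0, sigma2=0.25, steps=200, lr=0.1):
    """GAN-style imitation in one dimension, with an idealized discriminator.

    For two Gaussians with means g and h_mean and variance sigma2, the optimal
    discriminator is logistic with slope (h_mean - g) / sigma2 and a decision
    boundary halfway between the means. A updates g to ascend log d(g).
    """
    g = 0.0
    for _ in range(steps):
        slope = (h_mean - g) / sigma2                    # optimal discriminator
        d_g = sigmoid(slope * (g - (g + h_mean) / 2.0))  # d(g): P(g looks like H)
        g += lr * (1.0 - d_g) * slope                    # non-saturating update
    return g
```

At the fixed point g = h_mean the discriminator has nothing to latch onto (its slope is zero) and A stops moving. Real GAN training replaces both idealizations with learned networks, which is where the practical difficulty comes from.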

But suppose that A draws a picture by copying an existing image. It may be hard for the discriminator to learn to distinguish “original image” from “derivative of existing image,” for exactly the same reasons discussed before. And so A may receive just as high a reward by copying an existing image as by drawing a novel picture.

Unfortunately, solving this problem seems even more difficult for imitation learning than for reinforcement learning. We can’t give the discriminator any access to A’s internal state, since the discriminator isn’t supposed to know whether it is looking at data that came from A or from H.

Instead, it might be easier to use an alternative to the generative adversarial networks framework. There are some plausible contenders, but nothing is currently known that could plausibly scale to general behavior in complex environments. (Though the obstacles are not always obvious.)

3. Inverse reinforcement learning

In IRL, we try to infer a reward function that H is approximately maximizing. We can then use that reward function to train A.

This approach is closely connected to imitation learning, and faces exactly analogous difficulties:

  • H’s behavior does not give much information about the reward function in regions far from H’s trajectory.

  • If we first learn a reward function and then use it to train A, then the reward function is essentially a direct supervisor and faces exactly the same difficulties.

The second problem seems to be the most serious: unless we find a resolution to that problem, direct supervision seems more promising than IRL. (Though IRL may still be a useful technique for the resulting RL problem — understanding the supervisor’s values is a critical subtask of solving an RL problem defined by direct supervision.)

H’s behavior is not sufficiently informative

Consider the block tower example from the last section. If H always quickly builds a perfect block tower, then H’s behavior does not give any evidence about tradeoffs between different imperfections: how much should A be willing to compromise on quality to get the job done faster? If the tower can only be tall or stable, which is preferred?

To get around this difficulty, we would like to elicit information from H other than trajectories. For example, we might ask H questions, and use its answers as evidence about H’s reward function.

Incorporating this information is much less straightforward than incorporating information from H’s behavior. For example, updating on H’s statements requires an explicit model of how H believes its statements relate to its goals, even though we can’t directly observe that relationship. This is much more complex than existing approaches like MaxEnt IRL, which fit a simple model directly to H’s behavior.
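For reference, the "simple model" that MaxEnt IRL fits can be sketched in a few lines. In this toy version (the action set, features, and Boltzmann-rationality assumption are all stipulated), H chooses action a with probability proportional to exp(theta * features[a]), and we recover theta by ascending the log-likelihood of H's observed choices.

```python
import math

def maxent_fit(choices, features, epochs=500, lr=0.5):
    """Fit a linear reward r(a) = theta * features[a] under the MaxEnt model
    P(a) ∝ exp(r(a)). `choices` are the indices of H's observed actions.
    """
    theta = 0.0
    for _ in range(epochs):
        weights = [math.exp(theta * f) for f in features]
        z = sum(weights)
        expected_f = sum(f * w for f, w in zip(features, weights)) / z
        observed_f = sum(features[c] for c in choices) / len(choices)
        theta += lr * (observed_f - expected_f)  # gradient of avg log-likelihood
    return theta

# H picks the high-feature action 9 times out of 10:
theta = maxent_fit(choices=[1] * 9 + [0], features=[0.0, 1.0])
```

Note that answers to questions have no place in this likelihood: it only sees chosen actions, which is exactly the limitation discussed above.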

These issues are central in “cooperative IRL.” For now there are many open problems.

The major difficulties of direct supervision still apply

The bigger problem for IRL is how to represent the reward function:

  • If the reward function is represented by a concrete, learned function from trajectories to rewards, then we are back in the situation of direct supervision.

  • Instead, the reward function may act on an abstract space of “possible worlds.” This approach potentially avoids the difficulties of direct supervision, but it seems to require a particular form of model-based RL. It’s not clear if this constraint will be compatible with the most effective approaches to reinforcement learning.

Ideally we would find a better representation that incorporates the best of both worlds — avoiding the difficulties of direct supervision, without seriously restricting the form of A.

Alternatively, we could hope that powerful RL agents have an appropriate model-based architecture. Or we could do research on appropriate forms of model-based RL to increase the probability that they are competitive.

Research directions

Each of the problems discussed in this post is a possible direction for research. I think that three problems are especially promising:

  • Training ML systems to produce the kind of auxiliary information that could make direct supervision reliable. There are open theoretical questions about how this training should be done — and huge practical obstacles to actually making it work.

  • Developing alternative objectives for imitation learning or generative modeling. There has been a lot of recent progress in this area, and it is probably worth doing more conceptual work to see if we can find new frameworks.

  • Experimenting with “meeting halfway,” or with other practical approaches for producing imitable demonstrations.


If we cannot solve the reward engineering problem in practice, it seems unlikely that we will be able to train robustly beneficial RL agents.

Conversely, if we can solve the reward engineering problem, then I believe that solution could be leveraged into an attack on the whole value alignment problem (along these lines — I will discuss this in more detail over my next few posts).

Reward engineering is not only an important question for AI control, but also appears to be tractable today; there are both theoretical and experimental lines of attack. I’m optimistic that we will understand this problem much better over the coming years, and I think that will be very good news for AI control.

(This research was supported as part of the Future of Life Institute FLI-RFP-AI1 program, grant #2015–143898.)

This was originally posted here on 30th May 2016.

The next post in this sequence will be Capability Amplification by Paul Christiano.

Tomorrow’s post will be in the sequence on Value Learning by Rohin Shah.