Thoughts on reward engineering

Note: This is the first post from part five: pos­si­ble ap­proaches of the se­quence on iter­ated am­plifi­ca­tion. The fifth sec­tion of the se­quence breaks down some of these prob­lems fur­ther and de­scribes some pos­si­ble ap­proaches.

Sup­pose that I would like to train an RL agent to help me get what I want.

If my prefer­ences could be rep­re­sented by an eas­ily-eval­u­ated util­ity func­tion, then I could just use my util­ity func­tion as the agent’s re­ward func­tion. But in the real world that’s not what hu­man prefer­ences look like.

So if we ac­tu­ally want to turn our prefer­ences into a re­ward func­tion suit­able for train­ing an RL agent, we have to do some work.

This post is about the straight­for­ward parts of re­ward en­g­ineer­ing. I’m go­ing to de­liber­ately ig­nore what seem to me to be the hard­est parts of the prob­lem. Get­ting the straight­for­ward parts out of the way seems use­ful for talk­ing more clearly about the hard parts (and you never know what ques­tions may turn out to be sur­pris­ingly sub­tle).

The setting

To sim­plify things even fur­ther, for now I’ll fo­cus on the spe­cial case where our agent is tak­ing a sin­gle ac­tion a. All of the difficul­ties that arise in the sin­gle-shot case also arise in the se­quen­tial case, but the se­quen­tial case also has its own set of ad­di­tional com­pli­ca­tions that de­serve their own post.

Throughout the post I will imagine myself in the position of an “overseer” who is trying to specify a reward function R(a) for an agent. You can imagine the overseer as the user themselves, or (more realistically) as a team of engineers and/or researchers who are implementing a reward function intended to express the user’s preferences.

I’ll of­ten talk about the over­seer com­put­ing R(a) them­selves. This is at odds with the usual situ­a­tion in RL, where the over­seer im­ple­ments a very fast func­tion for com­put­ing R(a) in gen­eral (“1 for a win, 0 for a draw, −1 for a loss”). Com­put­ing R(a) for a par­tic­u­lar ac­tion a is strictly eas­ier than pro­duc­ing a fast gen­eral im­ple­men­ta­tion, so in some sense this is just an­other sim­plifi­ca­tion. I talk about why it might not be a crazy sim­plifi­ca­tion in sec­tion 6.

Contents


  1. Long time hori­zons. How do we train RL agents when we care about the long-term effects of their ac­tions?

  2. In­con­sis­tency and un­re­li­a­bil­ity. How do we han­dle the fact that we have only im­perfect ac­cess to our prefer­ences, and differ­ent query­ing strate­gies are not guaran­teed to yield con­sis­tent or un­bi­ased an­swers?

  3. Nor­ma­tive un­cer­tainty. How do we train an agent to be­have well in light of its un­cer­tainty about our prefer­ences?

  4. Widely vary­ing re­ward. How do we han­dle re­wards that may vary over many or­ders of mag­ni­tude?

  5. Sparse re­ward. What do we do when our prefer­ences are very hard to satisfy, such that they don’t provide any train­ing sig­nal?

  6. Com­plex re­ward. What do we do when eval­u­at­ing our prefer­ences is sub­stan­tially more ex­pen­sive than run­ning the agent?

  • Con­clu­sion.

  • Ap­pendix: harder prob­lems.

1. Long time horizons

A sin­gle de­ci­sion may have very long-term effects. For ex­am­ple, even if I only care about max­i­miz­ing hu­man hap­piness, I may in­stru­men­tally want my agent to help ad­vance ba­sic sci­ence that will one day im­prove can­cer treat­ment.

In prin­ci­ple this could fall out of an RL task with “hu­man hap­piness” as the re­ward, so we might think that ne­glect­ing long-term effects is just a short­com­ing of the sin­gle-shot prob­lem. But even in the­ory there is no way that an RL agent can learn to han­dle ar­bi­trar­ily long-term de­pen­den­cies (imag­ine train­ing an RL agent to han­dle 40 year time hori­zons), and so fo­cus­ing on the se­quen­tial RL prob­lem doesn’t ad­dress this is­sue.

I think that the only real ap­proach is to choose a re­ward func­tion that re­flects the over­seer’s ex­pec­ta­tions about long-term con­se­quences — i.e., the over­seer’s task in­volves both mak­ing pre­dic­tions about what will hap­pen, and value judg­ments about how good it will be. This makes the re­ward func­tion more com­plex and in some sense limits the com­pe­tence of the learner by the com­pe­tence of the re­ward func­tion, but it’s not clear what other op­tions we have.

Be­fore com­put­ing the re­ward func­tion R(a), we are free to ex­e­cute the ac­tion a and ob­serve its short-term con­se­quences. Any data that could be used in our train­ing pro­cess can just as well be pro­vided as an in­put to the over­seer, who can use the aux­iliary in­put to help pre­dict the long-term con­se­quences of an ac­tion.

2. In­con­sis­tency and unreliability

A human judge has no hope of making globally consistent judgments about which of two outcomes is preferred. The best we can hope for is for their judgments to be right in sufficiently obvious cases, and to be some kind of noisy proxy when things get complicated. Actually outputting a numerical reward (that is, implementing some utility function for our preferences) is even more hopelessly difficult.

Another way of seeing the difficulty is to suppose that the overseer’s judgment is a noisy and potentially biased evaluation of the quality of the underlying action. If R(a) and R(a′) are both big numbers with a lot of noise, but the two actions are actually quite similar, then the difference will be dominated by noise. Imagine an overseer trying to estimate the impact of drinking a cup of coffee on Alice’s life by estimating her happiness in a year conditioned on drinking the coffee, estimating happiness conditioned on not drinking the coffee, and then subtracting the estimates.

We can partially address this difficulty by allowing the overseer to make comparisons instead of assessing absolute value. That is, rather than directly implementing a reward function, we can allow the overseer to implement a comparison function C(a, a′): which of two actions a and a′ is better in context? This function can take real values specifying how much one action is better than another, and should be antisymmetric, i.e. C(a, a′) = −C(a′, a).

In the noisy-judg­ments model, we are hop­ing that the noise or bias of a com­par­i­son C(a, a′) de­pends on the ac­tual mag­ni­tude of the differ­ence be­tween the ac­tions, rather than on the ab­solute qual­ity of each ac­tion. This hope­fully means that the to­tal er­ror/​bias does not drown out the ac­tual sig­nal.
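
The coffee example can be simulated in a few lines. This is only a sketch under assumptions of my own choosing: judgment noise is Gaussian, its scale is a fixed fraction (NOISE_FRACTION) of the magnitude being judged, and the happiness numbers are made up. It compares the spread of "estimate each action separately, then subtract" against the spread of a direct comparison whose noise scales with the gap.

```python
import random
import statistics

NOISE_FRACTION = 0.1  # assumed: noise is 10% of the magnitude being judged


def noisy_absolute(true_value, rng):
    """Overseer's noisy estimate of one action's absolute quality."""
    return rng.gauss(true_value, NOISE_FRACTION * abs(true_value))


def difference_of_absolutes(value_a, value_b, rng):
    """Judge each action separately, then subtract the two estimates."""
    return noisy_absolute(value_a, rng) - noisy_absolute(value_b, rng)


def direct_comparison(value_a, value_b, rng):
    """Judge the gap directly; the noise scales with the gap itself."""
    gap = value_a - value_b
    return rng.gauss(gap, NOISE_FRACTION * abs(gap))


def spread(estimator, value_a, value_b, trials=2000, seed=0):
    """Standard deviation of an estimator over repeated noisy judgments."""
    rng = random.Random(seed)
    samples = [estimator(value_a, value_b, rng) for _ in range(trials)]
    return statistics.pstdev(samples)


# Hypothetical numbers: happiness in a year with and without the coffee.
WITH_COFFEE, WITHOUT_COFFEE = 1000.5, 1000.0
spread_subtract = spread(difference_of_absolutes, WITH_COFFEE, WITHOUT_COFFEE)
spread_direct = spread(direct_comparison, WITH_COFFEE, WITHOUT_COFFEE)
```

Under these assumptions the true gap is 0.5, the subtraction estimate has a spread far larger than that gap, and the direct comparison has a spread much smaller than it. Whether real overseer noise behaves this way is exactly the hope stated above, not something the simulation establishes.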

We can then define the de­ci­sion prob­lem as a zero-sum game: two agents pro­pose differ­ent ac­tions a and a′, and re­ceive re­wards C(a, a′) and C(a′, a). At the equil­ibrium of this game, we can at least rest as­sured that the agent doesn’t do any­thing that is un­am­bigu­ously worse than an­other op­tion it could think of. In gen­eral, this seems to give us sen­si­ble guaran­tees when the over­seer’s prefer­ences are not com­pletely con­sis­tent.
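
To illustrate the equilibrium claim, here is a minimal sketch with a finite set of four candidate actions and a made-up comparison matrix. Actions 0, 1 and 2 form an inconsistent preference cycle, while action 3 is judged worse than everything else. The solver (time-averaged multiplicative weights under self-play) is my own choice for approximating the equilibrium of a small symmetric zero-sum game; the post does not prescribe any particular method.

```python
import math

# Hypothetical overseer comparisons: C[i][j] > 0 means action i beats action j.
# Actions 0, 1, 2 form a cycle (0 beats 1, 1 beats 2, 2 beats 0).
# Action 3 loses to every other action.
C = [
    [0, 1, -1, 1],
    [-1, 0, 1, 1],
    [1, -1, 0, 1],
    [-1, -1, -1, 0],
]


def is_antisymmetric(matrix):
    """Check C[i][j] == -C[j][i] for every pair of actions."""
    n = len(matrix)
    return all(matrix[i][j] == -matrix[j][i] for i in range(n) for j in range(n))


def approximate_equilibrium(matrix, steps=2000, learning_rate=0.1):
    """Time-averaged multiplicative-weights strategy under self-play."""
    n = len(matrix)
    strategy = [1.0 / n] * n
    average = [0.0] * n
    for _ in range(steps):
        for i in range(n):
            average[i] += strategy[i] / steps
        # Expected payoff of each pure action against the current mixture.
        payoffs = [
            sum(matrix[i][j] * strategy[j] for j in range(n)) for i in range(n)
        ]
        weights = [
            strategy[i] * math.exp(learning_rate * payoffs[i]) for i in range(n)
        ]
        total = sum(weights)
        strategy = [w / total for w in weights]
    return average


equilibrium = approximate_equilibrium(C)
```

In this toy case the averaged strategy puts almost no weight on action 3 and spreads the rest roughly evenly over the cycle: the inconsistency among actions 0, 1 and 2 is not resolved, but the unambiguously worse option is avoided.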

One sub­tlety is that in or­der to eval­u­ate the com­par­i­son C(a, a′), we may want to ob­serve the short-term con­se­quences of tak­ing ac­tion a or ac­tion a′. But in many en­vi­ron­ments it will only be pos­si­ble to take one ac­tion. So af­ter look­ing at both ac­tions we will need to choose at most one to ac­tu­ally ex­e­cute (e.g. we need to es­ti­mate how good drink­ing coffee was, af­ter ob­serv­ing the short-term con­se­quences of drink­ing coffee but with­out ob­serv­ing the short-term con­se­quences of not drink­ing coffee). This will gen­er­ally in­crease the var­i­ance of C, since we will need to use our best guess about the ac­tion which we didn’t ac­tu­ally ex­e­cute. But of course this is a source of var­i­ance that RL al­gorithms already need to con­tend with.

3. Nor­ma­tive uncertainty

The agent is un­cer­tain not only about its en­vi­ron­ment, but also about the over­seer (and hence the re­ward func­tion). We need to some­how spec­ify how the agent should be­have in light of this un­cer­tainty. Struc­turally, this is iden­ti­cal to the philo­soph­i­cal prob­lem of man­ag­ing nor­ma­tive un­cer­tainty.

One ap­proach is to pick a fixed yard­stick to mea­sure with. For ex­am­ple, our yard­stick could be “adding a dol­lar to the user’s bank ac­count.” We can then mea­sure C(a, a′) as a mul­ti­ple of this yard­stick: “how many dol­lars would we have to add to the user’s bank ac­count to make them in­differ­ent be­tween tak­ing ac­tion a and ac­tion a′?” If the user has diminish­ing re­turns to money, it would be a bit more pre­cise to ask: “what chance of re­plac­ing a with a′ is worth adding a dol­lar to the user’s bank ac­count?” The com­par­i­son C(a, a′) is then the in­verse of this prob­a­bil­ity.

This is exactly analogous to the usual construction of a utility function. In the case of utility functions, our choice of yardstick is totally unimportant: different possible utility functions differ only by a positive scale factor, and so give rise to the same preferences. In the case of normative uncertainty that is no longer the case, because we are specifying how to aggregate the preferences of different possible versions of the overseer.

I think it’s im­por­tant to be aware that differ­ent choices of yard­stick re­sult in differ­ent be­hav­ior. But hope­fully this isn’t an im­por­tant differ­ence, and we can get sen­si­ble be­hav­ior for a wide range of pos­si­ble choices of yard­stick — if we find a situ­a­tion where differ­ent yard­sticks give very differ­ent be­hav­iors, then we need to think care­fully about how we are ap­ply­ing RL.

For many yard­sticks it is pos­si­ble to run into patholog­i­cal situ­a­tions. For ex­am­ple, sup­pose that the over­seer might de­cide that dol­lars are worth­less. They would then rad­i­cally in­crease the value of all of the agent’s de­ci­sions, mea­sured in dol­lars. So an agent de­cid­ing what to do would effec­tively care much more about wor­lds where the over­seer de­cided that dol­lars are worth­less.
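
A small worked example of this pathology, with entirely made-up numbers: two equally likely versions of the overseer compare actions x and y. In the ordinary world x is better by the equivalent of 10 minutes of the user's time, and a minute is worth $1. In the other world the overseer has decided dollars are nearly worthless (a minute is worth $1000) and y is better by 1 minute. The sketch computes the expected comparison under a "minutes" yardstick and under a "dollars" yardstick.

```python
# Each hypothesis about the overseer (all values are illustrative assumptions):
# (probability, advantage of x over y in minutes, dollars per minute of time)
HYPOTHESES = [
    (0.5, 10.0, 1.0),     # ordinary world: x better by 10 minutes, $1 per minute
    (0.5, -1.0, 1000.0),  # dollars nearly worthless: y better by 1 minute
]


def expected_comparison(hypotheses, yardstick):
    """Expected value of C(x, y), measured in the chosen yardstick."""
    total = 0.0
    for probability, advantage_minutes, dollars_per_minute in hypotheses:
        if yardstick == "minutes":
            value = advantage_minutes
        elif yardstick == "dollars":
            value = advantage_minutes * dollars_per_minute
        else:
            raise ValueError(f"unknown yardstick: {yardstick}")
        total += probability * value
    return total


in_minutes = expected_comparison(HYPOTHESES, "minutes")
in_dollars = expected_comparison(HYPOTHESES, "dollars")
```

Measured in minutes the expected comparison is positive, so the agent prefers x. Measured in dollars it is strongly negative, so the agent prefers y, because the world where dollars are worthless dominates the average.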

So it seems best to choose a yard­stick whose value is rel­a­tively sta­ble across pos­si­ble wor­lds. To this effect we could use a broader bas­ket of goods, like 1 minute of the user’s time + 0.1% of the day’s in­come + etc. It may be best for the over­seer to use com­mon sense about how im­por­tant a de­ci­sion is rel­a­tive to some kind of ideal­ized in­fluence in the world, rather than stick­ing to any pre­cisely defined bas­ket.

It is also de­sir­able to use a yard­stick which is sim­ple, and prefer­ably which min­i­mizes the over­seer’s un­cer­tainty. Ideally by stan­dard­iz­ing on a sin­gle yard­stick through­out an en­tire pro­ject, we could end up with defi­ni­tions that are very broad and ro­bust, while be­ing very well-un­der­stood by the over­seer.

Note that if the same agent is be­ing trained to work for many users, then this yard­stick is also spec­i­fy­ing how the agent will weigh the in­ter­ests of differ­ent users — for ex­am­ple, whose ac­cents will it pre­fer to spend mod­el­ing ca­pac­ity on un­der­stand­ing? This is some­thing to be mind­ful of in cases where it mat­ters, and it can provide in­tu­itions about how to han­dle the nor­ma­tive un­cer­tainty case as well. I feel that eco­nomic rea­son­ing is use­ful for ar­riv­ing at sen­si­ble con­clu­sions in these situ­a­tions, but there are other rea­son­able per­spec­tives.

4. Widely vary­ing reward

Some tasks may have widely vary­ing re­wards — some­times the user would only pay 1¢ to move the de­ci­sion one way or the other, and some­times they would pay $10,000.

If small-stakes and large-stakes de­ci­sions oc­cur com­pa­rably fre­quently, then we can es­sen­tially ig­nore the small-stakes de­ci­sions. That will hap­pen au­to­mat­i­cally with a tra­di­tional op­ti­miza­tion al­gorithm — af­ter we nor­mal­ize the re­wards so that the “big” re­wards don’t to­tally de­stroy our model, the “small” re­wards will be so small that they have no effect.

Things get more tricky when small-stakes de­ci­sions are much more com­mon than the large-stakes de­ci­sions. For ex­am­ple, if the im­por­tance of de­ci­sions is power-law dis­tributed with an ex­po­nent of 1, then de­ci­sions of all scales are in some sense equally im­por­tant, and a good al­gorithm needs to do well on all of them. This may sound like a very spe­cial case, but I think it is ac­tu­ally quite nat­u­ral for there to be sev­eral scales that are all com­pa­rably im­por­tant in to­tal.

In these cases, I think we should do im­por­tance sam­pling — we over­sam­ple the high-stakes de­ci­sions dur­ing train­ing, and scale the re­wards down by the same amount, so that the con­tri­bu­tion to the to­tal re­ward is cor­rect. This en­sures that the scale of re­wards is ba­si­cally the same across all epi­sodes, and lets us ap­ply a tra­di­tional op­ti­miza­tion al­gorithm.
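
Here is a minimal sketch of that reweighting, assuming decisions fall into a few discrete types with known natural frequencies and typical stakes (both made up here). One natural choice, which is my assumption rather than something the post specifies, is to sample each type in proportion to frequency times stakes and to multiply its rewards by frequency divided by sampling probability.

```python
def training_distribution(frequencies, stakes):
    """Sample each decision type in proportion to frequency * stakes."""
    mass = [f * s for f, s in zip(frequencies, stakes)]
    total = sum(mass)
    return [m / total for m in mass]


def reward_scales(frequencies, proposal):
    """Importance weights: natural frequency over training frequency."""
    return [f / q for f, q in zip(frequencies, proposal)]


def expected_reward(probabilities, rewards, scales=None):
    """Expected (optionally rescaled) reward under a sampling distribution."""
    if scales is None:
        scales = [1.0] * len(rewards)
    return sum(p * w * r for p, w, r in zip(probabilities, scales, rewards))


# Hypothetical decision types: frequent and cheap through rare and expensive.
frequencies = [0.90, 0.09, 0.01]
stakes = [1.0, 10.0, 100.0]

proposal = training_distribution(frequencies, stakes)
scales = reward_scales(frequencies, proposal)

# After rescaling, every decision type has the same typical reward magnitude.
scaled_stakes = [s * w for s, w in zip(stakes, scales)]

# The reweighted training objective matches the natural objective.
example_rewards = [0.5, -4.0, 80.0]
natural = expected_reward(frequencies, example_rewards)
reweighted = expected_reward(proposal, example_rewards, scales)
```

With this choice the rarest, highest-stakes type is sampled far more often than it occurs naturally, and its rewards are shrunk by the same factor, so the expected total is unchanged while the per-episode reward scale is uniform.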

Fur­ther prob­lems arise when there are some very high-stakes situ­a­tions that oc­cur very rarely. In some sense this just means the learn­ing prob­lem is ac­tu­ally very hard — we are go­ing to have to learn from few sam­ples. Treat­ing differ­ent scales as the same prob­lem (us­ing im­por­tance sam­pling) may help if there is sub­stan­tial trans­fer be­tween differ­ent scales, but it can’t ad­dress the whole prob­lem.

For very rare+high-stakes de­ci­sions it is es­pe­cially likely that we will want to use simu­la­tions to avoid mak­ing any ob­vi­ous mis­takes or miss­ing any ob­vi­ous op­por­tu­ni­ties. Learn­ing with catas­tro­phes is an in­stan­ti­a­tion of this set­ting, where the high-stakes set­tings have only down­side and no up­side. I don’t think we re­ally know how to cope with rare high-stakes de­ci­sions; there are likely to be some fun­da­men­tal limits on how well we can do, but I ex­pect we’ll be able to im­prove a lot over the cur­rent state of the art.

5. Sparse reward

In many prob­lems, “al­most all” pos­si­ble ac­tions are equally ter­rible. For ex­am­ple, if I want my agent to write an email, al­most all pos­si­ble strings are just go­ing to be non­sense.

One ap­proach to this prob­lem is to ad­just the re­ward func­tion to make it eas­ier to satisfy — to provide a “trail of bread­crumbs” lead­ing to high re­ward be­hav­iors. I think this ba­sic idea is im­por­tant, but that chang­ing the re­ward func­tion isn’t the right way to im­ple­ment it (at least con­cep­tu­ally).

In­stead we could treat the prob­lem state­ment as given, but view aux­iliary re­ward func­tions as a kind of “hint” that we might provide to help the al­gorithm figure out what to do. Early in the op­ti­miza­tion we might mostly op­ti­mize this hint, but as op­ti­miza­tion pro­ceeds we should an­neal to­wards the ac­tual re­ward func­tion.
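
One simple way to picture the annealing is as a weighted mix of the hint and the actual reward, with the weight on the hint decaying over training. The linear schedule and the function names below are assumptions for illustration; the post only says that we should anneal towards the actual reward function.

```python
def hint_weight(step, anneal_steps):
    """Weight on the hint: decays linearly from 1 to 0, then stays at 0."""
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)


def training_signal(true_reward, hint_reward, step, anneal_steps):
    """Blend the hint with the actual reward according to the schedule."""
    weight = hint_weight(step, anneal_steps)
    return weight * hint_reward + (1.0 - weight) * true_reward


# Hypothetical episode: the actual reward is 0 (the email is still nonsense),
# but the hint gives partial credit of 1 for promising behavior.
early = training_signal(true_reward=0.0, hint_reward=1.0, step=0, anneal_steps=100)
midway = training_signal(true_reward=0.0, hint_reward=1.0, step=50, anneal_steps=100)
late = training_signal(true_reward=0.0, hint_reward=1.0, step=100, anneal_steps=100)
```

Early in training the signal is entirely the hint, and once the schedule ends it is exactly the actual reward, so the problem statement itself is never changed.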

Typ­i­cal ex­am­ples of proxy re­ward func­tions in­clude “par­tial credit” for be­hav­iors that look promis­ing; ar­tifi­cially high dis­count rates and care­ful re­ward shap­ing; and ad­just­ing re­wards so that small vic­to­ries have an effect on learn­ing even though they don’t ac­tu­ally mat­ter. All of these play a cen­tral role in prac­ti­cal RL.

A proxy re­ward func­tion is just one of many pos­si­ble hints. Pro­vid­ing demon­stra­tions of suc­cess­ful be­hav­ior is an­other im­por­tant kind of hint. Again, I don’t think that this should be taken as a change to the re­ward func­tion, but rather as side in­for­ma­tion to help achieve high re­ward. In the long run, we will hope­fully de­sign learn­ing al­gorithms that au­to­mat­i­cally learn how to use gen­eral aux­iliary in­for­ma­tion.

6. Com­plex reward

A re­ward func­tion that in­tends to cap­ture all of our prefer­ences may need to be very com­pli­cated. If a re­ward func­tion is im­plic­itly es­ti­mat­ing the ex­pected con­se­quences of an ac­tion, then it needs to be even more com­pli­cated. And for pow­er­ful learn­ers, I ex­pect that re­ward func­tions will need to be learned rather than im­ple­mented di­rectly.

It is tempt­ing to sub­sti­tute a sim­ple proxy for a com­pli­cated real re­ward func­tion. This may be im­por­tant for get­ting the op­ti­miza­tion to work, but it is prob­le­matic to change the defi­ni­tion of the prob­lem.

In­stead, I hope that it will be pos­si­ble to provide these sim­ple prox­ies as hints to the learner, and then to use semi-su­per­vised RL to op­ti­mize the real hard-to-com­pute re­ward func­tion. This may al­low us to perform op­ti­miza­tion even when the re­ward func­tion is many times more ex­pen­sive to eval­u­ate than the agent it­self; for ex­am­ple, it might al­low a hu­man over­seer to com­pute the re­wards for a fast RL agent on a case by case ba­sis, rather than be­ing forced to de­sign a fast-to-com­pute proxy.
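
The post does not spell out how semi-supervised RL would work, so the following is only a stand-in to make the cost structure concrete. It assumes the overseer's expensive reward is queried on a small fraction of episodes, and that a cheap proxy score is available for every episode. A least-squares line fitted on the labeled episodes then predicts the reward everywhere else. All names and the linear model are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def estimate_rewards(proxy_scores, expensive_reward, label_every=10):
    """Query the expensive reward on every `label_every`-th episode only."""
    labeled = list(range(0, len(proxy_scores), label_every))
    labels = [expensive_reward(i) for i in labeled]
    slope, intercept = fit_line([proxy_scores[i] for i in labeled], labels)
    estimates = [slope * p + intercept for p in proxy_scores]
    for i, label in zip(labeled, labels):
        estimates[i] = label  # keep the overseer's own judgment where we have it
    return estimates, len(labeled)


# Hypothetical data: 100 episodes with a cheap proxy score each, and an
# overseer whose (expensive) judgment happens to be linear in the proxy.
proxy_scores = [float(i) for i in range(100)]


def overseer_reward(episode):
    return 2.0 * proxy_scores[episode] + 1.0


estimates, num_queries = estimate_rewards(proxy_scores, overseer_reward)
```

In this toy setup the overseer is consulted on 10 of 100 episodes. Real proxies will not be related to the true reward this cleanly, which is part of why this remains a hope rather than a solved problem.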

Even if we are willing to spend much longer computing the reward function than the agent itself, we still won’t be able to find a reward function that perfectly captures our preferences. But it may be just as good to choose a reward function that captures our preferences “for all that the agent can tell,” i.e. such that, conditioned on two outcomes receiving the same expected reward, the agent cannot predict which of them we would prefer. This seems much more realistic, once we are willing to have a reward function with much higher computational complexity than the agent.

Conclusion

In re­in­force­ment learn­ing we of­ten take the re­ward func­tion as given. In real life, we are only given our prefer­ences — in an im­plicit, hard-to-ac­cess form — and need to en­g­ineer a re­ward func­tion that will lead to good be­hav­ior. This pre­sents a bunch of prob­lems. In this post I dis­cussed six prob­lems which I think are rel­a­tively straight­for­ward. (Straight­for­ward from the re­ward-en­g­ineer­ing per­spec­tive — the as­so­ci­ated RL tasks may be very hard!)

Understanding these straightforward problems is important if we want to think clearly about very powerful RL agents. But I expect that most of our time will go into thinking about harder problems, for which we don’t yet have any workable approach. These harder problems may expose more fundamental limits of RL that will require substantially new techniques to address.

Ap­pendix: harder problems

In­formed oversight

The pro­cess that pro­duces a de­ci­sion may en­code im­por­tant in­for­ma­tion about the qual­ity of that de­ci­sion, and if we throw out that in­for­ma­tion then a very bad de­ci­sion may nev­er­the­less re­ceive a high re­ward. For ex­am­ple, if we want an agent to pro­duce origi­nal art, we may not be able to eval­u­ate a paint­ing with­out un­der­stand­ing the pro­cess that pro­duced it (and, in par­tic­u­lar, what other art­work it was in­spired by).

In some cases, re­cov­er­ing in­for­ma­tion about the pro­cess seems fea­si­ble. For ex­am­ple, if the agent is us­ing at­ten­tion or an ex­ter­nal mem­ory, the over­seer can check what art the agent looked at. But in other cases it seems much more challeng­ing: if the agent’s weights have mem­o­rized prop­er­ties of par­tic­u­lar art­work, we would need new tech­niques in or­der to make that in­for­ma­tion available to the over­seer.

It is not clear whether there is any gen­eral way to give the over­seer enough in­sight into the agent’s thought pro­cess to avoid prob­lems like this; there may be some in­her­ent limi­ta­tions on del­e­ga­tion. Either way, it would be nice to un­der­stand the situ­a­tion much bet­ter.

Ca­pa­bil­ity amplification

By fold­ing our prefer­ences and long-term pre­dic­tions into the re­ward func­tion, we are in some sense limit­ing the agent to be no stronger than the over­seer. If we start with a hu­man, then RL will only ever pro­duce hu­man-level rea­son­ing about long-term con­se­quences or about “what is good.”

So we need some other mechanism for pro­duc­ing bet­ter long-term rea­son­ing. I think that the most promis­ing ap­proach is to treat “rea­son­ing well” as a short-term prob­lem, for which hu­mans can provide ad­e­quate over­sight, and then to run that rea­son­ing pro­cess for longer in or­der to pro­duce more pow­er­ful rea­son­ing.

Re­ward hack­ing /​ security

If our re­ward func­tion has any prob­lems, a pow­er­ful RL agent may tend to find ac­tions a that ex­ploit those prob­lems in or­der to re­ceive a high re­ward. In or­der to get good con­se­quences, our re­ward func­tion needs to be suffi­ciently se­cure that the agent can’t at­tack it in this way.

The first line of defense against an at­tack is to avoid im­me­di­ately ex­e­cut­ing the agent’s ac­tion. In­stead, the over­seer can in­spect that ac­tion and try to en­sure that it isn’t in­tended as an at­tack. This leaves at­tacks which the over­seer can­not rec­og­nize as at­tacks, or which do dam­age even when the over­seer looks at them.

If the techniques from the previous sections actually allow the overseer to evaluate the agent’s actions, then they can probably also allow the overseer to detect attacks. Security during evaluation itself is an additional question, though.

The main cause for hope is if the over­seer can (1) be smarter than the agent which is try­ing to at­tack it, and (2) have ac­cess to some in­for­ma­tion about the agent’s thought pro­cess. Hope­fully (2) al­lows the over­seer to over­come the dis­ad­van­tage of the “po­si­tion of the in­te­rior” — if the agent picks a par­tic­u­lar at­tack vec­tor, the over­seer can “watch them think­ing” and then de­vote its en­er­gies to try­ing to de­tect or defend against that par­tic­u­lar at­tack.