Utility ≠ Reward

This es­say is an adap­ta­tion of a talk I gave at the Hu­man-Aligned AI Sum­mer School 2019 about our work on mesa-op­ti­mi­sa­tion. My goal here is to write an in­for­mal, ac­cessible and in­tu­itive in­tro­duc­tion to the worry that we de­scribe in our full-length re­port.

I will skip most of the de­tailed anal­y­sis from our re­port, and en­courage the cu­ri­ous reader to fol­low up this es­say with our se­quence or re­port.

The es­say has six parts:

Two distinctions draws the foundational distinctions between “optimised” and “optimising”, and between utility and reward.

What objectives? discusses the behavioural and internal approaches to understanding the objectives of ML systems.

Why worry? out­lines the risk posed by the util­ity ≠ re­ward gap.

Mesa-op­ti­misers in­tro­duces our lan­guage for analysing this worry.

An al­ign­ment agenda sketches differ­ent al­ign­ment prob­lems pre­sented by these ideas, and sug­gests trans­parency and in­ter­pretabil­ity as a way to solve them.

Where does this leave us? sum­marises the es­say and sug­gests where to look next.

The views ex­pressed here are my own, and do not nec­es­sar­ily re­flect those of my coau­thors or MIRI. While I wrote this es­say in first per­son, all of the core ideas are the fruit of an equal col­lab­o­ra­tion be­tween Joar Skalse, Chris van Mer­wijk, Evan Hub­inger and my­self. I wish to thank Chris and Joar for long dis­cus­sions and in­put as I was writ­ing my talk, and all three, as well as Jaime Sevilla Molina, for thought­ful com­ments on this es­say.

≈3300 words.

Two distinctions

I wish to draw a dis­tinc­tion which I think is cru­cial for clar­ity about AI al­ign­ment, yet is rarely drawn. That dis­tinc­tion is be­tween the re­ward sig­nal of a re­in­force­ment learn­ing (RL) agent and its “util­ity func­tion”[1]. That is to say, it is not in gen­eral true that the policy of an RL agent is op­ti­mis­ing for its re­ward. To ex­plain what I mean by this, I will first draw an­other dis­tinc­tion, be­tween “op­ti­mised” and “op­ti­mis­ing”. Th­ese dis­tinc­tions lie at the core of our mesa-op­ti­mi­sa­tion frame­work.

It’s helpful to be­gin with an anal­ogy. Viewed ab­stractly, biolog­i­cal evolu­tion is an op­ti­mi­sa­tion pro­cess that searches through con­figu­ra­tions of mat­ter to find ones that are good at repli­ca­tion. Hu­mans are a product of this op­ti­mi­sa­tion pro­cess, and so we are to some ex­tent good at repli­cat­ing. Yet we don’t care, by and large, about repli­ca­tion in it­self.

Many things we care about look like repli­ca­tion. One might be mo­ti­vated by start­ing a fam­ily, or by hav­ing a legacy, or by similar closely re­lated things. But those are not repli­ca­tion it­self. If we cared about repli­ca­tion di­rectly, ga­mete dona­tion would be a far more main­stream prac­tice than it is, for in­stance.

Thus I want to dis­t­in­guish the ob­jec­tive of the se­lec­tion pres­sure that pro­duced hu­mans from the ob­jec­tives that hu­mans pur­sue. Hu­mans were se­lected for repli­ca­tion, so we are good repli­ca­tors. This in­cludes hav­ing goals that cor­re­late with repli­ca­tion. But it is plain that we are not mo­ti­vated by repli­ca­tion it­self. As a slo­gan, though we are op­ti­mised for repli­ca­tion, we aren’t op­ti­mis­ing for repli­ca­tion.

Another clear case where “optimised” and “optimising” come apart is that of “dumb” artifacts like bottle caps. They can be heavily optimised for some purpose without optimising for anything at all.

These examples support the first distinction I want to make: optimised ≠ optimising. They also illustrate how this distinction is important in two ways:

  1. A sys­tem op­ti­mised for an ob­jec­tive need not be pur­su­ing any ob­jec­tives it­self. (As illus­trated by bot­tle caps.)

  2. The ob­jec­tive a sys­tem pur­sues isn’t de­ter­mined by the ob­jec­tive it was op­ti­mised for. (As illus­trated by hu­mans.)

The rea­son I draw this dis­tinc­tion is to ask the fol­low­ing ques­tion:

Our ma­chine learn­ing mod­els are op­ti­mised for some loss or re­ward. But what are they op­ti­mis­ing for, if any­thing? Are they like bot­tle caps, or like hu­mans, or nei­ther?

Table 1

In other words, do RL agents have goals? And if so, what are they?

Th­ese ques­tions are hard, and I don’t think we have good an­swers to any of them. In any case, it would be pre­ma­ture, in light of the op­ti­mised ≠ op­ti­mis­ing dis­tinc­tion, to con­clude that a trained RL agent is op­ti­mis­ing for its re­ward sig­nal.

Certainly, the RL agent (understood as the agent’s policy representation, since that is the part that does all of the interesting decision-making) is optimised for performance on its reward function. But in the same way that humans are optimised for replication, but are optimising for our own goals, a policy that was selected for its performance on reward may in fact have its own internally-represented goals, only indirectly linked to the intended reward. A pithy way to put this point is to say that utility ≠ reward, if we want to call the objective a system is optimising for its “utility”. (This is by way of metaphor – I don’t suggest that we must model RL agents as expected utility maximisers.)

Let’s make this more con­crete with an ex­am­ple. Say that we train an RL agent to perform well on a set of mazes. Re­ward is given for find­ing and reach­ing the exit door in each maze (which hap­pens to always be red). Then we freeze its policy and trans­fer the agent to a new en­vi­ron­ment set for test­ing. In the new mazes, the exit doors are blue, and red dis­trac­tor ob­jects are scat­tered el­se­where in the maze. What might the agent do in the new en­vi­ron­ment?

Three things might hap­pen.

  1. It might gen­er­al­ise: the agent could solve the new mazes just as well, reach­ing the exit and ig­nor­ing the dis­trac­tors.

  2. It might break under the distributional shift: the agent, unused to the blue doors and weirdly-shaped distractor objects, could start twitching or walking into walls, and thus fail to reach the exit.

  3. But it might also fail to gen­er­al­ise in a more in­ter­est­ing way: the agent could fail to reach the exit, but could in­stead ro­bustly and com­pe­tently find the red dis­trac­tor in each maze we put it in.
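The first and third outcomes are easier to compare with a toy sketch. The Python below is a minimal illustration, not a trained agent: the `Thing` class, the maze layouts (including the grey crate in training) and the two hand-written policies `seek_door` and `seek_red` are all assumptions introduced for this example. It shows that both policies earn full reward on the training mazes yet behave differently after the shift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    kind: str    # e.g. "door", "crate", "distractor"
    colour: str  # e.g. "red", "blue", "grey"


def reward(choice: Thing) -> float:
    """Training reward: 1 for reaching the exit door, 0 otherwise."""
    return 1.0 if choice.kind == "door" else 0.0


def seek_door(maze):
    """Hypothetical policy whose learned goal is 'reach the door'."""
    return next(t for t in maze if t.kind == "door")


def seek_red(maze):
    """Hypothetical policy whose learned goal is 'reach something red'."""
    return next(t for t in maze if t.colour == "red")


# Training mazes: the exit door always happens to be red.
train_mazes = [[Thing("crate", "grey"), Thing("door", "red")] for _ in range(5)]
# Test mazes: blue exit doors plus a red distractor.
test_mazes = [[Thing("distractor", "red"), Thing("door", "blue")] for _ in range(5)]


def mean_reward(policy, mazes):
    return sum(reward(policy(m)) for m in mazes) / len(mazes)


for policy in (seek_door, seek_red):
    print(policy.__name__,
          "train:", mean_reward(policy, train_mazes),
          "test:", mean_reward(policy, test_mazes))
```

Training reward alone cannot tell the two policies apart; only the shifted mazes do.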

To the ex­tent that it’s mean­ingful to talk about the agent’s goals, the con­trast be­tween the first and third cases sug­gests that those goals de­pend only on its policy, and are dis­tinct from its re­ward sig­nal. It is tempt­ing to say that the ob­jec­tive of the first agent is reach­ing doors; that the ob­jec­tive of the third agent is to reach red things. It does not mat­ter that in both cases, the policy was op­ti­mised to reach doors.

This makes sense if we con­sider how in­for­ma­tion about the re­ward gets into the policy:

Figure 1

For any given ac­tion, the policy’s de­ci­sion is made in­de­pen­dently of the re­ward sig­nal. The re­ward is only used (stan­dardly, at least) to op­ti­mise the policy be­tween ac­tions. So the re­ward func­tion can’t be the policy’s ob­jec­tive – one can­not be pur­su­ing some­thing one has no di­rect ac­cess to. At best, we can hope that what­ever ob­jec­tive the learned policy has ac­cess to is an ac­cu­rate rep­re­sen­ta­tion of the re­ward. But the two can come apart, so we must draw a dis­tinc­tion be­tween the re­ward it­self and the policy’s in­ter­nal ob­jec­tive rep­re­sen­ta­tion.
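A schematic training loop makes this data flow explicit. The sketch below rests on assumptions: it is a tiny tabular, epsilon-greedy learner with a made-up reward function (`env_reward`), not any particular RL library or the setup from our report. The point to notice is that `act` never receives the reward; only `update`, which runs between actions, does.

```python
import random


class TabularPolicy:
    """Toy policy: a table of action values per observation (illustrative only)."""

    def __init__(self, n_actions, epsilon=0.1, lr=0.5, seed=0):
        self.values = {}  # learned parameters: obs -> list of action values
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def act(self, obs, explore=True):
        # The decision reads only the observation and the learned parameters.
        row = self.values.setdefault(obs, [0.0] * self.n_actions)
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        return max(range(self.n_actions), key=row.__getitem__)

    def update(self, obs, action, reward):
        # The reward is consumed here, between actions, by the training procedure.
        row = self.values.setdefault(obs, [0.0] * self.n_actions)
        row[action] += self.lr * (reward - row[action])


def env_reward(obs, action):
    """Hypothetical reward: action 1 is the rewarded one in every state."""
    return 1.0 if action == 1 else 0.0


policy = TabularPolicy(n_actions=2)
for step in range(3000):
    obs = step % 3
    action = policy.act(obs)
    policy.update(obs, action, env_reward(obs, action))

# Deployment: the frozen policy acts with no reward signal anywhere in sight.
frozen_actions = [policy.act(obs, explore=False) for obs in range(3)]
print(frozen_actions)
```

Whatever the frozen policy is "pursuing" at deployment has to live in `values`, since that is all `act` can read.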

To re­cap: whether an AI sys­tem is goal-di­rected or not is not triv­ially an­swered by the fact that it was con­structed to op­ti­mise an ob­jec­tive. To say that is to fail to draw the op­ti­mised ≠ op­ti­mis­ing dis­tinc­tion. If we then take se­ri­ously goal-di­rect­ed­ness in AI sys­tems, then we must draw a dis­tinc­tion be­tween the AI’s in­ter­nal learned ob­jec­tive and the ob­jec­tive it was trained on; that is, draw the util­ity ≠ re­ward dis­tinc­tion.

What ob­jec­tives?

I’ve been talk­ing about the ob­jec­tive of the RL agent, or its “util­ity”, as if it is an in­tu­itively sen­si­ble ob­ject. But what ac­tu­ally is it, and how can we know it? In a given train­ing setup, we know the re­ward. How do we figure out the util­ity?

In­tu­itively, the idea of the in­ter­nal goal be­ing pur­sued by a learned sys­tem feels com­pel­ling to me. Yet right now, we don’t have any good ways to make the in­tu­ition pre­cise – figur­ing out how to do that is an im­por­tant open ques­tion. As we start think­ing about how to make progress, there are at least two ap­proaches we can take: what I’d call the be­havi­oural ap­proach and the in­ter­nal ap­proach.

Tak­ing the be­havi­oural ap­proach, we look at how de­ci­sions made by a sys­tem sys­tem­at­i­cally lead to cer­tain out­comes. We then in­fer ob­jec­tives from study­ing those de­ci­sions and out­comes, treat­ing the sys­tem as a black box. For ex­am­ple, we could ap­ply In­verse Re­in­force­ment Learn­ing to our trained agents. Eliezer’s for­mal­i­sa­tion of op­ti­mi­sa­tion power also seems to fol­low this ap­proach.

Or, we can peer in­side the sys­tem, try­ing to un­der­stand the al­gorithm im­ple­mented by it. This is the in­ter­nal ap­proach. The goal is to achieve a mechanis­tic model that is ab­stract enough to be use­ful, but still grounded in the agent’s in­ner work­ings. In­ter­pretabil­ity and trans­parency re­search take this ap­proach gen­er­ally, though as far as I can tell, the spe­cific ques­tion of ob­jec­tives has not yet seen much at­ten­tion.

It’s un­clear whether one ap­proach is bet­ter, as both po­ten­tially offer use­ful tools. At pre­sent, I am more en­thu­si­as­tic about the in­ter­nal ap­proach, both philo­soph­i­cally and as a re­search di­rec­tion. Philo­soph­i­cally, I am more ex­cited about it be­cause un­der­stand­ing a model’s de­ci­sion-mak­ing feels more ex­plana­tory[2] than mak­ing gen­er­al­i­sa­tions about its be­havi­our. As a re­search di­rec­tion, it has po­ten­tial for em­piri­cally-grounded in­sights which might scale to fu­ture pro­saic AI sys­tems. Ad­di­tion­ally, there is the pos­si­bil­ity of low-hang­ing fruit, as this space ap­pears un­der­ex­plored.

Why worry?

Utility and re­ward are dis­tinct. So what? If a sys­tem is truly op­ti­mised for an ob­jec­tive, de­ter­min­ing its in­ter­nal mo­ti­va­tion is an unim­por­tant aca­demic de­bate. Only its real-world perfor­mance mat­ters, not the cor­rect in­ter­pre­ta­tion of its in­ter­nals. And if the perfor­mance is op­ti­mal, then isn’t our work done?

In practice, we don’t get to optimise performance completely. We want to generalise from limited training data, and we want our systems to be robust to situations not foreseen in training. This means that we don’t get to have a model that’s perfectly optimised for the thing we actually want. We don’t get optimality on the full deployment distribution complete with unexpected situations. At best, we know that the system is optimal on the training distribution. In this case, knowing whether the internal objective of the system matches the objective we selected it for becomes crucial: if the system’s capabilities generalise while its internal goal is misaligned, bad things can happen.

Say that we prove, some­how, that op­ti­mis­ing the world with re­spect to some ob­jec­tive is safe and use­ful, and that we can train an RL agent us­ing that ob­jec­tive as re­ward. The util­ity ≠ re­ward dis­tinc­tion means that even in that ideal sce­nario, we are still not done with al­ign­ment. We still need to figure out a way to ac­tu­ally in­stall that ob­jec­tive (and not a differ­ent ob­jec­tive that still re­sults in op­ti­mal perfor­mance in train­ing) into our agent. Other­wise, we risk cre­at­ing an AI that ap­pears to work cor­rectly in train­ing, but which is re­vealed to be pur­su­ing a differ­ent goal when an un­usual situ­a­tion hap­pens in de­ploy­ment. So long as we don’t un­der­stand how ob­jec­tives work in­side agents, and how we can in­fluence those ob­jec­tives, we can­not be cer­tain of the safety of any sys­tem we build, even if we liter­ally some­how have a proof that the re­ward it was trained on was “cor­rect”.

Will highly-capable AIs be goal-directed? I don’t know for sure, and it seems hard to gather evidence about this, but my guess is yes. Detailed discussion is beyond our scope, but I invite the interested reader to look at some arguments about this that we present in section 2 of the report. I also endorse Rohin Shah’s “Will Humans Build Goal-Directed Agents?”

All this opens the pos­si­bil­ity for mis­al­ign­ment be­tween re­ward and util­ity. Are there rea­sons to be­lieve the two will ac­tu­ally come apart? By de­fault, I ex­pect them to. Am­bi­guity and un­der­de­ter­mi­na­tion of re­ward mean that there are many dis­tinct ob­jec­tives that all re­sult in the same be­havi­our in train­ing, but which can dis­agree in test­ing. Think of the maze agent, whose re­ward in train­ing could mean “go to red things” or “go to doors”, or a com­bi­na­tion of the two. For rea­sons of bounded ra­tio­nal­ity, I also ex­pect pres­sures for learn­ing prox­ies for the re­ward in­stead of the true re­ward, when such prox­ies are available. Think of hu­mans, whose goals are largely prox­ies for re­pro­duc­tive suc­cess, rather than repli­ca­tion it­self. (This was a very brief overview; sec­tion 3 of our re­port ex­am­ines this ques­tion in depth, and ex­pands on these points more.)
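The underdetermination point can be made concrete with a small enumeration. This is a sketch under stated assumptions: the states, the candidate objectives and the helper `matches_reward` are invented for illustration, and real objectives would not be tidy predicates like these. It lists which candidates agree with the training reward on every training state, and which of those still agree once the colours shift.

```python
# Each state is (kind, colour) of the thing the agent has reached.
# Training: the only door is red. Test: a blue door and a red distractor.
train_states = [("door", "red"), ("floor", "grey"), ("wall", "grey")]
test_states = [("door", "blue"), ("distractor", "red")]


def training_reward(state):
    """The reward we actually specified: 1 for reaching a door."""
    kind, _colour = state
    return 1.0 if kind == "door" else 0.0


# Hypothetical objectives a policy might have internalised.
candidates = {
    "go to doors": lambda kind, colour: kind == "door",
    "go to red things": lambda kind, colour: colour == "red",
    "go to red doors": lambda kind, colour: kind == "door" and colour == "red",
    "go to blue things": lambda kind, colour: colour == "blue",
}


def matches_reward(objective, states):
    """True if the objective assigns the same value as the reward on every state."""
    return all(float(objective(*s)) == training_reward(s) for s in states)


indistinguishable = [name for name, f in candidates.items()
                     if matches_reward(f, train_states)]
still_matching = [name for name in indistinguishable
                  if matches_reward(candidates[name], test_states)]

print(indistinguishable)  # ['go to doors', 'go to red things', 'go to red doors']
print(still_matching)     # ['go to doors']
```

Three of the four candidates are indistinguishable from the reward in training, and nothing in the training signal favours the one that keeps matching afterwards.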

The sec­ond rea­son these ideas mat­ter is that we might not want goal-di­rect­ed­ness at all. Maybe we just want tool AI, or AI ser­vices, or some other kind of non-agen­tic AI. Then, we want to be cer­tain that our AI is not some­how goal-di­rected in a way that would cause trou­ble off-dis­tri­bu­tion. This could hap­pen with­out us build­ing it in – af­ter all, evolu­tion didn’t set out to make goal-di­rected sys­tems. Goal-di­rect­ed­ness just turned out to be a good fea­ture to in­clude in its repli­ca­tors. Like­wise, it may be that goal-di­rect­ed­ness is a perfor­mance-boost­ing fea­ture in clas­sifiers, so pow­er­ful op­ti­mi­sa­tion tech­niques would cre­ate goal-di­rected clas­sifiers. Yet per­haps we are will­ing to take the perfor­mance hit in ex­change for en­sur­ing our AI is non-agen­tic. Right now, we don’t even get to choose, be­cause we don’t know when sys­tems are goal-di­rected, nor how to in­fluence learn­ing pro­cesses to avoid learn­ing goal-di­rect­ed­ness.

Tak­ing a step back, there is some­thing fun­da­men­tally con­cern­ing about all this.

We don’t un­der­stand our AIs’ ob­jec­tives, and we don’t know how to set them.

I don’t think this phrase should ring true in a world where we hope to build friendly AI. Yet to­day, to my ears, it does. I think that is a good rea­son to look more into this ques­tion, whether to solve it or to as­sure our­selves that the situ­a­tion is less bad than it sounds.


Mesa-optimisers

This worry is the subject of our report. The framework of mesa-optimisation is a language for talking about goal-directed systems under the influence of optimisation processes, and about the objectives involved.

A part of me is wor­ried that the ter­minol­ogy in­vites view­ing mesa-op­ti­misers as a de­scrip­tion of a very spe­cific failure mode, in­stead of as a lan­guage for the gen­eral worry de­scribed above. I don’t know to what de­gree this mis­con­cep­tion oc­curs in prac­tice, but I wish to pre­empt it here any­way. (I want data on this, so please leave a com­ment if you had con­fu­sions about this af­ter read­ing the origi­nal re­port.)

In brief, our terms de­scribe the re­la­tion­ship be­tween a sys­tem do­ing some op­ti­mi­sa­tion (the base op­ti­miser, e.g.: evolu­tion, SGD), and a goal-di­rected sys­tem (the mesa-op­ti­miser, e.g.: hu­man, ML model) that is be­ing op­ti­mised by that first sys­tem. The ob­jec­tive of the base op­ti­miser is the base ob­jec­tive; the in­ter­nal ob­jec­tive of the mesa-op­ti­miser is the mesa-ob­jec­tive.

Figure 2

(“Mesa” is a Greek word that means the op­po­site of “meta”. The rea­son we use “mesa” is to high­light that the mesa-op­ti­miser is an op­ti­miser that is it­self be­ing op­ti­mised by an­other op­ti­miser. It is a kind of dual to a meta-op­ti­miser, which is an op­ti­miser that is it­self op­ti­mis­ing an­other op­ti­miser.

While we’re on the topic of terms, “inner optimiser” is a confusing term that we used in the past to refer to the same thing as “mesa-optimiser”. It did not accurately reflect the concept, and has been retired in favour of the current terminology. Please use “mesa-optimiser” instead.)

I see “op­ti­miser” in “mesa-op­ti­miser” as a way of cap­tur­ing goal-di­rect­ed­ness, rather than a com­mit­ment to some kind of (util­ity-)max­imis­ing struc­ture. What feels im­por­tant to me is the goal-di­rect­ed­ness of the mesa-op­ti­miser, not its op­ti­mi­sa­tional na­ture: a goal-di­rected sys­tem which isn’t tak­ing strictly op­ti­mal ac­tions (but which is still com­pe­tent at pur­su­ing its mesa-ob­jec­tive) is still wor­ry­ing. It seems plau­si­ble that op­ti­mi­sa­tion is a good way to model goal-di­rect­ed­ness—though I don’t think we have made much progress on that front—but equally, it seems plau­si­ble that some other ap­proach we have not yet ex­plored could work bet­ter. So I my­self read the “op­ti­miser” in “mesa-op­ti­miser” analo­gously to how I ac­cept treat­ing hu­mans as op­ti­misers; as a metaphor, more than any­thing else.

I am not sure that mesa-op­ti­mi­sa­tion is the best pos­si­ble fram­ing of these con­cerns. I would wel­come more work that at­tempts to un­tan­gle these ideas, and to im­prove our con­cepts.

An al­ign­ment agenda

There are at least three al­ign­ment-re­lated ideas prompted by this worry.

The first is un­in­tended op­ti­mi­sa­tion. How do we en­sure that sys­tems that are not sup­posed to be goal-di­rected ac­tu­ally end up be­ing not-goal-di­rected?

The sec­ond is to fac­tor al­ign­ment into in­ner al­ign­ment and outer al­ign­ment. If we ex­pect our AIs to be goal-di­rected, we can view al­ign­ment as a two-step pro­cess. First, en­sure outer al­ign­ment be­tween hu­mans and the base ob­jec­tive of the AI train­ing setup, and then en­sure in­ner al­ign­ment be­tween the base ob­jec­tive and the mesa-ob­jec­tive of the re­sult­ing sys­tem. The former in­volves find­ing low-im­pact, cor­rigible, al­igned with hu­man prefer­ences, or oth­er­wise de­sir­able re­ward func­tions, and has been the fo­cus of much of the progress made by the al­ign­ment com­mu­nity so far. The lat­ter in­volves figur­ing out learned goals, in­ter­pretabil­ity, and a whole host of other po­ten­tial ap­proaches that have not yet seen much pop­u­lar­ity in al­ign­ment re­search.

The third is something I want to call end-to-end alignment. It’s not obvious that alignment must factor in the way described above. There is room for trying to set up training in such a way as to guarantee a friendly mesa-objective somehow without matching it to a friendly base-objective. That is: to align the AI directly to its human operator, instead of aligning the AI to the reward, and the reward to the human. It’s unclear how this kind of approach would work in practice, but this is something I would like to see explored more. I am drawn to staying focused on what we actually care about (the mesa-objective) and treating other features as merely levers that influence the outcome.

We must make progress on at least one of these prob­lems if we want to guaran­tee the safety of pro­saic AI. If we don’t want goal-di­rected AI, we need to re­li­ably pre­vent un­in­tended op­ti­mi­sa­tion. Other­wise, we want to solve ei­ther in­ner and outer al­ign­ment, or end-to-end al­ign­ment. Suc­cess at any of these re­quires a bet­ter un­der­stand­ing of goal-di­rect­ed­ness in ML sys­tems, and a bet­ter idea of how to con­trol the emer­gence and na­ture of learned ob­jec­tives.

More broadly, it seems that tak­ing these wor­ries se­ri­ously will re­quire us to de­velop bet­ter tools for look­ing in­side our AI sys­tems and un­der­stand­ing how they work. In light of these con­cerns I feel pes­simistic about rely­ing solely on black-box al­ign­ment tech­niques. I want to be able to rea­son about what sort of al­gorithm is ac­tu­ally im­ple­mented by a pow­er­ful learned sys­tem if I am to feel com­fortable de­ploy­ing it.

Right now, learned sys­tems are (with maybe the ex­cep­tion of fea­ture rep­re­sen­ta­tion in vi­sion) more-or-less hope­lessly opaque to us. Not just in terms of goals, which is the topic here—most as­pects of their cog­ni­tion and de­ci­sion-mak­ing are ob­scure. The al­ign­ment con­cern about ob­jec­tives that I am pre­sent­ing here is just one ar­gu­ment for why we should take this ob­scu­rity se­ri­ously; there may be other risks hid­ing in our poor un­der­stand­ing of AI in­ner work­ings.

Where does this leave us?

In sum­mary, whether a learned sys­tem is pur­su­ing any ob­jec­tive is far from a triv­ial ques­tion. It is also not triv­ially true that a sys­tem op­ti­mised for achiev­ing high re­ward is op­ti­mis­ing for re­ward.

This means that with our cur­rent tech­niques and un­der­stand­ing, we don’t get to know or con­trol what ob­jec­tive a learned sys­tem is pur­su­ing. This mat­ters be­cause in un­usual situ­a­tions, it is that ob­jec­tive that will de­ter­mine the sys­tem’s be­havi­our. If that ob­jec­tive mis­matches the base ob­jec­tive, bad things can hap­pen. More broadly, our ig­no­rance about the cog­ni­tion of cur­rent sys­tems does not bode well for our prospects at un­der­stand­ing cog­ni­tion in more ca­pa­ble sys­tems.

This forms a sub­stan­tial hole in our prospects at al­ign­ing pro­saic AI. What sort of work would help patch this hole? Here are some can­di­dates:

  • Em­piri­cal work. Distill­ing ex­am­ples of goal-di­rected sys­tems and cre­at­ing con­vinc­ing scaled-down ex­am­ples of in­ner al­ign­ment failures, like the maze agent ex­am­ple.

  • Philo­soph­i­cal, de­con­fu­sion and the­o­ret­i­cal work. Im­prov­ing our con­cep­tual frame­works about goal-di­rect­ed­ness. This is a promis­ing place for philoso­phers to make tech­ni­cal con­tri­bu­tions.

  • In­ter­pretabil­ity and trans­parency. Get­ting bet­ter tools for un­der­stand­ing de­ci­sion-mak­ing, cog­ni­tion and goal-rep­re­sen­ta­tion in ML sys­tems.

Th­ese feel to me like the most di­rect at­tacks on the prob­lem. I also think there could be rele­vant work to be done in ver­ifi­ca­tion, ad­ver­sar­ial train­ing, and even psy­chol­ogy and neu­ro­science (I have in mind some­thing like a re­view of how these pro­cesses are un­der­stood in hu­mans and an­i­mals, though that might come up with noth­ing use­ful), and likely in many more ar­eas: this list is not in­tended to be ex­haus­tive.

While the pre­sent state of our un­der­stand­ing feels in­ad­e­quate, I can see promis­ing re­search di­rec­tions. This leaves me hope­ful that we can make sub­stan­tial progress, how­ever con­fus­ing these ques­tions ap­pear to­day.

  1. By “utility”, I mean something like “the goal pursued by a system”, in the way that it’s used in decision theory. In this post, I am using this word loosely, so I don’t give a precise definition. In general, however, clarity on what exactly “utility” means for an RL agent is an important open question.

  2. Perhaps the intuition I have is a distant cousin to the distinction drawn by Einstein between principle and constructive theories. The internal approach seems more like a “constructive theory” of objectives.