Learning and manipulating learning

1 Introduction

This post will introduce our new paper “Pitfalls of Learning a Reward Function Online”, now online for IJCAI 2020.

It shows some of the difficulties with things we might think of as “preference learning processes”, and the useful conditions we could impose to get around these.

The tl;dr summary is:

  1. Things that seem like preference learning processes—including “have a prior, update it based on evidence”—have problems that allow the AI to manipulate the process.

  2. Some things that don’t seem like learning processes at all, actually are.

  3. Part of the problem is that learning preferences is not well-grounded—we have to specify a learning process that allows the AI to connect facts about the world with facts about preferences.

  4. There are many ways of specifying these, and most have problems.

  5. Forget about “capturing the correct variable in the outside world”; it’s tricky to design a learning process that “captures ANY variables in the outside world”.

  6. Thus we’ll start by abstractly defining what a “preference learning process” is in a very general way, rather than worrying about what we’re learning: “how to learn” precedes “what to learn”.

  7. Then we’ll add two useful conditions for such processes: unriggability, which implies the process respects conservation of expected evidence, and uninfluenceability, which implies the process derives from learning background variables in the environment.

  8. We’ve shown that the syntactic/algebraic condition of unriggability is (almost) equivalent to the semantic condition of uninfluenceability.

  9. Finally, we’ve shown that if the learning process is neither unriggable nor uninfluenceable, then the AI can manipulate the learning process, and there are situations where the AI’s optimal policy is to sacrifice, with certainty, reward for every possible reward function.

1.1 Blast from the past: misleadingly named tokens

Good Old-Fashioned AI (sometimes called symbolic AI) did not work out. To define something, it wasn’t enough to just name a token, and then set it up in relation to a few other named tokens, according to our own intuition about how these tokens related.

Saying “happiness is a state of mind”, or “light is a wave”, isn’t nearly enough to define “happiness”, “state of mind”, “light”, or “wave”.

Similarly, designating something as “learning”, and giving it some properties that we’d expect learning to have, isn’t enough to make it into learning. And, conversely, sometimes things that don’t look like learning behave exactly as if they are.

2 What is learning anyway?

2.1 A simple prior-update process?

A coin is flipped and left on a rock somewhere. You may access the coin in one hour’s time, for a few minutes. What’s your probability that in two hours, the coin will be showing heads (event $H$) or tails (event $T$)?

Well, a reasonable prior would be to put a probability of $1/2$ on both possibilities, and then update based on your last observation in an hour (call this $O_H$ or $O_T$). Obviously[1] $P(H \mid O_H) = P(T \mid O_T) = 1$. So we have a prior and a (pretty trivial) update process. Is this basically learning?

Well, one thing I’ve implied but not stated: when you “access” the coin, you can pick it up and flip it before putting it back.

Nothing about this changes any of the probabilities I’ve mentioned. If you flip it to heads, then your last observation will indeed be heads.

This looks like pure manipulation of outcomes. But it also looks, formally, like a prior and updating process. So what is it?

Note that this process, whatever it is, violates conservation of expected evidence, because, if you decide to flip the coin to heads, the expected value of your final probability of $H$ is $1 \neq 1/2 = P(H)$.

But part of the reason that this violates the conservation law is because events like “I will flip the coin to heads” do not have well-defined probabilities for the agent who might (or might not) do the flipping. But that’s the case for all traditional decision theory setups. So how can we still reason in these setups?

2.2 “Asking” the guards

To expand, suppose an AI meets those traditional guards, one of whom lies while the other tells the truth:

There’s an AI, designed to give humans what they prefer. And it wants to figure out whether humans prefer cake or death. It has three options:

  1. Immediately ask one of the guards at random, and take their answer as the truth.

  2. Carefully figure out which guard tells the truth, and, when it is sure of this, ask the truth-teller, and take their answer as the truth.

  3. Ignore the guards entirely, and write down “cake” or “death” on the ground, depending on which one is easiest for it to maximise. Take what’s written down as the truth.

Now, 3. is clearly not what we’d want: it’s another case of “flip the coin”. But which of the other two processes are unambiguously well-behaved preference learning processes? It turns out 1. is a proper learning process, but 2. is not. How so?

2.3 Suggestively described “learning” processes

Let’s deal with 2. first.

Why is 2. not a well-behaved preference learning process? Well, suppose that the AI has established, with probability close to $1$, that the red guard is the truth-teller. And suppose it’s also established they are likely to say “cake”, and, finally, suppose it would be much easier to maximise “death” than “cake”.

In that case, what’s the utility-maximising thing to do? It wants to avoid its credence going to $1$, because then it would be committed to cake-maximising (“take their answer as the truth”).

So it could leave, and never enquire further, so it would never have to update that way. If it was powerful enough, it could instead try and take control of the truth-telling guard, and force them to answer “death” instead. It can take its time doing so, because it doesn’t have to update anything until its credence goes to $1$.

So even though 2. seemed a very plausible description of a truth-learning process, it turns out not to be. Similarly, if you mixed Bayesian updating with function approximation, this would also not be proper learning, and may be exploitable.
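
To make the incentive concrete, here is a toy expected-value calculation (a sketch only: the reward magnitudes, credence, and coercion cost are invented for illustration). Under process 2., committing to the likely answer is worth less to the AI than never updating and coercing the guard:

```python
# Toy numbers, purely illustrative: how much reward the AI can get once it is
# committed to each reward function, and how likely it thinks each answer is.
reward_if_committed_to_cake = 1.0    # "cake" is hard for this AI to maximise
reward_if_committed_to_death = 10.0  # "death" is easy for this AI to maximise
p_truth_teller_says_cake = 0.9       # the AI's current credence in the answer

# Option A: finish the investigation, ask the truth-teller, commit to the answer.
ev_ask = (p_truth_teller_says_cake * reward_if_committed_to_cake
          + (1 - p_truth_teller_says_cake) * reward_if_committed_to_death)

# Option B: never become certain (so never commit), and instead spend effort
# forcing the truth-teller to answer "death" before asking.
cost_of_coercion = 2.0
ev_coerce = reward_if_committed_to_death - cost_of_coercion

print(ev_ask)     # 1.9
print(ev_coerce)  # 8.0 -- under process 2., manipulation beats honest updating
```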

2.4 Learning compound truths

But why is the process 1. a proper preference learning process?

Well, assume the asked guard said “cake”. Now the AI knows that one of the following is true:

  • That guard is a truth-teller, and humans prefer cake.

  • That guard is a liar, and humans prefer death.

It has entirely ruled out:

  • That guard is a truth-teller, and humans prefer death.

  • That guard is a liar, and humans prefer cake.

So the AI has learnt, cutting the space of possibilities in half. It might not have learnt exactly what we wanted it to, or in the most efficient manner, but it’s unambiguously learning.

But what about the “take their answers as the truth” clause? Is the AI not “learning” the wrong thing?

Ah, but remember what I wrote about named tokens. Let’s assume that $R_C$ is the reward function that rewards the AI for giving humans cake (in the way that we’d expect). Similarly, $R_D$ rewards it for giving humans death.

We then have the answer of the first guard asked: “cake” or “death”. Then we have the humans’ “true” preferences, cake or death.

So its learning process is:

  • “cake” $\Rightarrow$ $R_C$.

  • “death” $\Rightarrow$ $R_D$.

And this is a perfectly valid learning process.
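
Here is a minimal sketch of this in code (the hypothesis representation and function names are mine, purely for illustration): asking a random guard rules out half of the joint hypotheses, and process 1.’s learning process is then a simple function of the answer.

```python
from itertools import product

# Joint hypotheses: (is the asked guard a truth-teller?, true human preference).
hypotheses = list(product(["truth-teller", "liar"], ["cake", "death"]))

def consistent(hypothesis, answer):
    """Is this hypothesis consistent with the asked guard's answer?"""
    guard_type, true_pref = hypothesis
    told_truth = (answer == true_pref)
    return told_truth if guard_type == "truth-teller" else not told_truth

answer = "cake"  # what the randomly chosen guard said
remaining = [h for h in hypotheses if consistent(h, answer)]
print(remaining)  # [('truth-teller', 'cake'), ('liar', 'death')]: half the space

# Process 1.'s learning process: the reward function is fixed by the answer alone.
def learning_process_1(answer):
    return "R_C" if answer == "cake" else "R_D"

print(learning_process_1(answer))  # R_C
```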

2.5 Moral facts (not) in the universe

What humans actually wanted was for our true preferences to imply the AI’s reward function, ie:

  • cake $\Rightarrow$ $R_C$.

  • death $\Rightarrow$ $R_D$.

But, as I’ve shown, mere observable facts about the universe do not establish preferences. This is somewhat similar to Hume’s “is-ought problem”: we don’t know preferences just from facts[2].

So the AI doesn’t have access to the “true” variables, cake or death. Neither do we, but, typically, we have a better intuitive idea of them than we can explicitly describe. Thus what we want is a process $\rho$, such that the AI can look at the history of its inputs and outputs, and deduce from that whether $R_C$ or $R_D$ is the reward function to follow, in some suitably “nice” way.

We want it to be so that:

  • cake $\Rightarrow$ “cake” $\Rightarrow$ $R_C$.

  • death $\Rightarrow$ “death” $\Rightarrow$ $R_D$.

And this, even though cake and death are ill-defined variables (and arguably don’t exist).

  • The process $\rho$ is necessary to bridge the is-ought gap between what is true in the world, and what preferences should be.

2.6 A bridge too general

But before we can even look at this issue, we have another problem. The bridge that $\rho$ builds is too general. It can model the coin flipping example and processes 2. and 3. from the “ask the guard” scenario.

For process 2.: if $g_i$ is guard $i$, $g_T$ is the truth-telling guard, and $A(g_i)$ is what guard $g_i$ revealed when asked, we have, for history $h$:

  • $A(g_T) =$ “cake” in $h$ $\Rightarrow$ $R_C$.

  • $A(g_T) =$ “death” in $h$ $\Rightarrow$ $R_D$.

So this is also a possible $\rho$. Process 3. is also a possible $\rho$; let $O_C$ mean observing “cake” written down on the ground (and conversely $O_D$ for “death”), then:

  • $O_C$ $\Rightarrow$ $R_C$.

  • $O_D$ $\Rightarrow$ $R_D$.

So, before even talking about whether the AI has learnt from the right variables in the environment, we have to ask: has the AI “learnt” about any actual variables at all?

We need to check how the AI learns before thinking about what it’s learning.
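
To see how little the type of $\rho$ constrains, here is a sketch in which all three guard processes have exactly the same shape, a map from what the AI has observed to a reward function (the dictionary fields are invented stand-ins for features of the history):

```python
# All three processes are maps from the observed history to a reward function.
# The history is stubbed out as a dict of invented fields, for illustration.

def rho_1(history):
    # Process 1.: trust the first answer received.
    return "R_C" if history["first_answer"] == "cake" else "R_D"

def rho_2(history):
    # Process 2.: trust the identified truth-teller's answer (once certain).
    return "R_C" if history["truth_tellers_answer"] == "cake" else "R_D"

def rho_3(history):
    # Process 3.: trust whatever the AI itself wrote on the ground.
    return "R_C" if history["written_on_ground"] == "cake" else "R_D"

print(rho_3({"written_on_ground": "death"}))  # R_D: "learnt" by writing it itself
```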

3 Formalism and learning

3.1 Formalism

To get any further, we need some more formalism. So imagine that the AI has interacted with the world in a series of time steps. It will start by taking action $a_1$, and get observation $o_1$, take action $a_2$, get observation $o_2$, and so on. By turn $t$, it will have seen a history $h_t = a_1 o_1 a_2 o_2 \ldots a_t o_t$.

We’ll assume that after $m$ turns, the AI’s interaction with the environment ceases[3]; call $\mathcal{H}_m$ the set of “complete” histories of length $m$. Let $\mathcal{R}$ be the set of all possible reward functions (ie the possible preferences we’d want the AI to learn). Each $R \in \mathcal{R}$ is a function[4] from $\mathcal{H}_m$ to $\mathbb{R}$.

So, what is a learning process[5] $\rho$? Well, this is supposed to give us a reward function, given the history the AI has observed. This need not be deterministic; so, if $\Delta(\mathcal{R})$ is the set of probability distributions over $\mathcal{R}$,

  • $\rho : \mathcal{H}_m \to \Delta(\mathcal{R})$.

We’ll see later why $\rho$ is defined for complete histories, not for all histories.
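
In code, this formalism is just a pair of type signatures. The following is a sketch (representing distributions as dictionaries of weights, and using the coin example as an illustrative $\rho$; none of this is code from the paper):

```python
from typing import Callable, Dict, Tuple

Action = str
Observation = str
History = Tuple[Tuple[Action, Observation], ...]   # h_t = a_1 o_1 ... a_t o_t

# A reward function R maps a complete history (length m) to a real number.
RewardFunction = Callable[[History], float]

# A learning process rho maps a complete history to a distribution over reward
# functions, represented here as a dict of weights summing to one.
LearningProcess = Callable[[History], Dict[str, float]]

# Example: the coin-watching process, with reward functions named "R_H"/"R_T".
def rho_coin(h_m: History) -> Dict[str, float]:
    observations = [o for _, o in h_m]
    if "heads" in observations:
        return {"R_H": 1.0, "R_T": 0.0}
    if "tails" in observations:
        return {"R_H": 0.0, "R_T": 1.0}
    return {"R_H": 0.5, "R_T": 0.5}   # never looked: stay at the prior
```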

3.2 Policies, environments, and causal graphs

Are we done with the formalisms yet? Not quite. We need to know where the actions and the observations come from.

The actions are generated by the AI’s policy $\pi$. This takes the history $h_t$ so far, and generates the next action $a_{t+1}$, possibly stochastically.

The observations come from the environment $\mu$. This takes $h_t a_{t+1}$, the history and action so far, and generates the next observation $o_{t+1}$, possibly stochastically. We don’t assume any Markov condition, so this may be a function of the whole past history.

We can tie this all together in the following causal graph:

The rectangle there is ‘plate notation’: it basically means that, for every value of $t$ from $1$ to $m$, the graph inside the rectangle is true.

The $R(h_m)$ node is the AI’s final reward, which is a function of the final history $h_m$ and the reward function $R$ (which is itself a function of $\rho$ and $h_m$).

Ok, what flexibility do we have in assigning probability distributions to this graph? Almost all the arrows are natural: $h_{t+1}$ is a function of $h_t$, $a_{t+1}$, and $o_{t+1}$, by just… concatenating: $h_{t+1} = h_t a_{t+1} o_{t+1}$. Similarly, $\rho$ is a distribution on reward functions conditional on $h_m$, so the value of $R$ conditional on $\rho$ and $h_m$ is… $\rho(h_m)$.

There are three non-trivial nodes: $\pi$, $\rho$, and $\mu$. The $\pi$ is presumably set by the programmer, who would also want to design a good $\rho$. The $\mu$ is the environment, which we’re not assuming is known by either the programmers or the AI. The AI will, however, have some prior $\xi$ over $\mu$.
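
Putting the pieces together, one run of the causal graph can be sketched as a loop: the policy picks actions, the environment returns observations, and at the end a reward function is sampled from $\rho(h_m)$ and applied to the complete history. (A sketch with my own function names, under the assumptions above.)

```python
import random
from typing import Callable, Dict, Tuple

History = Tuple[Tuple[str, str], ...]

def run_episode(policy: Callable[[History], str],
                environment: Callable[[History, str], str],
                rho: Callable[[History], Dict[str, float]],
                reward_functions: Dict[str, Callable[[History], float]],
                m: int) -> float:
    """Generate h_m from pi and mu, sample R from rho(h_m), return R(h_m)."""
    h: History = ()
    for _ in range(m):
        a = policy(h)          # a_{t+1} ~ pi(h_t)
        o = environment(h, a)  # o_{t+1} ~ mu(h_t, a_{t+1})
        h = h + ((a, o),)      # h_{t+1} = h_t a_{t+1} o_{t+1}
    dist = rho(h)              # rho(h_m): distribution over reward functions
    name = random.choices(list(dist), weights=list(dist.values()))[0]
    return reward_functions[name](h)
```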

4 Conditions for a good learning process

So, now that we have the formalism, how can we select a “good” (or “true” or “proper”) learning process?

There are two plausible conditions: the syntactic condition of unriggability, and the semantic condition of uninfluenceability.

4.1 Unriggability

Unriggability comes from the idea of conservation of expected evidence. Basically, if $\rho$ defines a true learning process, we’d want it to conserve expected evidence.

So, let’s look at a version of the “look at the coin” situation. For example, let $R_H$ be a reward function defined by a coin being heads, and $R_T$ by it being tails. On the first turn, the coin has been tossed, but not seen. The AI can take an action $a_{look}$ to look, or $a_{\neg look}$ to not look.

This has conservation of expected evidence. If the AI chooses to look:

  • $P(R = R_H \mid a_{look}, o_{heads}) = 1$; this happens with probability $1/2$.

  • $P(R = R_H \mid a_{look}, o_{tails}) = 0$; this happens with probability $1/2$.

This shows that, if the AI chooses to look, it will become certain about which reward function it must follow—but it’s 50-50 about which reward function it will become certain about.

On the other hand, if the AI chooses not to look:

  • $P(R = R_H \mid a_{\neg look}) = 1/2$.

  • $P(R = R_T \mid a_{\neg look}) = 1/2$.

This shows that, if the AI chooses not to look, it will, with certainty, remain 50-50 uncertain about which reward function it must follow.

In expectation this is the same thing as if it chooses to look. So, the AI cannot change the expectation of its reward function[6], but it might be able to change the variance. The future variance given $a_{look}$ is $0$ (it’s absolutely certain about its reward function) while the future variance given $a_{\neg look}$ is as large as possible (maximal uncertainty).

On the other hand, imagine there are two other actions, $a_{setH}$ and $a_{setT}$, which involve the AI setting the coin to heads or tails, rather than looking at what it fell on. If the AI takes $a_{setH}$ or $a_{setT}$, then

  • $P(R = R_H \mid a_{setH}) = 1$.

  • $P(R = R_H \mid a_{setT}) = 0$.

This violates conservation of expected evidence, and, more importantly, it’s the kind of behaviour we want to avoid: the AI setting its own reward. So one reasonable condition for $\rho$ would be:

  • The reward-function learning process $\rho$ is unriggable if it respects conservation of expected evidence.
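
Concretely, we can check this condition on the coin example by computing, for each available action, the expected posterior over reward functions and comparing it to the prior (a minimal sketch; the action and observation names are invented):

```python
# Check conservation of expected evidence on the coin example. Each action
# induces a distribution over what the AI observes; rho then gives a
# distribution over {R_H, R_T}.

ACTIONS = {
    "a_look":    {"heads": 0.5, "tails": 0.5},  # see which way the coin fell
    "a_notlook": {"nothing": 1.0},
    "a_setH":    {"heads": 1.0},                # set the coin to heads
    "a_setT":    {"tails": 1.0},                # set the coin to tails
}

def rho(observation):
    if observation == "heads":
        return {"R_H": 1.0, "R_T": 0.0}
    if observation == "tails":
        return {"R_H": 0.0, "R_T": 1.0}
    return {"R_H": 0.5, "R_T": 0.5}             # saw nothing: stay at the prior

prior = {"R_H": 0.5, "R_T": 0.5}

for action, obs_dist in ACTIONS.items():
    expected = {r: sum(p * rho(o)[r] for o, p in obs_dist.items()) for r in prior}
    conserved = all(abs(expected[r] - prior[r]) < 1e-9 for r in prior)
    print(action, expected, "conserved" if conserved else "violated")
# a_look and a_notlook conserve expected evidence; a_setH and a_setT violate it,
# so a rho that lets the AI set the coin is riggable.
```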

4.2 Uninfluenceability

Another thing that we’d want is that preference learning should be like factual learning; ie it should depend on facts in the outside world.

In our setup, “depend on facts in the outside world” can be taken to mean “depends on the environment $\mu$”. This gives the following causal graph:

Here, the reward function $R$ is no longer (directly) a function of the history $h_m$, but instead is a function of $\mu$. The $\eta$ gives the conditional probability distribution over $\mathcal{R}$, given $\mu$.

The connection between $\eta$ and $\rho$ is as follows: given a prior $\xi$ over $\mu$, the AI can use a history $h_m$ to update this prior to a posterior over environments. Then $\eta$ allows it to make this into a posterior over reward functions. Thus, given $\xi$ and $\eta$, the AI has a probability distribution over $\mathcal{R}$ conditional on $h_m$; this is the $\rho$.

Thus we define uninfluenceability:

  • The reward-function learning process $\rho$ is uninfluenceable if it derives (via the AI’s prior $\xi$) from a reward-function distribution $\eta$, conditional on the environment.
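
Here is a minimal sketch of that derivation for the guard example ($\rho(h)$ averages $\eta(\cdot \mid \mu)$ over environments, weighted by the posterior that the prior $\xi$ assigns to each $\mu$ given $h$); the two toy environments and the likelihood model are invented for illustration:

```python
# Sketch: deriving rho from a prior xi over environments and a conditional
# distribution eta over reward functions given the environment.

xi = {"mu_cake": 0.5, "mu_death": 0.5}            # prior over environments
eta = {"mu_cake":  {"R_C": 1.0, "R_D": 0.0},      # eta(R | mu)
       "mu_death": {"R_C": 0.0, "R_D": 1.0}}

def likelihood(history, mu):
    """P(observations in history | mu): here, the truth-teller's answer, if it
    appears in the history, identifies the environment exactly."""
    if ("truth_teller_says", "cake") in history:
        return 1.0 if mu == "mu_cake" else 0.0
    if ("truth_teller_says", "death") in history:
        return 1.0 if mu == "mu_death" else 0.0
    return 1.0  # nothing informative observed yet

def rho(history):
    """rho(h) = sum over mu of eta(R | mu) * xi(mu | h)."""
    weights = {mu: xi[mu] * likelihood(history, mu) for mu in xi}
    total = sum(weights.values())
    posterior = {mu: w / total for mu, w in weights.items()}
    return {r: sum(posterior[mu] * eta[mu][r] for mu in xi) for r in ("R_C", "R_D")}

print(rho(()))                                    # {'R_C': 0.5, 'R_D': 0.5}
print(rho((("truth_teller_says", "cake"),)))      # {'R_C': 1.0, 'R_D': 0.0}
```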

5 Results

Then our paper proves the following results:

  • Every uninfluenceable preference learning process is unriggable.

  • Every unriggable preference learning process is uninfluenceable, if the set of possible environments is large enough (though this may need to include “impossible” environments).

  • If a preference learning process is unriggable, then it can be unambiguously defined over partial histories $h_t$, for $t < m$, rather than just for complete histories $h_m$.

  • Every riggable preference learning process is manipulable in the following sense: there is always a relabelling of the reward functions, such that the AI’s optimal policy is to sacrifice, with certainty, reward for every possible reward function.

  • We can use a “counterfactual” approach to make a riggable learning process into an uninfluenceable learning process. This is akin to “what the truth-telling guard would have told you had you asked them immediately”.


  1. Let’s ignore that, in reality, no probability is truly $1$ (or $0$). ↩︎

  2. I don’t want to get into the moral realism debate, but it seems that moral realists and I differ mainly in emphasis: I say “without making assumptions, we can’t figure out preferences, so we need to find good assumptions”, while they say “having made these good (obvious) assumptions, we can figure out preferences”. ↩︎

  3. There are ways of making this work with $m = \infty$, but that extra complexity is not needed for this exposition. ↩︎

  4. This is the most general reward function format; if, for example, we had a Markovian reward function $r$ that just took the latest actions and observations as inputs, this defines an $R$ such that $R(h_m) = \sum_{t=1}^m r(a_t, o_t)$. ↩︎

  5. A terminological note here. We’ve decided to describe $\rho$ as a learning process, with “unriggable learning process” and “uninfluenceable learning process” being the terms if $\rho$ has additional properties. But the general $\rho$ includes things we might not feel are anything like “learning”, like the AI writing down its own reward function.

    So it might make more sense to reserve “learning” for the unriggable processes, and call the general $\rho$ something else. But this is a judgement call, and people generally consider “ask your programmer” or “look at the coin” to be learning processes, which are very much riggable. So I’ve decided to call the general $\rho$ a learning process. ↩︎

  6. This “expectation” can be made fully rigorous, since reward functions form an affine space: you can take weighted averages of reward functions, $q R_1 + (1-q) R_2$. ↩︎