Full toy model for preference learning

1. Toy model for greater insight

This post will present a simple agent with contradictory and underdefined “preferences”, and will apply the procedure detailed in this research project to capture all these preferences (and meta-preferences) in a single utility function[1].

The toy model will start in a simpler form, and gradually add more details, aiming to capture almost all of sections 2 and 3 of the research agenda in a single place.

The purposes of this toy model are:

  1. To illustrate the research agenda in ways that are easier to grasp.

  2. To figure out where the research agenda works well and where it doesn’t (for example, working through this example suggests to me that there may be a cleaner way of combining partial preferences than the one detailed here).

  3. To allow other people to criticise the toy model (and, through it, the research agenda) in useful ways.

  4. And, thus, ultimately to improve the whole research agenda.

So, the best critiques are ones that increase the clarity of the toy model, those that object to specific parts of it (especially if alternatives are suggested), and those that make it more realistic to apply to actual humans.

1.1 How much to read this post

On a first pass, I’d recommend reading up to the end of Section 3, which synthesises all the base-level preferences. Then Sections 4, 5, and 6 deal with more advanced aspects, such as meta-preferences and identity preferences. Finally, Sections 7 and 8 are more speculative and underdefined, and deal with situations where the definitions become more ambiguous.

2. The basic setting and the basic agent

Ideally, the toy example would start with some actual agent that exists in the machine learning literature: an agent which uses multiple overlapping and slightly contradictory models. However, I haven’t been able to find a good example of such an agent, especially not one that is sufficiently transparent and interpretable for this toy example.

So I’ll start by giving the setting and the agent myself; hopefully, this will get replaced by an actual ML agent in future iterations, but it serves as a good starting point.

2.1 The blegg and rube classifier

The agent is a robot that is tasked with sorting bleggs and rubes.

Time is divided into discrete timesteps, of which there will be a thousand in total. The robot is in a room, with a conveyor belt and four bins. Bleggs (blue eggs) and rubes (red cubes) periodically appear on the conveyor belt. The conveyor belt moves right by one square every timestep, moving these objects with it (and taking them out of the room forever if they’re at the right end of the conveyor belt).

The possible actions of the robot are as follows: move in the four directions around it[2], pick up a small object (eg a blegg or a rube) from any of the four directions around it (it can only do so if it isn’t currently carrying anything), or drop a small object in any of the four directions around it. Only one small object can occupy any given floor square; if the robot enters an occupied square, it crushes any object in it. Note that the details of these dynamics shouldn’t be that important, but I’m writing them down for full clarity.
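To make these dynamics a little more concrete, here is a minimal Python sketch of the action space just described. Everything in it (the names `ActionType`, `Direction`, `legal_actions`, the encoding of actions as tuples) is my own illustrative choice, not part of the research agenda or of any existing codebase.

```python
from enum import Enum

class Direction(Enum):
    NORTH, EAST, SOUTH, WEST = range(4)

class ActionType(Enum):
    MOVE = 0      # entering an occupied square crushes the object in it
    PICK_UP = 1   # only possible if the robot is not already carrying something
    DROP = 2      # place the carried object on an adjacent square

# An action is a (type, direction) pair, e.g. (ActionType.MOVE, Direction.EAST).
Action = tuple[ActionType, Direction]

def legal_actions(carrying: bool) -> list[Action]:
    """Syntactically available actions; the environment itself would still
    check walls, occupied squares, and so on."""
    types = [ActionType.MOVE, ActionType.DROP] if carrying else [ActionType.MOVE, ActionType.PICK_UP]
    return [(t, d) for t in types for d in Direction]
```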

2.2 The partial preferences

The robot has many partial models appropriate for many circumstances. I am imagining it as an intelligent agent that was sloppily trained (or evolved) for blegg and rube classification.

As a consequence of this training, it has many internal models. Here we will look at the partial models/partial preferences where everything is the same, except for the placement of a single rube. Because we are using one-step hypotheticals, the robot is asked about its preferences in many different phrasings; it doesn’t always give the same weight for the same preference every time it’s asked, so each partial preference comes with a mean weight and a standard deviation.

Here are the rube-relevant preferences with their weights. A rube...

  • : …in the rube bin is better than in the blegg bin: mean weight (std ).

  • : …in the rube bin is better than on a floor tile: mean weight (std ).

  • : …on any floor tile is better than in the blegg bin: mean weight (std ).

  • : …on a floor tile is better than on the conveyor belt: mean weight (std ).

This may seem like too many details for a toy model, but most of the subtleties in value learning only become apparent when there are sufficiently many different base-level values.

We assume all these symbols and terms are grounded (though see later in this post for some relaxation of this assumption).

Most other rube partial preferences are composites of these. So, for example, if the robot is asked to compare a rube in a rube bin with a rube on the conveyor belt, it decomposes this into “rube bin vs floor tile” and “floor tile vs conveyor belt”.

The formulas for the blegg preferences ( through ) are the same, except with the blegg bin and rube bin inverted, and with a lower weight for each of these preferences—the robot just doesn’t feel as strongly about bleggs.

There is another preference, which applies to both rubes and bleggs on the conveyor belt:

  • : For any object on the conveyor belt, moving one step to the right is bad (including out of the room): mean weight , std .

2.3 Floor: one spot is (almost) as good as another

There is one last set of preferences: when the robot is asked whether it prefers a rube (or a blegg) on one spot of the floor rather than another. Let be the set of all floor spaces. Then the agent’s preferences can be captured by an anti-symmetric function .

Then if asked its preference between a rube (or blegg) on floor space versus it being on floor space , the agent will prefer over iff (equivalently, iff ); the weight of this preference, , is .
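To illustrate what such an anti-symmetric function might look like (the actual function and its tiny weight did not survive formatting here), the following is a minimal sketch, assuming the function is a difference of per-square scores. The room layout, the score function `g`, and the weight value are all placeholders of my own.

```python
import itertools

FLOOR_SPACES = [(x, y) for x in range(4) for y in range(4)]  # placeholder room layout

def g(space):
    """Arbitrary per-square score; only differences between squares matter."""
    x, y = space
    return 0.010 * x + 0.003 * y  # placeholder numbers

def floor_pref(a, b):
    """Anti-symmetric by construction: floor_pref(a, b) == -floor_pref(b, a).
    A positive value means the object is (very slightly) preferred on a over b."""
    return g(a) - g(b)

EPSILON_WEIGHT = 1e-3  # stand-in for the very low weight of these comparisons

# The agent prefers space a to space b whenever floor_pref(a, b) > 0:
for a, b in itertools.combinations(FLOOR_SPACES[:3], 2):
    print(a, b, floor_pref(a, b) > 0)
```

Any function of the form g(a) − g(b) is anti-symmetric, though not every anti-symmetric function has this form; the toy model only needs the anti-symmetry.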

3. Synthesising the utility function

3.1 Machine learning

When constructing the agent’s utility function, we first have to collect similar comparisons together. This was already done implicitly when talking about the different “phrasings” of hypotheticals comparing, eg, the blegg bin with the rube bin. So the machine learning collects together all such comparisons, which gives us the “mean weight” and the standard deviation listed above.

The other role of machine learning could be to establish that one floor space is pretty much the same as another, and thus that the function that differentiates them (of very low weight ) can be treated as irrelevant.

This is the section of the toy example that could benefit most from a real ML agent as part of this example. Because in this section, I’m essentially introducing noise and “unimportant” differences in hypothetical phrasings, and claiming “yes, the machine learning algorithm will correctly label these as noise and unimportant”.

The reason for including these artificial examples, though, is to suggest the role I expect ML to perform in more realistic examples.

3.2 The first synthesis

Given the assumptions of the previous subsection, and the partial preferences listed above, the energy-minimising formula of this post gives a utility function that rewards the following actions (normalised so that placing small objects on the floor/picking up small objects give zero reward):

  • Putting a rube in the rube bin: .

  • Putting a rube in the blegg bin: .

  • Putting a blegg in the rube bin: .

  • Putting a blegg in the blegg bin: .

Those are the weighted values, derived from and (note that these use the mean weights, not—yet—the standard deviations). For simplicity of notation, let be the utility/reward function that gives for a rube in the rube bin and for a rube in the blegg bin, and the opposite function for a blegg.

Then, so far, the agent’s utility is roughly . Note that is between the implied by and the implied by (and similarly for , and the implied by and the implied by ).
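For readers who want something executable, here is a sketch of one way an energy-minimising synthesis can work: treat each partial preference “x is better than y, with weight w” as a soft constraint that U(x) − U(y) should be close to w, and minimise the weighted squared error. This is my own least-squares stand-in for the formula in the linked post, and every weight below is made up, so only the shape of the computation should be taken seriously.

```python
import numpy as np

# Outcomes for a single rube (placeholder labels).
OUTCOMES = ["rube_bin", "blegg_bin", "floor", "belt"]

# Partial preferences as (better, worse, weight). These weights are
# ILLUSTRATIVE ONLY -- the real mean weights did not survive into this write-up.
PARTIAL_PREFS = [
    ("rube_bin", "blegg_bin", 3.0),
    ("rube_bin", "floor", 2.0),
    ("floor", "blegg_bin", 1.0),
    ("floor", "belt", 0.5),
]

def synthesise(outcomes, prefs):
    """Least-squares 'energy minimisation': find U so that, for each partial
    preference, U(better) - U(worse) is as close as possible to its weight."""
    index = {o: i for i, o in enumerate(outcomes)}
    A = np.zeros((len(prefs) + 1, len(outcomes)))
    b = np.zeros(len(prefs) + 1)
    for row, (better, worse, weight) in enumerate(prefs):
        A[row, index[better]] = 1.0
        A[row, index[worse]] = -1.0
        b[row] = weight
    # Extra row: pin U("floor") = 0, matching the convention in the text that
    # placing objects on the floor gives zero reward.
    A[-1, index["floor"]] = 1.0
    U, *_ = np.linalg.lstsq(A, b, rcond=None)
    return dict(zip(outcomes, U.round(3)))

print(synthesise(OUTCOMES, PARTIAL_PREFS))
```

Because contradictory constraints get reconciled by the least-squares fit, the synthesised value for, say, the blegg bin ends up between the values implied by the individual partial preferences, which is the behaviour described above.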

But there are also the preferences concerning the conveyor belt and the placement of objects there: , and . Applying the energy-minimising formula to and gives the following set of values:

  • The utility of a world where a rube is on the conveyor belt is given by the following values, for the positions on the belt from left to right: , , , and .

  • When a rube leaves the room via the conveyor belt, utility changes by that last value minus , ie by .

Now, what about ? Well, the weight of is , and the weight of is less than that. Since weights cannot be negative, the weight of is , or, more simply, does not exist as a partial preference. So the corresponding preferences for bleggs are much simpler:

  • The utility of a world where a blegg is on the conveyor belt is given by the following values, for the positions on the belt from left to right: , , , and .

  • When a blegg leaves the room via the conveyor belt, utility changes by that last value minus , ie by .

Hum, is something odd going on there? Yes, indeed. Common sense tells us that these values should go , , , , and end with when going off the conveyor belt, rather than ever being positive. But the problem is that the agent has no relative preference between bleggs on the conveyor belt and bleggs on the floor (or anywhere else) - doesn’t exist. So the categories of “blegg on the conveyor belt” and “blegg anywhere else” are not comparable. When this happens, we normalise the average of each category to the same value (here, ). Hence the very odd values for a blegg on the conveyor belt.
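As a concrete illustration of that normalisation step, here is a small sketch: two mutually incomparable blocks of outcomes are each shifted so that their averages coincide at whatever common value is chosen (zero below), leaving within-block differences intact. The category labels and numbers are placeholders, not the toy model’s actual values.

```python
import numpy as np

def normalise_categories(categories, target=0.0):
    """Shift each incomparable category of utilities so its average equals the
    same target value, preserving the differences within each category."""
    return [list(np.array(values) - np.mean(values) + target) for values in categories]

# Placeholder example: blegg-on-belt utilities vs. blegg-elsewhere utilities,
# which the agent cannot compare directly.
belt_values = [0.0, -1.0, -2.0, -3.0]
elsewhere_values = [0.0, 2.0]
print(normalise_categories([belt_values, elsewhere_values]))
```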

Note that, most of the time, only the penalties for leaving the room matter ( for a rube and for a blegg). That’s because, while the objects are in the room, the agent can lift them off the conveyor belt, and thus remove the utility penalty for being on it. Only at the end of the episode, after the thousand timesteps, will the remaining bleggs or rubes on the conveyor belt matter.

Collect all these conveyor belt utilities together as . Think the values of are ugly and unjustified? As we’ll see, so does the agent itself.

But, for the moment, all this data defines the first utility function as

  • .

Feel free to stop reading here if the synthesis of base-level preferences is what you’re interested in.

4. Enter the meta-preferences

We’ll assume that the robot has the following two meta-preferences.

  • : The preferences concerning the rubes and the bleggs should be symmetric versions of each other (weight ).

  • : Only the preferences involving the bins should matter (weight ).

Now, will wipe out the term entirely. This may seem surprising, as the term includes rewards that vary between and , while has a weight of only . But recall that the effect of meta-preferences is to change the weights of lower-level preferences. And there are only two preferences that don’t involve bins: and , of weight and , respectively. So will wipe them both out, removing from the utilities.

Now, attempts to make the rube and blegg preferences symmetric, moving each preference by in the direction of the symmetric version. This sets the weight of to , and to , while , and respectively go to weights , , and .

So the base-level preferences are almost symmetric, but not quite (they would be symmetric if the weight of were or higher). Re-running the energy minimisation with these values gives:

  • .
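As a rough illustration of how these two weight adjustments could be computed, here is a sketch in Python. It encodes my reading of the description above: the bins-only meta-preference subtracts its weight from the preferences it targets (flooring at zero), and the symmetry meta-preference moves each weight of a rube/blegg pair toward the pair’s midpoint by at most its own weight. All names and numbers are placeholders, and the exact adjustment rules in the original may differ.

```python
def wipe_non_bin_prefs(weights, non_bin_prefs, meta_weight):
    """One reading of the bins-only meta-preference: subtract its weight from
    every preference that doesn't involve the bins, flooring at zero
    (weights cannot be negative, so weaker preferences vanish entirely)."""
    return {name: (max(0.0, w - meta_weight) if name in non_bin_prefs else w)
            for name, w in weights.items()}

def symmetrise_prefs(weights, rube_blegg_pairs, meta_weight):
    """One reading of the symmetry meta-preference: move each weight in a
    rube/blegg pair toward the pair's midpoint by at most meta_weight, so a
    large enough meta_weight makes the pair exactly symmetric."""
    new = dict(weights)
    for rube_pref, blegg_pref in rube_blegg_pairs:
        mid = (weights[rube_pref] + weights[blegg_pref]) / 2.0
        for name in (rube_pref, blegg_pref):
            gap = mid - weights[name]
            new[name] = weights[name] + max(-meta_weight, min(meta_weight, gap))
    return new
```

With the adjusted weights in hand, one would then re-run the same energy-minimising synthesis as before to obtain the new utility function.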

5. Enter the synthesis meta-preferences

Could the robot have any meta-preferences over the synthesis process? There is one simple possibility: the process so far has not used the standard deviations of the weights. We could argue that partial preferences where the weight has high variance should be penalised for the uncertainty. One plausible way of doing so would be:

  • : before combining preferences, the weight of each is reduced by its standard deviation.

This (combined with the other, standard meta-preferences) reduces the weights of the partial preferences to:

  • .

These weights generate .
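That penalty is simple enough to state in code. A minimal sketch, with made-up (mean, standard deviation) pairs, since the real values did not survive into this write-up:

```python
def penalise_uncertain_weights(prefs):
    """Reduce each partial preference's weight by the standard deviation of
    its elicited weights, flooring at zero (weights cannot be negative)."""
    return {name: max(0.0, mean - std) for name, (mean, std) in prefs.items()}

# Placeholder values only.
example = {"rube_in_rube_bin_vs_blegg_bin": (3.0, 0.5),
           "floor_vs_conveyor_belt": (0.5, 0.7)}
print(penalise_uncertain_weights(example))
# -> the second, high-variance preference is wiped out entirely
```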

Then if is the weight of and the default weight of the standard synthesis process, the utility function becomes

  • .

Note the critical role of in this utility definition. For standard meta-preferences, we only care about their weights relative to each other (and relative to the weights of standard preferences). However, for synthesis meta-preferences, the default weight of the standard synthesis is also relevant.

6. Identity preferences

We can add identity preferences too. Suppose that there were two robots in the same room, both classifying these objects. Their utility functions for individually putting the objects in the bins are and , respectively.

If asked ahead of time, the robot would assign equal weight to itself putting the rubes/bleggs in the bins as to the other robot doing so[3]. Thus, based on its estimate at time , the correct utility is:

  • .

However, when asked mid-episode about which agent should put the objects in the bins, the robot’s answer is much more complicated, and depends on how many each robot has put in the bins already. The robot wants to generate at least as much utility as its counterpart. Until it does that, it really prioritises boosting its own utility (); after reaching or surpassing the other utility (), it doesn’t matter which utility is boosted.

If that result were strict, then the agent’s actual utility would be the minimum of and . Let’s assume the data is not quite so strict, and that when we try and collapse these various hypotheticals, we get a smoothed version of the minimum, for some :

  • ,

with and .
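The exact smoothing used here did not survive formatting, but a standard choice with the right qualitative behaviour is a log-sum-exp “softmin”; the sketch below is only that standard stand-in, with a smoothing parameter k of my own choosing.

```python
import math

def soft_min(u_self, u_other, k=1.0):
    """Log-sum-exp smoothed minimum: always within log(2)/k below the true
    minimum of the two utilities, and approaching it as k grows.
    This is one plausible smoothing, not necessarily the one intended above."""
    return -math.log(math.exp(-k * u_self) + math.exp(-k * u_other)) / k
```

When the two utilities are far apart this behaves almost exactly like the smaller one, matching the description that the robot prioritises catching up with its counterpart; once they are close, either utility contributes roughly equally.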

So, note that identity preferences are of the same type as standard preferences[4]; we just expect them to be more complex to define, being non-linear.

6.1 Correcting ignorance

So, which version of the identity preferences is correct - or ? In the absence of meta-preferences, we just need a default process—should we prioritise the current estimate of partial preferences, or the expected future estimates?

If the robot has meta-preferences, this allows the synthesis process to correct for the robot’s ignorance. It may have a strong meta-preference for using expected future estimates, and it may currently believe that is what would result from these. The synthesis process, however, knows better, and picks .

Feel free to stop reading here for standard preferences and meta-preferences; the next two sections deal with situations where the basic definitions become ambiguous.

7. Breaking the bounds of the toy problem

7.1 Purple cubes and eggs

Assume the robot has a utility of the type . And, after a while, the robot sees some purple shapes enter, both cubes and eggs.

Obviously a purple cube is closer to a rube than to a blegg, but it is neither. There seem to be three obvious ways of extending the definition:

  • - same as before: the purple objects are ignored. This needs some sort of sharp boundary between a slightly-purple rube and a slightly-red purple cube.

  • . Here is the same as , but for a purple cube rather than a rube (and is the same as , but for a purple egg rather than a blegg). This definition can be extended to other ambiguous objects: if anything is rube, then it is treated as a rube with reward scaled by .

  • , where involves putting a purple cube in the second bin, and involves putting a purple egg in the third bin. In this example, the two intermediate bins are used to interpolate between rube and blegg; the task of the robot is to sort out which of the four bins each object best fits into.

Which of , , and is a better extrapolation of ? Well, this will depend on how the machine learning extrapolates concepts, or on default choices we make, or on the meta-preferences of the agent.

7.2 Web of connotation: purple icosahedron

Now imagine that a purple icosahedron enters the room, and the robot is running either or (and thus doesn’t simply ignore purple objects).

How should it treat this object? The colour is exactly in between red and blue, while the shape can be seen as close to a sphere (an icosahedron is pretty round) or to a cube (it has sharp edges and flat faces).

In fact, the web of connotations of rubes and bleggs looks like this:

[Figure: the web of connotations for rubes and bleggs.]

The robot’s strongest connotation for the rube/blegg is its colour, which doesn’t help here. But the next strongest is the sharpness of the edges (maybe the robot has sensitive fingers). Thus the purple icosahedron is seen as closer to a rube. But this is purely a consequence of how the robot does symbol grounding; another robot could see it more as a blegg.

8. Avoiding ambiguous distant situations

Suddenly, strange objects start to appear on the conveyor belt. All sorts of shapes and sizes, colours, and textures; some change shape as they move, some are alive, some float in the air, some radiate glorious light. None of these looks remotely like a blegg or a rube. This is, in the terminology of this post, an ambiguous distant situation.

There is also a button the agent can press; if it does so, strange objects no longer appear. Pressing it won’t net any more bleggs or rubes; how much might the robot value pressing that button?

Well, if the robot’s utility is of the type , then it doesn’t value pressing the button at all. For it can just ignore all the strange objects, and get precisely the same utility as pressing the button would give it.

But now suppose the agent’s utility is of the type , where is the conveyor-belt utility described above. Now a conservative agent might prefer to press the button. It gets for any rube that travels through the room on the conveyor belt (and for any blegg that does so). It doesn’t know whether any of the strange objects count as rubes; but it isn’t sure that they don’t, either. If an object enters that ranks as rube and blegg (note that need not be ; consider a rube glued to a blegg), then it might lose if the object exits the room. But it might lose or if it puts the object in the wrong bin.

Given conservatism assumptions—that potential losses in ambiguous situations rank higher than potential gains—pressing the button to avoid that object might be worth up to to the robot[5].


  1. For this example, there’s no need to distinguish reward functions from utility functions. ↩︎

  2. North, East, South, or West from its current position. ↩︎

  3. This is based on the assumption that the weights in the definitions and are the same, so we don’t need to worry about the relative weights of those utilities. ↩︎

  4. If we wanted to determine the weights at a particular situation: let be the standard weight for comparing an object in a bin with an object somewhere else (same for and for ). Then, for the purpose of the initial robot moving the object to the bin, the weight of that comparison for given values of and is , while for the other robot moving the object to the bin, the weight of that comparison is . ↩︎

  5. That’s the formula if the robot is ultra-fast or the objects on the conveyor belt are sparse. The true formula is more complicated, because it depends on how easy it is for the robot to move the object to the bins without missing out on other objects. But, in any case, pressing the button cannot be worth more than , the maximal penalty the robot gets if it simply ignores the object. ↩︎