# Full toy model for preference learning

# 1. Toy model for greater insight

This post will present a simple agent with contradictory and underdefined “preferences”, and will apply the procedure detailed in this research project to capture all these preferences (and meta-preferences) in a single utility function^{[1]}.

The toy model will start in a simpler form, and gradually add more details, aiming to capture almost all of sections 2 and 3 of the research agenda in a single place.

The purposes of this toy model are:

- To illustrate the research agenda in ways that are easier to grasp.

- To figure out where the research agenda works well and where it doesn’t (for example, working through this example suggests to me that there may be a cleaner way of combining partial preferences than the one detailed here).

- To allow other people to criticise the toy model (and through it, the research agenda) in useful ways.

- And thus, ultimately, to improve the whole research agenda.

So, the best critiques are ones that increase the clarity of the toy model, those that object to specific parts of it (especially if alternatives are suggested), and those that make it more realistic for applying to actual humans.

## 1.1 How much of this post to read

On a first pass, I’d recommend reading up to the end of Section 3, which synthesises all the base level preferences. Then Sections 4, 5, and 6 deal with more advanced aspects, such as meta-preferences and identity preferences. Finally, Sections 7 and 8 are more speculative and underdefined, and deal with situations where the definitions become more ambiguous.

# 2. The basic setting and the basic agent

Ideally, the toy example would start with some actual agent that exists in the machine learning literature; an agent which uses multiple overlapping and slightly contradictory models. However, I haven’t been able to find a good example of such an agent, especially not one that is sufficiently transparent and interpretable for this toy example.

So I’ll start by giving the setting and the agent myself; hopefully, this will get replaced by another ML agent in future iterations, but it serves as a good starting point.

## 2.1 The blegg and rube classifier

The agent is a robot that is tasked with sorting bleggs and rubes.

Time is divided into discrete timesteps, of which there will be a thousand in total. The robot is in a room, with a conveyor belt and four bins. Bleggs (blue eggs) and rubes (red cubes) periodically appear on the conveyor belt. The conveyor belt moves right by one square every timestep, moving these objects with it (and taking them out of the room forever if they’re on the right of the conveyor belt).

The possible actions of the robot are as follows: move in the four directions around it^{[2]}, pick up any small object (eg a blegg or a rube) from the four directions around it (it can only do so if it isn’t currently carrying anything), or drop a small object in any of the four directions around it. Only one small object can occupy any given floor square; if the robot enters an occupied square, it crushes any object in it. Note that the details of these dynamics shouldn’t be that important, but I’m writing them down for full clarity.
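The belt dynamics can be sketched in a few lines. This is only an illustration: the `step_belt` helper, the list-based belt representation, and the four-square belt are assumptions made for the example, not details fixed by the description above.

```python
from typing import List, Optional, Tuple

Belt = List[Optional[str]]  # one entry per belt square, left to right


def step_belt(belt: Belt, incoming: Optional[str] = None) -> Tuple[Belt, Optional[str]]:
    """Advance the conveyor belt by one timestep.

    Every object moves one square to the right; whatever sat on the
    rightmost square leaves the room and is returned separately.
    """
    exited = belt[-1]
    new_belt = [incoming] + belt[:-1]
    return new_belt, exited


# Hypothetical four-square belt carrying a rube and a blegg.
belt, exited = step_belt([None, "rube", None, "blegg"], incoming="blegg")
print(belt)    # ['blegg', None, 'rube', None]
print(exited)  # blegg
```

An object the robot lifts off the belt would simply be replaced by `None` in the list before the next step.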

## 2.2 The partial preferences

The robot has many partial models appropriate for many circumstances. I am imagining it as an intelligent agent that was sloppily trained (or evolved) for blegg and rube classification.

As a consequence of this training, it has many internal models. Here we will look at the partial models/partial preferences where everything is the same, except for the placement of a single rube. Because we are using one-step hypotheticals, the robot is asked its preferences in many different phrasings; it doesn’t always give the same weight for the same preference every time it’s asked, so each partial preference comes with a mean weight and a standard deviation.

Here are the rube-relevant preferences with their weights. A rube...

: …in the rube bin is better than in the blegg bin: mean weight (std ).

: …in the rube bin is better than on a floor tile: mean weight (std ).

: …on any floor tile is better than in the blegg bin: mean weight (std ).

: …on a floor tile is better than on conveyor belt: mean weight (std ).

This may seem to have too many details for a toy model, but most of the subtleties in value learning only become apparent when there are sufficiently many different base level values.

We assume all these symbols and terms are grounded (though see later in this post for some relaxation of this assumption).

Most other rube partial preferences are composites of these. So, for example, if the robot is asked to compare a rube in a rube bin with a rube on the conveyor belt, it decomposes this into “rube bin vs floor tile” and “floor tile vs conveyor belt”.
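One way to picture this decomposition is as a lookup over a small table of base preferences. The sketch below is illustrative only: the `PartialPreference` class, the location names, and all numeric weights are hypothetical placeholders, not the values used in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PartialPreference:
    better: str         # location the robot prefers for the rube
    worse: str          # location it disprefers
    mean_weight: float  # hypothetical value
    std: float          # hypothetical value


# Placeholder weights, chosen only to make the example run.
BASE = [
    PartialPreference("rube_bin", "blegg_bin", 1.0, 0.2),
    PartialPreference("rube_bin", "floor", 0.6, 0.1),
    PartialPreference("floor", "blegg_bin", 0.1, 0.05),
    PartialPreference("floor", "belt", 0.3, 0.1),
]


def decompose(better: str, worse: str) -> List[PartialPreference]:
    """Return the base preferences that link `better` to `worse`.

    A direct base preference is used if one exists; otherwise the
    comparison is split through one intermediate location.
    """
    for p in BASE:
        if (p.better, p.worse) == (better, worse):
            return [p]
    for first in BASE:
        if first.better != better:
            continue
        for second in BASE:
            if second.better == first.worse and second.worse == worse:
                return [first, second]
    return []  # not comparable via the base preferences


chain = decompose("rube_bin", "belt")
print([(p.better, p.worse) for p in chain])
# [('rube_bin', 'floor'), ('floor', 'belt')]
```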

The formulas for the blegg preferences ( through ) are the same, except with the blegg bin and rube bin inverted, and the robot has to each of these weights: the robot just doesn’t feel as strongly about bleggs.

There is another preference, that applies to both rubes and bleggs on the conveyor belt:

: For any object on the conveyor belt, moving one step to the right is bad (including out of the room): mean weight , std .

## 2.3 Floor: one spot is (almost) as good as another

There is one last set of preferences: when the robot is asked whether it prefers a rube (or a blegg) on one spot of the floor rather than another. Let be the set of all floor spaces. Then the agent’s preferences can be captured by an anti-symmetric function .

Then if asked its preference between a rube (or blegg) on floor space versus it being on floor space , the agent will prefer over iff (equivalently, iff ); the weight of this preference, , is .
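To make the anti-symmetry concrete, here is a minimal sketch. It assumes a hypothetical per-square `score` and builds the anti-symmetric function as a difference of scores, which is just one convenient family of such functions. Taking the weight to be the magnitude of the function is also an assumption made for illustration.

```python
from typing import Tuple

Square = Tuple[int, int]  # (x, y) coordinates of a floor square


def score(square: Square) -> float:
    """Hypothetical tiny per-square score; not taken from the post."""
    x, y = square
    return 0.001 * (x + 2 * y)


def f(a: Square, b: Square) -> float:
    """Anti-symmetric by construction: f(a, b) == -f(b, a)."""
    return score(a) - score(b)


def prefers(a: Square, b: Square) -> bool:
    """The agent prefers the object on `a` over `b` iff f(a, b) > 0."""
    return f(a, b) > 0


def weight(a: Square, b: Square) -> float:
    """Assumed weight of the preference: the magnitude of f."""
    return abs(f(a, b))


print(prefers((1, 0), (0, 0)), prefers((0, 0), (1, 0)))  # True False
```

Because the scores are tiny, every such preference has a very low weight, which is the sense in which one floor spot is almost as good as another.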

# 3. Synthesising the utility function

## 3.1 Machine learning

When constructing the agent’s utility function, we first have to collect similar comparisons together. This was already done implicitly when talking about the different “phrasings” of hypotheticals comparing, eg, the blegg bin with the rube bin. So the machine learning collects together all such comparisons, which gives us the “mean weight” and the standard deviation in the table above.
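As a rough illustration of that collection step, the sketch below groups the robot's answers by comparison and summarises each group. The answer values are invented for the example, and using the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

# Hypothetical weights the robot gave for differently phrased versions
# of the same comparison; none of these numbers come from the post.
answers: Dict[str, List[float]] = {
    "rube_bin > blegg_bin": [0.9, 1.1, 1.0, 1.2, 0.8],
    "rube_bin > floor": [0.5, 0.7, 0.6],
}


def summarise(weights: List[float]) -> Tuple[float, float]:
    """Collapse repeated answers into (mean weight, standard deviation)."""
    return mean(weights), stdev(weights)


for comparison, weights in answers.items():
    m, s = summarise(weights)
    print(f"{comparison}: mean weight {m:.2f} (std {s:.2f})")
```

In a more realistic setting, deciding which phrasings count as "the same comparison" is exactly the job being delegated to the machine learning; here the grouping is simply given by the dictionary keys.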

The other role of machine learning could be to establish that one floor space is pretty much the same as another, and thus that the function that differentiates them (of very low weight ) can be treated as irrelevant.

This is the section of the toy example that could benefit most from a real ML agent, because here I’m essentially introducing noise and “unimportant” differences in hypothetical phrasings, and claiming “yes, the machine learning algorithm will correctly label these as noise and unimportant”.

The reason for including these artificial examples, though, is to suggest the role I expect ML to perform in more realistic examples.

## 3.2 The first synthesis

Given the assumptions of the previous subsection, and the partial preferences listed above, the energy-minimising formula of this post gives a utility function that rewards the following actions (normalised so that placing small objects on the floor/picking up small objects gives zero reward):

Putting a rube in the rube bin: .

Putting a rube in the blegg bin: .

Putting a blegg in the rube bin: .

Putting a blegg in the blegg bin: .

Those are the weighted values, derived from and , (note that these use the mean weights, not—yet—the standard deviations). For simplicity of notation, let be the utility/reward function that gives for a rube in the rube bin and for a rube in the blegg bin, and the opposite function for a blegg.

Then, so far, the agent’s utility is roughly . Note that is between the implied by and the implied by (and similarly for , and the implied by and the implied by ).
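The "in between" behaviour can be reproduced with a simple least-squares reading of energy minimisation. To be clear, this is an assumed stand-in rather than the formula from the linked post: each weight is treated as a target utility gap, the floor is pinned at zero, and the sum of squared mismatches is minimised. The names `synthesise_bins`, `energy`, `w1`, `w2`, `w3` and the numeric weights are all hypothetical.

```python
def energy(a: float, c: float, w1: float, w2: float, w3: float) -> float:
    """Squared mismatch between utility gaps and the target weights.

    a = utility of "rube in rube bin", c = utility of "rube in blegg bin",
    with "rube on the floor" fixed at 0.
    """
    return (a - c - w1) ** 2 + (a - 0 - w2) ** 2 + (0 - c - w3) ** 2


def synthesise_bins(w1: float, w2: float, w3: float):
    """Closed-form minimiser of `energy` (set both partial derivatives to 0)."""
    a = (w1 + 2 * w2 - w3) / 3
    c = (-w1 + w2 - 2 * w3) / 3
    return a, c


# Hypothetical weights: bin-vs-bin, bin-vs-floor, floor-vs-other-bin.
w1, w2, w3 = 1.0, 0.6, 0.1
a, c = synthesise_bins(w1, w2, w3)
print(round(a, 3), round(c, 3), round(a - c, 3))  # 0.7 -0.2 0.9
```

Under this reading the gap between the two bins works out to `(2 * w1 + (w2 + w3)) / 3`, a weighted average of the direct comparison and the two-step comparison, so it necessarily lies between them.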

But there are also the preferences concerning the conveyor belt and the placement of objects there, , and . Applying the energy-minimising formula to and gives the following set of values:

The utility of a world where a rube is on the conveyor belt is given by the following values, from the positions on the belt from left to right: , , , and .

When a rube leaves the room via the conveyor belt, utility changes by that last value minus , ie by .

Now, what about ? Well, the weight of is , and the weight of is less than that. Since weights cannot be negative, the weight of is , or, more simply, does not exist as a partial preference. So the corresponding preferences for bleggs are much simpler:

The utility of a world where a blegg is on the conveyor belt is given by the following values, from the positions on the belt from left to right: , , , and .

When a blegg leaves the room via the conveyor belt, utility changes by that last value minus , ie by .

Hum, is something odd going on there? Yes, indeed. Common sense tells us that these values should go , , , , and end with when going off the conveyor belt, rather than ever being positive. But the problem is that the agent has no relative preference between bleggs on the conveyor belt and bleggs on the floor (or anywhere) - doesn’t exist. So the categories of “blegg on the conveyor belt” and “blegg anywhere else” are not comparable. When this happens, we normalise the average of each category to the same value (here, ). Hence the very odd values of the blegg on the conveyor belt.
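The normalisation step is easy to see in code. In this sketch the `recentre` helper, the target average, and the per-square values are all hypothetical; the point is only that shifting a category so that its average hits a fixed value pushes some entries above zero.

```python
from typing import List


def recentre(values: List[float], target_mean: float = 0.0) -> List[float]:
    """Shift every value by the same amount so the average equals `target_mean`."""
    shift = target_mean - sum(values) / len(values)
    return [v + shift for v in values]


# Hypothetical "common sense" values for an object on four belt squares,
# each step to the right being one unit worse than the last.
belt_values = [0.0, -1.0, -2.0, -3.0]
print(recentre(belt_values))  # [1.5, 0.5, -0.5, -1.5]
```

The differences between neighbouring squares are unchanged, so the within-category preferences are respected; only the category's overall level, which the agent has no opinion about, is set by convention.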

Note that, most of the time, only the penalties for leaving the room matter ( for a rube and for a blegg). That’s because while the objects are in the room, the agent can lift them off the conveyor belt, and thus remove the utility penalty for being on it. Only at the end of the episode, after timesteps, will the remaining bleggs or rubes on the conveyor belt matter.

Collect all these conveyor belt utilities together as . Think the values of are ugly and unjustified? As we’ll see, so does the agent itself.

But, for the moment, all this data defines the first utility function as

.

**Feel free to stop reading here if the synthesis of base-level preferences is what you’re interested in**.

# 4. Enter the meta-preferences

We’ll assume that the robot has the following two meta-preferences.

: The preferences concerning the rubes and bleggs should be symmetric with each other (weight ).

: Only the preferences involving the bins should matter (weight ).

Now, will wipe out the term entirely. This may seem surprising, as the term includes rewards that vary between and , while has a weight of only . But recall that the effect of meta-preferences is to change the weights of lower level preferences. And there are only two preferences that don’t involve bins: and , of weight and , respectively. So will wipe them both out, removing from the utilities.

Now, attempts to make the preferences with rube and blegg symmetric, moving each preference by in the direction of the symmetric version. This sets the weight of to , and to , while , and respectively go to weights , , and .

So the base level preferences are almost symmetric, but not quite (they would be symmetric if the weight of was or higher). Re-running energy minimisation with these values gets:
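A minimal sketch of the symmetrising step follows. It assumes that the "symmetric version" of a rube weight and its blegg counterpart is their midpoint, and that the meta-preference weight caps how far each can move; both are assumptions for illustration, as are the numbers.

```python
def move_toward(value: float, target: float, max_step: float) -> float:
    """Move `value` toward `target` by at most `max_step`, without overshooting."""
    if abs(target - value) <= max_step:
        return target
    return value + max_step if target > value else value - max_step


def symmetrise(rube_w: float, blegg_w: float, meta_weight: float):
    """Pull a rube weight and its blegg counterpart toward their midpoint."""
    midpoint = (rube_w + blegg_w) / 2
    return (
        move_toward(rube_w, midpoint, meta_weight),
        move_toward(blegg_w, midpoint, meta_weight),
    )


# Hypothetical weights: the meta-preference is too weak to close the gap...
print(symmetrise(1.0, 0.4, meta_weight=0.2))
# ...but a stronger one makes the pair fully symmetric.
print(symmetrise(1.0, 0.4, meta_weight=0.5))
```

Under these assumptions a pair becomes exactly symmetric once the meta-preference weight is at least half the gap between the two weights, which mirrors the "almost symmetric, but not quite" outcome described above.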

.

# 5. Enter the synthesis meta-preferences

Could the robot have any meta-preferences over the synthesis process? There is one simple possibility: the process so far has not used the standard deviation of the weights. We could argue that partial preferences where the weight has high variance should be penalised for the uncertainty. One plausible way of doing so would be:

: before combining preferences, the weight of each is reduced by the standard deviation.

This (combined with the other, standard meta-preferences) reduces the weight of the partial preferences to:

.

These weights generate .
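The penalty itself is a one-liner. The sketch below uses hypothetical preference labels and hypothetical (mean, std) pairs, and clamps at zero on the grounds, noted earlier, that weights cannot be negative.

```python
from typing import Dict, Tuple


def penalise_uncertainty(mean_weight: float, std: float) -> float:
    """Reduce a weight by its standard deviation, never going below zero."""
    return max(0.0, mean_weight - std)


# Hypothetical (mean weight, std) pairs; not the values from this post.
preferences: Dict[str, Tuple[float, float]] = {
    "bin vs other bin": (1.0, 0.25),
    "bin vs floor": (0.5, 0.125),
    "floor vs other bin": (0.1, 0.3),
}

penalised = {name: penalise_uncertainty(m, s) for name, (m, s) in preferences.items()}
print(penalised)
# {'bin vs other bin': 0.75, 'bin vs floor': 0.375, 'floor vs other bin': 0.0}
```

A preference whose standard deviation exceeds its mean weight drops out entirely, so noisy low-weight preferences are the first to disappear.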

Then if is the weight of and the default weight of the standard synthesis process, the utility function becomes

.

Note the critical role of in this utility definition. For standard meta-preferences, we only care about their weight relative to each other (and relative to the weight of standard preferences). However, for synthesis meta-preferences, the default weight of the standard synthesis is also relevant.
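To see why the default weight matters, consider one plausible way of combining the two candidate utilities: a weighted average. This normalised form is an assumption made for illustration and may differ from the exact formula intended above; `mix`, `w_default`, and `w_meta` are hypothetical names.

```python
def mix(u_standard: float, u_penalised: float, w_default: float, w_meta: float) -> float:
    """Weighted average of the standard and the std-penalised utility.

    Assumed combination rule, not necessarily the post's exact formula.
    """
    total = w_default + w_meta
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return (w_default * u_standard + w_meta * u_penalised) / total


# Same synthesis meta-preference weight, different default weights
# (hypothetical utilities for one particular outcome).
print(mix(1.0, 0.0, w_default=1.0, w_meta=1.0))  # 0.5
print(mix(1.0, 0.0, w_default=3.0, w_meta=1.0))  # 0.75
```

Holding the meta-preference weight fixed, a larger default weight pulls the result back toward the standard synthesis, so the outcome is not determined by the meta-preference weights alone.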

# 6. Identity preferences

We can add identity preferences too. Suppose that there were two robots in the same room, both classifying these objects. Their utility functions for individually putting the objects in the bins are and , respectively.

If asked, ahead of time, the robot would assign equal weight to them putting the rube/blegg in the bins as to the other robot doing so^{[3]}. Thus, based on their estimate at time , the correct utility is:

.

However, when asked mid-episode about which agent should put the objects in the bins, the robot’s answer is much more complicated, and depends on how many each robot has put in the bin already. The robot wants to generate at least as much utility as its counterpart. Until it does that, it really prioritises boosting its own utility (); after reaching or surpassing the other utility (), then it doesn’t matter which utility is boosted.

If that result were strict, then the agent’s actual utility would be the minimum of and . Let’s assume the data is not quite so strict, and that when we try and collapse these various hypotheticals, we get a smoothed version of the minimum, for some :

,

with and .
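For intuition, here is one standard way to smooth a minimum, the log-sum-exp "soft-min". It is offered as a stand-in, not as the specific smoothing used above; the `sharpness` parameter is a hypothetical name, and larger values bring the result closer to the strict minimum.

```python
import math


def smooth_min(a: float, b: float, sharpness: float) -> float:
    """Log-sum-exp soft minimum of two utilities.

    Always at most min(a, b); approaches min(a, b) as `sharpness` grows.
    Subtracting the minimum first keeps the exponentials from overflowing.
    """
    if sharpness <= 0:
        raise ValueError("sharpness must be positive")
    m = min(a, b)
    total = math.exp(-sharpness * (a - m)) + math.exp(-sharpness * (b - m))
    return m - math.log(total) / sharpness


own, other = 1.0, 3.0  # hypothetical utilities generated by each robot
for k in (0.5, 2.0, 50.0):
    print(k, round(smooth_min(own, other, k), 4))
```

With any such smoothing, raising the lower of the two utilities increases the result much more than raising the higher one, which matches the robot prioritising its own utility until it catches up with its counterpart.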

So, note that identity preferences are of the same *type* as standard preferences^{[4]}; we just expect them to be more complex to define, being non-linear.

## 6.1 Correcting ignorance

So, which versions of the identity preferences are correct - or ? In the absence of meta-preferences, we just need a default process—should we prioritise the current estimate of partial-preferences, or the expected future estimates?

If the robot has meta-preferences, this allows for the synthesis process to correct for the robot’s ignorance. It may have a strong meta-preference for using expected future estimates, **and** it may believe currently that is what would result from these. The synthesis process, however, knows better, and picks .

**Feel free to stop reading here for standard preferences and meta-preferences; the next two sections deal with situations where the basic definitions become ambiguous**.

# 7. Breaking the bounds of the toy problem

## 7.1 Purple cubes and eggs

Assume the robot has a utility of the type . And, after a while, the robot sees some purple shapes enter, both cubes and eggs.

Obviously a purple cube is closer to a rube than to a blegg, but it is neither. There seem to be three obvious ways of extending the definition:

- same as before, the purple objects are ignored. This needs some sort of sharp boundary between a slightly-purple rube and a slightly-red purple cube.

. Here is the same as , but for a purple cube rather than a rube (and is the same as , but for a purple egg rather than a blegg). This definition can be extended for other ambiguous objects; if anything is rube, then it is treated as a rube with reward scaled by .

, where involves putting a purple cube in the second bin, and involve putting a purple egg in the third bin. In this example, the two intermediate bins are used to interpolate between rube and blegg; the task of the robot is to sort out into which of the four bins the object best fits.

Which of , , and is a better extrapolation of ? Well, this will depend on how machine learning extrapolates concepts, or on default choices we make, or on meta-preferences of the agent.
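The second option, scaling the reward by how much of a rube or blegg an object is, can be sketched as follows. The reward table and the membership numbers are hypothetical, and the two memberships are kept independent rather than forced to sum to one.

```python
from typing import Dict

# Hypothetical bin rewards for unambiguous objects.
BIN_REWARD: Dict[str, Dict[str, float]] = {
    "rube": {"rube_bin": 1.0, "blegg_bin": -1.0},
    "blegg": {"rube_bin": -1.0, "blegg_bin": 1.0},
}


def ambiguous_reward(memberships: Dict[str, float], bin_name: str) -> float:
    """Reward for binning an object, scaled by its degree of rube-ness/blegg-ness."""
    return sum(p * BIN_REWARD[kind][bin_name] for kind, p in memberships.items())


purple_cube = {"rube": 0.5, "blegg": 0.125}  # assumed degrees of membership
print(ambiguous_reward(purple_cube, "rube_bin"))   # 0.375
print(ambiguous_reward(purple_cube, "blegg_bin"))  # -0.375
```

An object with zero membership in both categories earns nothing from either bin, which recovers the first option of ignoring it.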

## 7.2 Web of connotation: purple icosahedron

Now imagine that a purple icosahedron enters the room, and the robot is running either or (and thus doesn’t simply ignore purple objects).

How should it treat this object? The colour is exactly in between red and blue, while the shape can be seen as close to a sphere (an icosahedron is pretty round) or a cube (it has sharp edges and flat faces).

In fact, the web of connotations of rubes and bleggs looks like this:

[Figure: the web of connotations of rubes and bleggs.]

The robot’s strongest connotation for the rube/blegg is its colour, which doesn’t help here. But the next strongest is the sharpness of the edges (maybe the robot has sensitive fingers). Thus the purple icosahedron is seen as closer to a rube. But this is purely a consequence of how the robot does symbol grounding; another robot could see it more as a blegg.

# 8. Avoiding ambiguous distant situations

Suddenly, strange objects start to appear on the conveyor belt. All sorts of shapes and sizes, colours, and textures; some change shape as they move, some are alive, some float in the air, some radiate glorious light. None of these looks remotely like a blegg or a rube. This is, in the terminology of this post, an ambiguous distant situation.

There is also a button the agent can press; if it does so, strange objects no longer appear. This won’t net any more bleggs or rubes; how much might the robot value pressing that button?

Well, if the robot’s utility is of the type , then it doesn’t value pressing the button at all. For it can just ignore all the strange objects, and get precisely the same utility as pressing the button would give it.

But now suppose the agent’s utility is of the type , where is the conveyor-belt utility described above. Now a conservative agent might prefer to press the button. It gets for any rube that travels through the room on a conveyor belt (and for any blegg that does so). It doesn’t know if any of the strange objects count as rubes; but it isn’t sure about that either. If an object enters that ranks as rubeblegg (note that need not be ; consider a rube glued to a blegg), then it might lose if the object exits the room. But it might lose or if it puts the object in the wrong bin.

Given conservatism assumptions (that potential losses in ambiguous situations rank higher than potential gains), pressing the button to avoid that object might be worth up to to the robot^{[5]}.

[1] For this example, there’s no need to distinguish reward functions from utility functions.

[2] North, East, South, or West from its current position.

[3] This is based on the assumption that the weights in the definitions and are the same, so we don’t need to worry about the relative weights of those utilities.

[4] If we wanted to determine the weights at a particular situation: let be the standard weight for comparing an object in a bin with an object somewhere else (same for and for ). Then for the purpose of the initial robot moving the object to the bin, the weight of that comparison for given values of and , is , while for the other robot moving the object to the bin, the weight of that comparison is .

[5] That’s the formula if the robot is ultra-fast or the objects on the conveyor belt are sparse. The true formula is more complicated, because it depends on how easy it is for the robot to move the object to the bins without missing out on other objects. But, in any case, pressing the button cannot be worth more than , the maximal penalty the robot gets if it simply ignores the object.

This is really handy. I didn’t have much to say, but revisited this recently and figured I’d write down the thoughts I *did* think.

My general feeling about human models is that they need precisely one more level of indirection than this. Too many levels of indirection, and you get something that correctly predicts the world, but doesn’t contain something you can point to as the desires. Too few, and you end up trying to fit human examples with a model that doesn’t do a good job of fitting human behavior.

For example, if you build your model on responses to survey questions, then what about systematic human difficulties in responding to surveys (e.g. difficulty using a consistent scale across several orders of magnitude of value) that the humans themselves are unaware of? I’d like to use a model of humans that learns about this sort of thing from non-survey-question data.