Research Agenda v0.9: Synthesising a human’s preferences into a utility function

Stuart_Armstrong, 17 Jun 2019 17:46 UTC

I’m now in a position where I can see a possible route to a safe/survivable/friendly Artificial Intelligence being developed. I’d give a 10+% chance of it being possible this way, and a 95% chance that some of these ideas will be very useful for other methods of alignment. So I thought I’d encode the route I’m seeing as a research agenda; this is the first public draft of it.

Clarity, rigour, and practicality: that’s what this agenda needs. Writing this agenda has clarified a lot of points for me, to the extent that some of it now seems, in retrospect, just obvious and somewhat trivial—“of course that’s the way you have to do X”. But more clarification is needed in the areas that remain vague. And, once these are clarified enough for humans to understand, they need to be made mathematically and logically rigorous—and ultimately, cashed out into code, and tested and experimented with.

So I’d appreciate any comments that could help with these three goals, and welcome anyone interested in pursuing research along these lines over the long-term.

Note: I periodically edit this document, to link it to more recent research ideas/discoveries.

0 The fundamental idea

This agenda fits itself into the broad family of Inverse Reinforcement Learning: delegating most of the task of inferring human preferences to the AI itself. Most of the task, since it’s been shown that humans need to build the right assumptions into the AI, or else the preference learning will fail.

To get these “right assumptions”, this agenda will look into what preferences actually are, and how they may be combined together. There are hence four parts to the research agenda:

A way of identifying the (partial^[1]) preferences of a given human $H$ .
A way for ultimately synthesising a utility function $U_{H}$ that is an adequate encoding of the partial preferences of a human $H$ .
Practical methods for estimating this $U_{H}$ , and how one could use the definition of $U_{H}$ to improve other suggested methods for value-alignment.
Limitations and lacunas of the agenda: what is not covered. These may be avenues of future research, or issues that cannot fit into the $U_{H}$ paradigm.

There have been a myriad of small posts on this topic, and most will be referenced here. Most of these posts are stubs that hint at a solution, rather than spelling it out fully and rigorously.

The reason for that is to check for impossibility results ahead of time. The construction of $U_{H}$ is deliberately designed to be adequate, rather than elegant (indeed, the search for an elegant $U_{H}$ might be counterproductive and even dangerous, if genuine human preferences get sacrificed for elegance). If this approach is to work, then the safety of $U_{H}$ has to be robust to different decisions in the synthesis process (see Section 2.8, on avoiding disasters). Thus, initially, it seems more important to find approximate ideas that cover all possibilities, rather than having a few fully detailed sub-possibilities and several gaps.

Finally, it seems that if a sub-problem is not formally solved, we stand a much better chance of getting a good result from “hit it with lots of machine learning and hope for the best”, than we would if there were huge conceptual holes in the method—a conceptual hole meaning that the relevant solution is broken in an unfixable way. Thus, I’m publishing this agenda now, where I see many implementation holes, but no large conceptual holes.

A word of warning here, though: with some justification, the original Dartmouth AI conference could also have claimed to be confident that there were no large conceptual holes in their plan of developing AI over a summer—and we know how wrong they turned out to be. With that thought in mind, onwards with the research agenda.

0.1 Executive summary: synthesis process

The first idea of the project is to identify partial preferences as residing within human mental models. This requires identifying the actual and hypothetical internal variables of a human, and thus solving the “symbol grounding problem” for humans; ways of doing that are proposed.

The project then sorts the partial preferences into various categories of interest (basic preferences about the world, identity preferences, meta-preferences about basic preferences, global meta-preferences about the whole synthesis project, etc...). The aim is then to synthesise these into a single utility function $U_{H}$ , representing the preference of the human $H$ (at a given time or short interval of time). Different preference categories play different roles in this synthesis (eg object-level preferences get aggregated, meta-preferences can modify the weights of object-level preferences, global meta-preferences are used at the design stage, and so on).

The aims are to:

1. Ensure the synthesis $U_{H}$ has good properties and reflects $H$’s actual preferences, and not any of $H$’s erroneous factual beliefs.
2. Ensure that highly valued preferences are weighted more heavily than lightly held ones, even if the lightly held one is more “meta” than the other.
3. Respect meta-preferences about the synthesis as much as possible, but...
4. ...always ensure that the synthesis actually reaches a non-contradictory $U_{H}$.

To ensure points 2 and 4, there will always be an initial way of synthesising preferences, which certain meta-preferences can then modify in specific ways. This is designed to resolve contradictions (when “I want a simple moral system” and “value is fragile and needs to be preserved” are both comparably weighted meta-preferences) and remove preference loops (“I want a simple moral system” is itself simple and could reinforce itself; “I want complexity in my values” is also simple and could undermine itself).

The “good properties” of 1. are established, in large part, by the global meta-preferences that don’t comfortably sit within the synthesis framework. As for erroneous beliefs, if $H$ wants to date $H^{'}$ because they think that would make them happy and respected, then an AI will synthesise “being happy” and “being respected” as preferences, and would push $H$ away from $H^{'}$ if $H$ were actually deluded about what dating them would accomplish.

That is the main theoretical contribution of the research agenda. It then examines what could be done with such a theory in practice, and whether the theory can be usefully approximated for constructing an actual utility function for an AI.

0.2 Executive summary: agenda difficulty and value

One early commentator on this agenda remarked:

[...] it seems like this agenda is trying to solve at least 5 major open problems in philosophy, to a level rigorous enough that we can specify them in code:

The symbol grounding problem.
Identifying what humans really care about (not just what they say they care about, or what they act like they care about) and what preferences and meta-preferences even are.
Finding an acceptable way of making incomplete and inconsistent (meta-)preferences complete and consistent.
Finding an acceptable way of aggregating many people’s preferences into a single function^[2].
The nature of personal identity.

I agree that AI safety researchers should be more ambitious than most researchers, but this seems extremely ambitious, and I haven’t seen you acknowledge the severe outside-view difficulty of this agenda.

This is indeed an extremely ambitious project. But, in a sense, a successful aligned AI project will ultimately have to solve all of these problems. Any situation in which most of the future trajectory of humanity is determined by AI, is a situation where there are solutions to all of these problems.

Now, these solutions may be implicit rather than explicit; equivalently, we might be able to delay solving them via AI, for a while. For example, a tool AI solves these issues by being contained in such a way that human judgement is capable of ensuring good outcomes. Thus humans solve the grounding problem, and we design our questions to the AI to ensure compatibility with our preferences, and so on.

But as the power of AIs increases, humans will be confronted by situations they have never been in before, and our ability to solve these issues will diminish (and the probability that we might be manipulated or fall into a bad attractor will increase). This transition may sneak up on us, so it is useful to start thinking of how to a) start solving these problems, and b) start identifying these problems crisply so we can know when and whether they need to be solved, and when we are moving out of the range of validity of the “trust humans” solution. For both these reasons, all the issues will be listed explicitly in the research agenda.

A third reason to include them is so that we know what we need to solve those issues for. For example, it is easier to assess the quality of any solution to symbol grounding, if we know what we’re going to do with that solution. We don’t need a full solution, just one good enough to define human partial preferences.

And, of course, we need to also consider scenarios where partial approaches like tool AI just don’t work, or only work if we solve all the relevant issues anyway.

Finally, there is a converse: partial solutions to problems in this research agenda can contribute to improving other methods of AI safety alignment. Section 3 will look into this in more detail. The basic idea is that, to improve an algorithm or an approach, it is very useful to know what we are ultimately trying to do (eg compute partial preferences, or synthesise a utility function with certain acceptable properties). If we rely only on making local improvements, guided by intuition, we may ultimately get stuck when intuition runs out; and the improvements are more likely to be ad-hoc patches than consistent, generalisable rules.

0.3 Executive aside: the value of approximating the theory

The theoretical construction of $U_{H}$ in Sections 1 and 2 is a highly complicated object, involving millions of unobserved counterfactual partial preferences and a synthesis process involving higher-order meta-preferences. Section 3 touches on how $U_{H}$ could be approximated, but, given its complexity, it would seem that the answer would be “only very badly”.

And there is a certain sense in which this is correct. If $U_{H}$ is the actual idealised utility defined by the process, and $V_{H}$ is the approximated utility that a real-world AI could compute, then it is likely^[3] that $U_{H}$ and $V_{H}$ will be quite different in many formal senses.

But there is a certain sense in which this is incorrect. Consider many of the AI failure scenarios. Imagine that the AI, for example, extinguished all meaningful human interactions because these can sometimes be painful and the AI knows that we prefer to avoid pain. But it’s clear to us that most people’s partial preferences will not endorse total loneliness as a good outcome; if it’s clear to us, then it’s a fortiori clear to a very intelligent AI; hence the AI will avoid that failure scenario.

One should be careful with using arguments of this type, but it is hard to see how there could be a failure mode that a) we would clearly understand is incompatible with proper synthesis of $U_{H}$ , but b) a smart AI would not. And it seems that any failure mode should be understandable to us, as a failure mode, especially given some of the innate conservatism of the construction of $U_{H}$ .

Hence, even if $V_{H}$ is a poor approximation of $U_{H}$ in a certain sense, it is likely an excellent approximation of $U_{H}$ in the sense of avoiding terrible outcomes. So, though $d (U_{H}, V_{H})$ might be large for some formal measure of distance $d$, a world where the AI maximises $V_{H}$ will be highly ranked according to $U_{H}$.
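As a toy illustration of this point, here is a minimal sketch in which the worlds, the numbers, and the choice of $d$ (the largest pointwise gap) are all invented for the example. The two utilities are far apart under this $d$, yet the world that maximises $V_{H}$ is still near the top of the ranking given by $U_{H}$.

```python
# Toy illustration only: worlds and utility values are invented assumptions.
U_H = {  # stand-in for the idealised utility
    "flourishing": 10.0,
    "comfortable": 8.0,
    "status quo": 0.0,
    "total loneliness": -50.0,
    "extinction": -100.0,
}
V_H = {  # stand-in for the approximation a real-world AI could compute
    "flourishing": 4.0,
    "comfortable": 5.0,
    "status quo": 3.0,
    "total loneliness": -5.0,
    "extinction": -6.0,
}


def sup_distance(u, v):
    """One possible formal distance d: the largest pointwise gap."""
    return max(abs(u[w] - v[w]) for w in u)


def rank_under(u, world):
    """Rank of `world` according to u, where 1 is the best world."""
    ordered = sorted(u, key=u.get, reverse=True)
    return ordered.index(world) + 1


best_for_V = max(V_H, key=V_H.get)
print("d(U_H, V_H) =", sup_distance(U_H, V_H))
print("V_H-optimal world:", best_for_V)
print("its rank under U_H:", rank_under(U_H, best_for_V))
```

The distance is dominated by how badly $V_{H}$ underestimates the worst worlds, which does not matter to a maximiser as long as it still recognises them as bad.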

0.4 An inspiring just-so story

This is the story of how evolution created humans with preferences, and what the nature of these preferences are. The story is not true, in the sense of accurate; instead, it is intended to provide some inspiration as to the direction of this research agenda. This section can be skipped.

In the beginning, evolution created instinct-driven agents. These agents had no preferences or goals, nor did they need any. They were like Q-learning agents: they knew the correct action to take in different circumstances, but that was it. Consider baby turtles that walk towards the light upon birth, because, traditionally, the sea was lighter than the land; of course, this behaviour fails them in the era of artificial lighting.

But evolution has a tiny bandwidth, acting once per generation. So it created agents capable of planning, of figuring out different approaches, rather than having to follow instincts. This was useful, especially in varying environments, and so evolution offloaded a lot of its “job” onto the planning agents.

Of course, to be of any use, the planning agents needed to be able to model their environment to some extent (or else their plans couldn’t work) and had to have preferences (or else every plan was as good as another). So, in creating the first planning agents, evolution created the first agents with preferences.

Of course, evolution is a messy, undirected process, so the process wasn’t clean. Planning agents are still riven with instincts, and the modelling of the environment is situational, used when it is needed, rather than forming some consistent whole. Thus the “preferences” of these agents were underdefined and sometimes contradictory.

Finally, evolution created agents capable of self-modelling and of modelling other agents in their species. This might have been because of competitive social pressures as agents learn to lie and detect lying. Of course, this being evolution, this self-and-other-modelling took the form of kludges built upon spandrels built upon kludges.

And then arrived humans, who developed norms and norm-violations. As a side effect of this, we started having higher-order preferences as to what norms and preferences should be. But instincts and contradictions remained—this is evolution, after all.

And evolution looked upon this hideous mess, and saw that it was good. Good for evolution, that is. But if we want it to be good for us, we’re going to need to straighten out this mess somewhat.

1 The partial preferences of a human

The main aim of this research agenda is to start with a human $H$ at or around a given moment $t$ and produce a utility function $U_{H_{t}}$ which is an adequate synthesis of the human’s preferences at the time $t$. Unless the dependence on $t$ needs to be made explicit, this will simply be designated as $U_{H}$.

Later sections will focus on what can be done with $U_{H}$ or the methods used for its construction; this section and the next will focus solely on that construction. It is mainly based on these posts, with some commentary and improvements.

Essentially the process is to identify human preferences and meta-preferences within human (partial) mental models (Section 1), and find some good way of synthesising these into a whole $U_{H}$ (Section 2).

Partial preferences (see Section 1.1) will be decomposed into:

1. Partial preferences about the world.
2. Partial preferences about our own identity.
3. Partial meta-preferences about our preferences.
4. Partial meta-preferences about the synthesis process.
5. Self-referential contradictory partial meta-preferences.
6. Global meta-preferences about the outcome of the synthesis process.

This section and the next will lay out how preferences of types 1, 2, 3, and 4 can be used to synthesise the $U_{H}$. Section 2 will conclude by looking at what role preferences of type 6 can play. Preferences of type 5 are not dealt with in this agenda, and remain a perennial problem (see Section 4.5).

1.1 Partial models, partial preferences

As was shown in the paper “Occam’s razor is insufficient to infer the preferences of irrational agents”, an agent’s behaviour is never enough to establish their preferences—even with simplicity priors or regularisation (see also this post and this one).

Therefore a definition of preference needs to be grounded in something other than behaviour. There are further arguments, presented here, as to why a theoretical grounding is needed even when practical methods are seemingly adequate; this point will be returned to later.

The first step is to define a partial preference (and a partial model for these to exist in). A partial preference is a preference that exists within a human being’s internal mental model, and which contrasts two^[4] situations along a single axis of variation, keeping other aspects constant. For example, “I wish I was rich (rather than poor)”, “I don’t want to go down that alley, lest I get mugged”, and “this is much worse if there are witnesses around” are all partial preferences. A more formal definition of partial preferences, and the partial mental model in which they exist, is presented here.

Note that this is one of the fundamental theoretical underpinnings of the method. It identifies human (partial) preferences as existing within human mental models. This is a “normative assumption”: we choose to define these features as (partial) human preferences, the universe does not compel us to do so.

This definition gets around the “Occam’s razor” impossibility result, since these mental models are features of the human brain’s internal process, not of human behaviour. Conversely, this also violates certain versions of functionalism, precisely because the internal mental states are relevant.

A key feature is to extract not only the partial preference itself, but also its intensity, referred to as its weight. This will be key in combining the preferences together (technically, we only need the weight relative to other partial preferences).
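To make the shape of this data concrete, here is a minimal sketch of how a partial preference and its weight might be recorded. The class name, field names, and example weights are assumptions made for illustration; the formal definition referred to above is richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialPreference:
    """Hypothetical record of one partial preference inside a partial model."""
    axis: str          # the single axis of variation being contrasted
    preferred: str     # the situation the human prefers along that axis
    dispreferred: str  # the situation it is contrasted with
    weight: float      # intensity; only meaningful relative to other weights


def relative_weights(prefs):
    """Rescale (positive) weights so that they sum to 1."""
    total = sum(p.weight for p in prefs)
    return {p.axis: p.weight / total for p in prefs}


# Invented example weights: the fear of being mugged is held more strongly.
prefs = [
    PartialPreference("wealth", "being rich", "being poor", weight=2.0),
    PartialPreference("alley", "avoiding that alley", "getting mugged", weight=6.0),
]
print(relative_weights(prefs))
```

Since only relative weights matter, multiplying every weight by the same positive constant leaves the output of `relative_weights` unchanged.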

1.2 Symbol grounding

In order to interpret what a partial model means, we need to solve the old problem of symbol grounding. “I wish I was rich” was presented as an example of a partial preference; but how can we identify “I”, “rich” and the counterfactual “I wish”, all within the mess of the neural net that is the human brain?

To ground these symbols, we should approach the issue of symbol grounding empirically, by aiming to predict the values of real-world variables through knowledge of internal mental variables (see also the example presented here). This empirical approach can provide sufficient grounding for the purposes of partial models, even if symbol grounding is not solved in the traditional linguistic sense of the problem.

This is because each symbol has a web of connotations, a collection of other symbols and concepts that co-vary with it, in normal human experience. Since the partial models are generally defined to be within normal human experiences, there is little difference between any symbols that are strongly correlated.

To formalise and improve this definition, we’ll have to be careful about how we define the internal variables in the first place—overly complicated or specific internal variables can be chosen to correlate artificially well with external variables. This is, essentially, “symbol grounding overfitting”.
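A minimal sketch of what “symbol grounding overfitting” could look like, under heavy assumptions: the internal variable is reduced to a single made-up activation number, the external variable is whether the person is actually rich, and all the data is invented. A simple candidate variable keeps predicting the external variable on held-out data, while an overly specific one only matches the situations it was chosen on.

```python
# Invented data: (hypothetical internal activation, external fact "is rich").
train = [(0.9, True), (0.8, True), (0.2, False),
         (0.1, False), (0.6, True), (0.4, False)]
held_out = [(0.7, True), (0.3, False), (0.95, True), (0.05, False)]


def simple_variable(activation):
    """A simple candidate internal variable: a threshold on the activation."""
    return activation > 0.5


# An overly specific candidate: it memorises exactly the training activations
# that co-occurred with being rich.
memorised = {a for a, rich in train if rich}


def overfitted_variable(activation):
    return activation in memorised


def accuracy(variable, data):
    """Fraction of cases where the internal variable predicts the external one."""
    return sum(variable(a) == rich for a, rich in data) / len(data)


for name, var in [("simple", simple_variable), ("overfitted", overfitted_variable)]:
    print(name, accuracy(var, train), accuracy(var, held_out))
```

Both candidates correlate perfectly with the external variable on the data they were selected on; only the held-out comparison separates them.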

Another consideration is the extent to which the model is conscious or subconscious; aliefs, for example, could be modelled as subconscious partial preferences. For consciously endorsed aliefs, this is not much of a problem—we instinctively fear touching fires, and don’t desire to lose that fear. But if we don’t endorse that alief—for example, we might fear flying and not want to fear it—this becomes more tricky. Things get confusing with partially endorsed aliefs: amusement park rides are extremely safe, and we wouldn’t want to be crippled with fear at the thought of going on one. But neither would we want the experience to feel perfectly bland and safe.

1.3 Which (real and hypothetical) partial models?

Another important consideration is that humans do not have, at the moment $t$ , a complete set of partial models and partial preferences. They may have a single partial model in mind, with maybe a few others in the background—or they might not be thinking about anything like this at all. We could extend the parameters to some short period around the time $t$ (reasoning that people’s preferences rarely change in such a short time), but though that gives us more data, it doesn’t give us nearly enough.

The most obvious way to get a human to produce an internal model is to ask them a relevant question. But we have to be careful about this—since human values are changeable and manipulable, the very act of asking a question can cause humans to think in certain directions, and even create partial preferences where none existed. The more interaction between the questioner and the human, the more extreme preferences can be created. If the questioner is motivated to maximise the utility function that it is also computing (i.e. if the $U_{H}$ is an online learning process), then the questioner can rig or influence the learning process.

Fortunately, there are ways of removing the questioner’s incentives to rig or influence the learning process.

Thus the basic human preferences at time $t$ are defined to be those partial models produced by “one-step hypotheticals”^[5]. These are questions that do not cause the human to be put in unusual mental situations, and try and minimise any departure from the human’s base-state. We need to distinguish between simple and composite partial preferences: the latter happen when a hypothetical question elicits a long chain of reasoning, covering multiple partial preferences, rather than a single clear answer based on a single internal model.

Some preferences are conditional (eg “I want to eat something different from what I’ve eaten so far this week”), as are some meta-preferences (eg “If I hear a convincing argument about X being good, I want to prefer X”), which could violate the point of the one-step hypothetical. Thus conditional (meta-)preferences are only acceptable if their conditions are achieved by short streams of data, unlikely to manipulate the human. They also should be weighted more if they fit a consistent narrative of what the human is/wants to be, rather than being ad hoc (this will be assessed by machine learning, see Section 2.4).

Note that among the one-step hypotheticals, are included questions about rather extreme situations—heaven and hell, what to do if plants were conscious, and so on. In general, we should reduce the weight^[6] of partial preferences in extreme situations^[7]. This is because of the unfamiliarity of these situations, and because the usual human web of connotations between concepts may have broken down (if a plant was conscious, would it be a plant in the sense we understand that?). Sometimes the breakdown is so extreme that we can say that the partial preference is factually wrong. This includes effects like the hedonic treadmill: our partial models of achieving certain goals often include an imagined long-term satisfaction that we would not actually feel. Indeed, it might be good to specifically avoid these extreme situations, rather than having to make a moral compromise that might lose part of $H$ ’s values due to uncertainty. In that case, ambiguous extreme situations get a slight intrinsic negative—that might be overcome by other considerations, but is there nonetheless.

A final consideration is that some concepts just disintegrate in general environments—for example, consider a preference for “natural” or “hand-made” products. In those cases, the web of connotations can be used to extract some preferences in general—for example, “natural”, used in this way has connotations^[8] of “healthy”, “traditional”, and “non-polluting”, all of which extend better to general environments than “natural” does. Sometimes, the preference can be preserved but routed around: some versions of “no artificial genetic modifications” could be satisfied by selective breeding that achieved the same result. And some versions couldn’t; it’s all a function of what powers the underlying preference: specific techniques, or a general wariness of these types of optimisation. Meta-preferences might be very relevant here.

2 Synthesising the preference utility function

Here we will sketch out the construction of the human utility function $U_{H}$ , from the data that is the partial preferences and their (relative) weights.

This is not, by any means, the only way of constructing $U_{H}$ . But it is illustrative of how the utility could be constructed, and can be more usefully critiqued and analysed than a vaguer description.

2.1 What sort of utility function?

Partial preferences are defined over states of the world or states of the human $H$. The latter include things like “being satisfied with life” (purely internal) and “being an honourable friend” (mostly about $H$’s behaviour).

Consequently, $U_{H}$ must also be defined over such things, so $U_{H}$ is dependent on states of the world and states of the human $H$ . Unlike standard MDP-like situations, these states can include the history of the world or of $H$ up to that point—preferences like “don’t speak ill of the dead” abound in humans.
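A minimal sketch of the difference this makes, using the “don’t speak ill of the dead” example. The event names and the penalty of 1 are assumptions made for illustration. The utility takes the whole history as its argument, so two histories that end with the same event can be scored differently.

```python
def utility_over_history(history):
    """Toy history-dependent utility.

    `history` is a list of (event, person) pairs in chronological order.
    Speaking ill of someone is only penalised if they died earlier in the history.
    """
    dead = set()
    score = 0.0
    for event, person in history:
        if event == "dies":
            dead.add(person)
        elif event == "speaks_ill_of" and person in dead:
            score -= 1.0
    return score


# Both histories end with the same event, but only the second is penalised.
alive_history = [("speaks_ill_of", "Alice")]
dead_history = [("dies", "Alice"), ("speaks_ill_of", "Alice")]
print(utility_over_history(alive_history), utility_over_history(dead_history))
```

A utility defined only on the final state could not distinguish these two cases unless the state itself were extended to carry the relevant history.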

2.2 Why a utility function?

Why should we aim to synthesise a utility function, when human preferences are very far from being utility functions?

It’s not out of an innate admiration for utility functions, or a desire for mathematical elegance. It’s because they tend to be stable under self-modification. Or, to be more accurate, they seem to be much more stable than preferences that are not utility functions.

In the imminent future, human preferences are likely to become stable and unchanging. Therefore it makes more sense to create a preference synthesis that is already stable, than to create a potentially unstable one and let it randomly walk itself to stability (though see Section 4.6).

Also, and this is one of the motivations behind classical inverse reinforcement learning, reward/utility functions tend to be quite portable, and can be moved from one agent to another or from one situation to another, with greater ease than other goal structures.

2.3 Extending and normalising partial preferences

Human values are changeable, manipulable, underdefined, and contradictory. By focusing around time $t$ , we have removed the changeable problem for partial preferences (see this post for thoughts on how long a period around $t$ should be allowed); manipulable has been dealt with by removing the possibility of the AI influencing the learning process.

Being underdefined remains a problem, though. It would be possible to overfit absurdly specifically to the human’s partial models, and generate a $U_{H}$ that is in full agreement with our partial preferences and utterly useless. So the first thing to do is to group the partial preferences together according to similarity (for example, preferences for concepts closely related in terms of webs of connotations should generally be grouped together), and generalise them in some regularised way. Generalise means, here, that they are transformed into full preferences, comparing all possible universes. Though this would only be comparing on the narrow criteria that were used for the partial preference: a partial preference expressing a fear of being mugged could generalise to a fear of pain/violence/violation/theft across all universes, but would not include other aspects of our preferences. So they are full preferences, in terms of applying to all situations, but not the full set of our preferences, in terms of taking into account all our partial preferences.

It seems that standard machine learning techniques should already be up to the task of making full preferences from collections of partial preferences (with all the usual current problems). For example, clustering of similar preferences would be necessary. There are unsupervised ML algorithms that can do that; but even supervised ML algorithms end up grouping labelled data together in ways that define extensions of the labels into higher dimensional space. Where could these labels come from? Well, they could come from grounded symbols within meta-preferences. A meta-preference of the form “I would like to be free of bias” contains some model of what “bias” is; if that meta-preference is particularly weighty, then clustering preferences by whether or not they are biases could be a good thing to do.

Once the partial preferences are generalised in this way, there remains the problem of them being contradictory. This is not as big a problem as it may seem. First of all, it is very rare for preferences to be utterly opposed: there is almost always some compromise available. So an altruist with murderous tendencies could combine charity work with aggressive online gaming; indeed some whole communities (such as BDSM) are designed to balance “opposing” desires for risk and safety.

So in general, the way to deal with contradictory preferences is to weight them appropriately, then add them together; any compromise will then appear naturally from the weighted sum^[9].

To do that, we need to normalise the preferences in some way. We might seek to do this in an a priori, principled way, or through partial models that include the tradeoffs between different preferences. Preferences that pertain to extreme situations, far removed from everyday human situations, could also be penalised in this weighting process (as the human should be less certain about these).
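Here is a minimal sketch of “normalise, weight, then add”, using the altruist example from above. The options, scores, and weights are invented, and min-max normalisation is just one simple a priori choice among many, not a method the agenda commits to.

```python
def min_max_normalise(scores):
    """Rescale one preference's scores to [0, 1] (an assumed normalisation)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {option: 0.0 for option in scores}
    return {option: (s - lo) / (hi - lo) for option, s in scores.items()}


def weighted_sum(preferences, weights):
    """Normalise each preference, then add them using the relative weights."""
    normalised = {name: min_max_normalise(s) for name, s in preferences.items()}
    options = next(iter(preferences.values())).keys()
    return {
        option: sum(weights[name] * normalised[name][option] for name in preferences)
        for option in options
    }


compromise = "charity work plus aggressive online gaming"
# Invented raw scores for two generalised, partly opposed preferences.
preferences = {
    "altruism": {"charity work only": 10.0, "actual violence": -20.0, compromise: 9.0},
    "aggression": {"charity work only": 0.0, "actual violence": 5.0, compromise: 4.0},
}
weights = {"altruism": 0.7, "aggression": 0.3}

totals = weighted_sum(preferences, weights)
print(max(totals, key=totals.get))
```

With these numbers the compromise option scores highest, even though neither preference ranks it first: it comes out of the weighted sum without being added as a special case.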

Now that the partial preferences have been identified and weighted, the challenge is to synthesise them into a single $U_{H}$ .

2.4 Synthesising the preference function: first step

So this is how one could do the first step of preference synthesis:

1. Group similar partial preferences together, and generalise them to full preferences without overfitting.
2. Use partial models to compute the relative weight between different partial preferences.
3. Using those relative weights, and again without overfitting, synthesise those preferences into a single utility function $U_{H}^{0}$.

This all seems doable in theory within standard machine learning. See Section 2.3 and the discussion of clustering for point 1. Point 2. comes from the definition of partial preferences. And point 3. is just an issue of fitting a good regularised approximation to noisy data.
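The three steps can be sketched end to end, with large simplifications that should be read as assumptions: the grouping uses hand-written labels as a stand-in for a learned clustering, each group is generalised by averaging its members, no regularisation is performed, and all features and numbers are invented.

```python
from collections import defaultdict

# Each entry: (group label, world feature, direction, weight). All invented.
partial_prefs = [
    ("safety", "risk_of_mugging", -1.0, 3.0),
    ("safety", "risk_of_injury", -1.0, 2.0),
    ("wealth", "income", +1.0, 1.0),
]


def group_preferences(prefs):
    """Step 1 stand-in: group by label instead of a learned clustering."""
    groups = defaultdict(list)
    for label, feature, direction, weight in prefs:
        groups[label].append((feature, direction, weight))
    return dict(groups)


def relative_group_weights(groups):
    """Step 2 stand-in: a group's weight is its share of the total weight."""
    totals = {g: sum(w for _, _, w in members) for g, members in groups.items()}
    grand_total = sum(totals.values())
    return {g: t / grand_total for g, t in totals.items()}


def synthesise_u0(groups, weights):
    """Step 3 stand-in: weighted sum of each group's average signed feature."""
    def u0(world):
        total = 0.0
        for g, members in groups.items():
            score = sum(d * world[f] for f, d, _ in members) / len(members)
            total += weights[g] * score
        return total
    return u0


groups = group_preferences(partial_prefs)
weights = relative_group_weights(groups)
u0 = synthesise_u0(groups, weights)

risky = {"risk_of_mugging": 0.2, "risk_of_injury": 0.1, "income": 0.5}
safe = {"risk_of_mugging": 0.0, "risk_of_injury": 0.0, "income": 0.3}
print(u0(risky), u0(safe))
```

In this toy case the safer world with lower income is preferred, because the safety group carries five sixths of the total weight.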

In a certain sense, this process is the partial opposite of how Jacob Falkovich used a spreadsheet to find a life partner. In that process, he started by factoring his goal of having a life-partner into many different subgoals. He then ranked the putative partners on each of the subgoals by comparing two options at a time, and building a (cardinal) ranking from these comparisons. The process here also aims to assign cardinal values from comparisons of two options, but the construction of the “subgoals” (full preferences) is handled by machine learning from the sets of weighted comparisons.

2.5 Identity preferences

Some preferences are best understood as pertaining to our own identity. For example, I want to understand how black holes work; this is separate from my other preference that some humans understand black holes (and separate again from an instrumental preference that, had we a convenient black hole close to hand, we could use it to extract energy).

Identity preferences seem to be different from preferences about the world; they seem more fragile than other preferences. We could combine identity preferences differently from standard preferences, for example using smoothmin rather than summation. Gratifications seem to be particular types of identity preferences: these are preferences about how we achieved something, rather than what we achieved (eg achieving a particularly clever or impressive victory in a game, rather than just achieving a victory).

Ultimately, the human’s mental exchange rate between preferences should determine how preferences are combined. This should allow us to treat identity and world-preferences in the same way. There are two reasons to still distinguish between world-preferences and identity preferences:

For preferences where relative weights are unknown or ill-defined, linear combinations and smooth-min serve as a good default for world-preferences and identity preferences respectively.
It’s not certain that identity can be fully captured by partial preferences; in that case, identity preferences could serve as a starting point from which to build a concept of human identity.
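To make the contrast between the two default combination rules concrete, here is a small sketch. The source only names "smooth-min"; the specific log-sum-exp formula below, the `sharpness` parameter, and the example numbers are assumptions chosen for illustration.

```python
import math

def linear_combination(values, weights):
    """Weighted sum: the assumed default for world-preferences."""
    return sum(w * v for v, w in zip(values, weights))

def smooth_min(values, sharpness=5.0):
    """Soft minimum: -(1/k) * log(sum(exp(-k * v))).

    Approaches min(values) as the sharpness k grows, so one badly
    satisfied preference drags the total down (the "fragile" behaviour).
    """
    k = sharpness
    m = min(values)
    # Subtracting m before exponentiating keeps the computation stable.
    return m - math.log(sum(math.exp(-k * (v - m)) for v in values)) / k

# Hypothetical satisfaction levels for three preferences.
lopsided = [1.0, 1.0, 0.0]   # two fully satisfied, one ignored
balanced = [0.6, 0.6, 0.6]   # all moderately satisfied

sum_prefers_lopsided = (
    linear_combination(lopsided, [1, 1, 1]) > linear_combination(balanced, [1, 1, 1])
)
smooth_min_prefers_balanced = smooth_min(balanced) > smooth_min(lopsided)
```

With these numbers the sum favours the lopsided outcome while the smooth-min favours the balanced one, which is the sense in which smooth-min treats each identity preference as something that should not be sacrificed entirely.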

2.6 Synthesising the preference function: meta-preferences

Humans generally have meta-preferences: preferences over the kind of preferences they should have (often phrased as preferences over their identity, eg “I want to be more generous”, or “I want to have consistent preferences”).

This is such an important feature of humans, that it needs its own treatment; this post first looked into that.

The standard meta-preferences endorse or unendorse lower-level preferences. First, one can combine them as in the method above to get a synthesised meta-preference. This then increases or decreases the weights of the lower-level preferences, to reach a $U_{H}^{n}$ with preference weights adjusted by the synthesised meta-preferences.

Note that this requires some ordering of the meta-preferences: each meta-preference refers only to meta-preferences “below” itself. Self-referential meta-preferences (or, equivalently, meta-preferences referring to each other in a cycle) are more subtle to deal with, see Section 4.5.

Note that an ordering does not mean that the higher meta-preferences must dominate the lower ones; a weakly held meta-preference (eg a vague desire to fit in with some formal standard of behaviour) need not overrule a strongly held object level preference (eg a strong love for a particular person, or empathy for an enemy).
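The weight adjustment described in this section can be sketched as follows. This is only an illustration under stated assumptions: the multiplicative update `exp(meta_weight * score)`, the function name `apply_meta_preferences`, and the data layout are all hypothetical, and the step of first synthesising the meta-preferences together is skipped for brevity.

```python
import math

def apply_meta_preferences(weights, endorsements, levels):
    """Adjust preference weights using ordered meta-preferences.

    weights: dict name -> positive weight (preferences and meta-preferences).
    endorsements: dict meta_name -> {target_name: score in [-1, 1]};
        positive scores endorse the target, negative scores unendorse it.
    levels: dict name -> int; 0 for object-level preferences. A
        meta-preference may only refer to preferences strictly below it.
    """
    adjusted = dict(weights)
    # Process the highest level first, so each meta-preference acts with
    # a weight that has already been adjusted by those above it.
    for meta in sorted(endorsements, key=lambda name: levels[name], reverse=True):
        for target, score in endorsements[meta].items():
            if levels[target] >= levels[meta]:
                raise ValueError(f"{meta!r} must only refer to preferences below itself")
            # Assumed update rule: the effect scales with the meta-preference's
            # own weight, so a weakly held one changes very little.
            adjusted[target] *= math.exp(adjusted[meta] * score)
    return adjusted

# Hypothetical example: a weak desire to fit a formal standard of behaviour
# unendorses a strongly held love for a particular person.
example = apply_meta_preferences(
    weights={"love_for_person": 5.0, "fit_formal_standard": 0.1},
    endorsements={"fit_formal_standard": {"love_for_person": -1.0}},
    levels={"love_for_person": 0, "fit_formal_standard": 1},
)
```

In this example the object-level weight drops only from 5.0 to about 4.5, so the higher position in the ordering does not translate into dominance. The level check rejects self-referential or cyclic references, which need separate treatment (see Section 4.5).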

2.7 Synthesising the preference function: meta-preferences about synthesis

In a special category are the meta-preferences about the synthesis process itself. For example, philosophers might want to give greater weight to higher order meta-preferences, or might value the simplicity of the whole $U_{H}$ .

One can deal with that by using the standard synthesis (of Section 2.4) to combine the method meta-preferences, then use this combination to change how standard preferences are synthesised. This old post has some examples of how this could be achieved. Note that these meta-preferences include preferences over using rationality to decide between lower-level preferences.

As long as there is an ordering of meta-preferences about synthesis, one can use the standard method to synthesise the highest level of meta-preferences, which then tells us how to synthesise the lower-level meta-preferences about synthesis, and so on.

Why use the standard synthesis method for these meta-preferences—especially if they contradict this synthesis method explicitly? There are three reasons for this:

These meta-preferences may be weakly weighted (hence weakly held), so they should not automatically overwhelm the standard synthesis process when applied to themselves (think of continuity as the weight of the meta-preference fades to zero).
Letting meta-preferences about synthesis determine how they themselves get synthesised leads to circular meta-preferences, which may cause problems (see Section 4.5).
The standard method is more predictable, which makes the whole process more predictable; self-reference, even if resolved, could lead to outcomes randomly far away from the intended one. Predictability could be especially important for “meta-preferences over outcomes” of the next section.

Note that these synthesis meta-preferences should be of a type that affects the synthesis of $U_{H}$ , not its final form. So, for example, “simple (meta-)preferences should be given extra weight in $U_{H}$ ” is valid, while “ $U_{H}$ should be simple” is not.

Thus, finally, we can combine everything (except for some self-referencing contradictory preferences) into one $U_{H}$ .

Note there are many degrees of freedom in how the synthesis could be carried out; it’s hoped that they don’t matter much, and that each of them will reach a $U_{H}$ that avoids disasters^[10] (see Section 2.8).

2.8 Avoiding disasters, and global meta-preferences

It is important that we don’t end up in some disastrous outcome; the very definition of a good human value theory requires this.

The approach has some in-built protection against many types of disasters. Part of that is that it can include very general and universal partial preferences, so any combination of “local” partial preferences must be compatible with these. For example, we might have a collection of preferences about autonomy, pain, and personal growth. It’s possible that, when synthesising these preferences together, we could end up with some “kill everyone” preference, due to bad extrapolation. However, if we have a strong “don’t kill everyone” preference, this will push the synthesis process away from that outcome.

So some disastrous outcomes of the synthesis should be avoided, precisely because all of $H$ ’s preferences are used, including those that would specifically label that outcome a disaster.

But, even if we included all of $H$ ’s preferences in the synthesis, we’d still want to be sure we’d avoided disasters.

In one sense, this requirement is trivially true and useful. But in another, it seems perverse and worrying—the $U_{H}$ is supposed to be a synthesis of true human preferences. By definition. So how could this $U_{H}$ be, in any sense, a disaster? Or a failure? What criteria—apart from our own preferences—could we use? And shouldn’t we be using these preferences in the synthesis itself?

The reason that we can talk about $U_{H}$ not being a disaster, is that not all our preferences can best be captured in the partial model formalism above. Suppose one fears a siren world or reassures oneself that we can never encounter an indescribable hellworld. Both of these could be clunkily transformed into standard meta-preferences (maybe about what some devil’s advocate AI could tell us?). But that somewhat misses the point. These top-meta-level considerations live most naturally at the top-meta-level: reducing them to the standard format of other preferences and meta-preferences risks losing the point. Especially when we only partially understand these issues, translating them to standard meta-preferences risks losing the understanding we do have.

So, it remains possible to say that $U_{H}$ is “good” or “bad”, using higher level considerations that are difficult to capture entirely within $U_{H}$ .

For example, there is an argument that human preference incoherence should not cost us much. If true, this argument suggests that overfitting to the details of human preferences is not as bad as we might fear. One could phrase this as a synthesis meta-preference allowing more over-fitting, but this doesn’t capture a coherent meaning of “not as bad”—which precludes the real point of this argument, which is “allow more overfitting if the argument holds”. To use that, we need some criteria for establishing “the argument holds”. This seems very hard to do within the synthesis process, but could be attempted as top-level meta-preferences.

We should be cautious and selective when using these top-level preferences in this way. This is not generally the point at which we should be adding preferences to $U_{H}$ ; that should be done when constructing $U_{H}$ . Still, if we have a small selection of criteria, we could formalise these and check ourselves whether $U_{H}$ satisfies them, or have an AI do so while synthesising $U_{H}$ . A Last Judge can be a sensible precaution (especially if there are more downsides to error than upsides to perfection).

Note that we need to distinguish between the global meta-preferences of the designers (us) and those of the subject $H$ . So, when designing the synthesis process, we should either allow options to be automatically changed by $H$ ’s global preferences, or be aware that we are overriding them with our own judgement (which may be inevitable, as most $H$ ’s have not thought deeply about preference synthesis; still, it is good to be aware of this issue).

This is also the level at which experimental testing of $U_{H}$ synthesis is likely to be useful—keeping in mind what we expect from $U_{H}$ synthesis, and running the synthesis in some complicated toy environments, we can see whether our expectations are correct. We may even discover extra top-level desiderata this way.

2.9 How much to delegate to the process

The method has two types of basic preferences (world-preferences and identity preferences). This is a somewhat useful division; but there are others that could have been used. Altruistic versus selfish versus anti-altruistic preferences is a division that was not used (though see Section 4.3). Moral preferences were not directly distinguished from non-moral preferences (though some human meta-preferences might make the distinction).

So, why divide preferences this way, rather than in some other way? The aim is to allow the process itself to take into account most of the divisions that we might care about; things that go into the model explicitly are structural assumptions that are of vital importance. So the division between world- and identity preferences was chosen because it seemed absolutely crucial to get that right (and to err on the side of caution in distinguishing the two, even if our own preferences don’t distinguish them as much). Similarly, the whole idea of meta-preferences seems a crucial feature of humans, which might not be relevant for general agents, so it was important to capture it. Note that meta-preferences are treated as a different type to standard preferences, with different rules; most distinctions built into the synthesis method should similarly be between objects of a different type.

But this is not set in stone; global meta-preferences (see Section 2.8) could be used to justify a different division of preference types (and different methods of synthesis). But it’s important to keep in mind what assumptions are being imposed from outside the process, and what the method is allowed to learn during the process.

3 $U_{H}$ in practice

3.1 Synthesis of $U_{H}$ in practice

If the definition of $U_{H}$ of the previous section could be made fully rigorous, and if the AI has a perfect model of $H$ ’s brain, knowledge of the universe, and unlimited computing power, it could construct $U_{H}$ perfectly and directly. This will almost certainly not be the case; so, do all these definitions give us something useful to work with?

It seems they do. Even extreme definitions can be approximated, hopefully to some good extent (and the theory allows us to assess the quality of the approximation, as opposed to another method without theory, where there is no meaningful measure of approximation ability). See Section 0.3 for an argument as to why even very approximate versions of $U_{H}$ could result in very positive outcomes: even an approximated $U_{H}$ rules out most bad AI failure scenarios.

In practical terms, the synthesis of $U_{H}$ from partial preferences seems quite robust and doable; it’s the definition of these partial preferences that seems tricky. One might be able to directly see the internal symbols in the human brain, with some future super-version of fMRI. Even without that direct input, having a theory of what we are looking for (partial preferences in partial models, with human symbols grounded) allows us to use results from standard and moral psychology. These results are insights into behaviour, but they are often also, at least in part, insights into how the human brain processes information. In Section 3.3, we’ll see how the definition of $U_{H}$ allows us to “patch” other, more classical methods of value alignment. But the converse is also true: with a good theory, we can use more classical methods to figure out $U_{H}$ . For example, if we see $H$ as being in a situation where they are likely to tell the truth about their internal model, then their stated preferences become good proxies for their internal partial preferences.

If we have a good theory for how human preferences change over time, then we can use preferences at time $t'$ as evidence for the hypothetical preferences at time $t$ . In general, more practical knowledge and understanding would lead to a better understanding of the partial preferences and how they change over time.

This could become an area of interesting research; once we have a good theory, it seems there are many different practical methods that suddenly become usable.

For example, it seems that humans model themselves and each other using very similar methods. This allows us to use our own judgement of irrationality and intentionality, to some extent, and in a principled way, to assess the internal models of other humans. As we shall see in Section 3.3, an awareness of what we are doing—using the similarity between our internal models and those of others—also allows us to assess when this method stops working, and patch it in a principled way.

In general, this sort of research would give results of the type “assuming this connection between empirical facts and internal models (an assumption with some evidence behind it), we can use this data to estimate internal models”.

3.2 (Avoiding) uncertainty and manipulative learning

There are arguments that, as long as we account properly for our uncertainty and fuzziness, there are no Goodhart-style problems in maximising an approximation to $U_{H}$ . This argument has been disputed, and there are ongoing debates about it.

With a good definition of what it means for the AI to influence the learning process, online learning of $U_{H}$ becomes possible, even for powerful AIs learning over long periods of time in which the human changes their views (either naturally or as a consequence of the AI’s actions).

Thus, we could construct an online version of inverse reinforcement learning without assuming rationality, where the AI learns about partial models and human behaviour simultaneously, constructing the $U_{H}$ from observations given the right data and the right assumptions.

3.3 Principled patching of other methods

Some of the theoretical ideas presented here can be used to improve other AI alignment ideas. This post explains one of the ways this can happen.

The basic idea is that there exist methods (stated preferences, revealed preferences, an idealised human reflecting for a very long time) that are often correlated with $U_{H}$ and with each other. However, all of the methods fail: stated preferences are often dishonest (the revelation principle doesn’t apply in the social world), revealed preferences assume a rationality that is often absent in humans (and some models of revealed preferences obscure how unrealistic this rationality assumption is), and humans that think for a long time risk value drift or a random walk to convergence.

Given these flaws, it is always tempting to patch the method: add caveats to get around the specific problem encountered. However, if we patch and patch until we can no longer think of any further problems, that doesn’t mean there are no further problems: simply that they are likely beyond our capacity to predict ahead of time. And, if all that it has is a list of patches, the AI is unlikely to be able to deal with these new problems.

However, if we keep the definition of $U_{H}$ in mind, we can come up with principled reasons to patch a method. For example, lying on stated preferences means a divergence between stated preferences and internal model; revealed preferences only reveal within the parameters of the partial model that is being used; and value drift is a failure of preference synthesis.

Therefore, each patch can have an explanation for the divergence between method and desired outcome. So, when the AI develops the method further, it can itself patch the method when it enters a situation where a similar type of divergence occurs. It has a reason for why these patches exist, and hence the ability to generate new patches efficiently.

3.4 Simplified $U_{H}$ sufficient for many methods

It’s been argued that many different methods rely upon, if not a complete synthesis $U_{H}$ , at least some simplified version of it. Corrigibility, low impact, and distillation/amplification all seem to be methods that require some simplified version of $U_{H}$ .

Similarly, some concepts that we might want to use or avoid—such as “manipulation” or “understanding the answer”—also may require a simplified utility function. If these concepts can be defined, then one can disentangle them from the rest of the alignment problem, allowing us to instructively consider situations where the concept makes sense.

In that case, a simplified or incomplete construction of $U_{H}$ , using some simplification of the synthesis process, might be sufficient for one of the methods or definitions just listed.

3.5 Applying the intuitions behind $U_{H}$ to analysing other situations

Finally, one could use the definition of $U_{H}$ as inspiration when analysing other methods, which could lead to interesting insights. See for example these posts on figuring out the goals of a hierarchical system.

4 Limits of the method

This section will look at some of the limitations and lacunae of the method described above. For some limitations, it will suggest possible ways of dealing with them; but these are, deliberately, chosen to be extras beyond the scope of the method, where synthesising $U_{H}$ is the whole goal.

4.1 Utility at one point in time

The $U_{H}$ is meant to be a synthesis of the current preferences and meta-preferences of the human $H$ , using one-step hypotheticals to fill out the definition. Human preferences are changeable on a short time scale, without us feeling that we become a different person. Hence it may make sense to replace $U_{H_{t}}$ with some average $U_{H}$ , averaged over a short (or longer) period of time. Shorter periods lead to more “overfitting” to momentary urges; longer periods allow more manipulation or drift.
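The averaging idea can be sketched very simply. The uniform average over a trailing window, the function name `time_averaged_utility`, and the snapshot data are assumptions for illustration; the agenda does not specify how the average should be taken.

```python
def time_averaged_utility(snapshots, window):
    """Average the utility functions observed within a trailing time window.

    snapshots: list of (time, utility_fn) pairs sorted by time, where each
        utility_fn maps an outcome to a number.
    window: how far back from the latest snapshot to look.
    Returns a function giving the (uniform, assumed) average utility.
    """
    latest_time = snapshots[-1][0]
    recent = [u for t, u in snapshots if latest_time - t <= window]

    def averaged(outcome):
        return sum(u(outcome) for u in recent) / len(recent)

    return averaged

# Hypothetical data: the value placed on "cake" spikes on the last day.
daily_values = [(0, {"cake": 0.2}), (1, {"cake": 0.3}), (2, {"cake": 0.9})]
snapshots = [(t, lambda outcome, v=v: v[outcome]) for t, v in daily_values]

momentary = time_averaged_utility(snapshots, window=0)("cake")
smoothed = time_averaged_utility(snapshots, window=2)("cake")
```

Here `momentary` tracks the spike on the last day, while `smoothed` damps it; widening the window further would in turn give earlier (possibly manipulated or drifted) states more influence, which is the trade-off described above.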

4.2 Not a philosophical ideal

The $U_{H}$ is also not a reflective equilibrium or other idealised distillation of what preferences should be. Philosophers will tend to have a more idealised $U_{H}$ , as will those who have reflected a lot and are more willing to be bullet swallowers/bullet biters. But that is because these people have strong meta-preferences that push in those idealised directions, so any honest synthesis of their preferences must reflect these.

Similarly, this $U_{H}$ is defined to be the preferences of some human $H$ . If that human is bigoted or selfish, their $U_{H}$ will be bigoted or selfish. In contrast, moral preferences that can be considered factually wrong will be filtered out by this construction. Similarly, preferences based on erroneous factual beliefs (“trees can think, so...”) will be removed or qualified (“if trees could think, then...”).

Thus if $H$ is wrong, the $U_{H}$ will not reflect that wrongness; but if $H$ is evil, then $U_{H}$ will reflect that evilness.

Also, the procedure will not distinguish between moral preferences and other types of preferences, unless the human themselves does.

4.3 Individual utility versus common utility

This research agenda will not look into how to combine the $U_{H}$ of different humans. One could simply weight the utilities according to some semi-plausible scale and add them together.

But we could do many other things as well. I’ve suggested removing anti-altruistic preferences before combining the $U_{H}$ ’s into some global utility function $U_{H}$ for all of humanity—or for all future and current sentient beings, or for all beings that could suffer, or for all physical entities.

There are strong game-theoretical reasons to remove anti-altruistic preferences. We might also add philosophical considerations (eg moral realism) or deontological rules (eg human rights, restrictions on copying themselves, extra weighting to certain types of preferences), either to the individual $U_{H}$ or when combining them, or prioritise moral preferences over other types. We might want to preserve the capacity for moral growth, somehow (see Section 4.6).

That can all be done, but is not part of this research agenda, whose sole purpose is to synthesise the individual $U_{H}$ ’s, which can then be used for other purposes.

4.4 Synthesising $U_{H}$ rather than discovering it (moral anti-realism)

The utility $U_{H}$ will be constructed, rather than deduced or discovered. Some moral theories (such as some versions of moral realism) posit that there is a (generally unique) $U_{H}$ waiting to be discovered. But none of these theories give effective methods for doing so.

In the absence of such a definition of how to discover an ideal $U_{H}$ , it would be highly dangerous to assume that finding $U_{H}$ is a process of discovery. Thus the whole method is constructive from the very beginning (and based on a small number of arbitrary choices).

Some versions of moral realism could make use of $U_{H}$ as a starting point of their own definition. Indeed, in practice, moral realism and moral anti-realism seem to be initially almost identical when meta-preferences are taken into account. Moral realists often have mental examples of what counts as “moral realism doesn’t work”, while moral anti-realists still want to simplify and organise moral intuitions. To a first approximation, these approaches can be very similar in practice.

4.5 Self-referential contradictory preferences

There remain problems with self-referential preferences—preferences that claim they should be given more (or less) weight than otherwise (eg “all simple meta-preferences should be penalised”). This was already observed in a previous post.

This includes formal Gödel-style problems, with preferences explicitly contradicting themselves, but those seem solvable—with one or another version of logical uncertainty.

More worrying, from the practical standpoint, is the human tendency to reject values imposed upon them, just because they are imposed upon them. This resembles a preference of the type “reject any $U_{H}$ computed by any synthesis process”. This preference is weakly existent in almost all of us, and a variety of our other preferences should prevent the AI from forcibly re-writing us to become $U_{H}$ -desiring agents.

So it remains not at all clear what happens when the AI says “this is what you really prefer” and we almost inevitably answer “no!”. This concept can be seen, in a sense, as a gratification: we’re not objecting to the outcome of the synthesis, per se, but to the way that the outcome was imposed on us.

Of course, since the $U_{H}$ is constructed rather than real, there is some latitude. It might be possible to involve the human in the construction process, in a way that increases their buy-in (thanks to Tim Genewein for the suggestion). Maybe the AI could construct the first $U_{H}$ , and refine it with further interactions with the human. And maybe, in that situation, if we are confident that $U_{H}$ is pretty safe, we’d want the AI to subtly manipulate the human’s preferences towards it.

4.6 The question of identity and change

It’s not certain that human concepts of identity can be fully captured by identity preferences and meta-preferences. In that case, it is important that human identity be figured out somehow, lest humanity itself vanish even as our preferences are satisfied. Nick Bostrom sketched how this might happen: in the mindless outsourcers scenario, humans outsource more and more of their key cognitive features to automated algorithms, until nothing remains of “them” any more.

Somewhat related is the fact that many humans see change and personal or moral growth as a key part of their identity. Can such a desire be accommodated, despite a likely stabilisation of values, without just becoming a random walk across preference space?

Some aspects of growth and change can be accommodated. Humans can certainly become more skilled, more powerful, and more knowledgeable. Since humans don’t distinguish well between terminal and instrumental goals, some forms of factual learning resemble moral learning (“if it turns out that anarchism results in the greatest flourishing of humanity, then I wish to be an anarchist; if not, then not”). If we take into account the preferences of all humans in some roughly equal way (see Section 4.3), then we can get “moral progress” without needing to change anyone’s individual preferences. Finally, professional roles, contracts, and alliances allow for behavioural changes (and sometimes values changes), in ways that maximise the initial values. Sort of like “if I do PR work for the Anarchist party, I will spout anarchist values” and “I agree to make my values more anarchist, in exchange for the Anarchist party shifting their values more towards mine”.

Beyond these examples, it gets trickier to preserve moral change. We might put a slider that makes our own values less instrumental or less selfish over time, but that feels like a cheat: we already know what we will be, we’re just taking the long route to get there. Otherwise, we might allow our values to change within certain defined areas. This would have to be carefully defined to prevent random change, but the main challenge is efficiency: changing values have an inevitable efficiency cost, so there needs to be strong positive pressure to preserve the changes—and not just preserve an unused “possibility for change”, but actual, efficiency-losing, changes. This “possibility for change” can be seen as a gratification: a cost we are willing to pay in terms of perfect efficiency, in order to have a process (continued moral learning) that we prefer.

This should be worth investigating more; it feels like these considerations need to be built into the synthesis process for this to work, rather than the synthesis project making them work itself (thus this kind of preference is one of the “Global meta-preferences about the outcome of the synthesis process”).

4.7 Other issues not addressed

There are other important issues that need to be solved to get a fully friendly AI, even if the research agenda works perfectly. They are, however, beyond the scope of this agenda; a partial list of these is:

Actually building the AI itself (left as an exercise to the reader).
Population ethics (though some sort of average of individual human population ethics might be doable with these methods).
Taking into account other factors than individual preferences.
Issues of ontology and ontology changes.
Mind crime (conscious suffering beings simulated within an AI system), though some of the work on identity preferences may help in identifying conscious minds.
Infinite ethics.
Definitions of counterfactuals or which decision theory to use.
Agent foundations, logical uncertainty, how to keep a utility stable.
Acausal trade.
Optimisation daemons/inner optimisers/emergent optimisation.

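Note that the Machine Intelligence Research Institute is working heavily on issues 7, 8, and 9 in the list above (counterfactuals and decision theory; agent foundations, logical uncertainty, and utility stability; acausal trade).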

[1] A partial preference being a preference where the human considers only a small part of the variables describing the universe; see Section 1.1.
[2] Actually, this specific problem is not included directly in the research agenda, though see Section 4.3.
[3] Likely but not certain: we don’t know how effective AIs might become at computing counterfactuals or modelling humans.
[4] It makes sense to allow partial preferences to contrast a small number of situations, rather than just two. So “when it comes to watching superhero movies, I’d prefer to watch them with Alan, but Beth will do, and definitely not with Carol”. Since partial preferences with $n$ situations can be built out of a smaller number of partial preferences with two situations, allowing more situations is a useful practical move, but doesn’t change the theory.
[5] “One-step” refers to hypotheticals that can be removed from the human’s immediate experience (“Imagine that you and your family are in space...”) but not very far removed (so no need for lengthy descriptions that could sway the human’s opinions by hearing them).
[6] Equivalently to reducing the weight, we could increase uncertainty about the partial preference, given the unfamiliarity. There are many options for formalisms that lead to the same outcome. Though note that here, we are imposing a penalty (low weight/high uncertainty) for unfamiliarity, whereas the actual human might have incredibly strong internal certainty in their preferences. It’s important to distinguish assumptions that the synthesis process makes, from assumptions that the human might make.
[7] Extreme situations are also situations where we have to be very careful to ensure the AI has the right model of all preference possibilities. The flaws of an incorrect model can be corrected by enough data, but when data is sparse and unreliable, then model assumptions, including the prior, tend to dominate the result.
[8] “Natural” does not, of course, mean any of “healthy”, “traditional”, or “non-polluting”. However those using the term “natural” are often assuming all of those.
[9] The human’s meta-preferences are also relevant to this. It might be that, whenever asked about this particular contradiction, the human would answer one way. Therefore $H$ ’s conditional meta-preferences may contain ways of resolving these contradictions, at least if the meta-preferences have high weight and the preferences have low weight.

Conditional meta-preferences can be tricky, though, as we don’t want them to allow the synthesis to get around the one-step hypotheticals restriction. An “if a long theory sounds convincing to me, I want to believe it” meta-preference would in practice do away with these restrictions. That particular meta-preference might be cancelled out by the ability of many different theories to sound convincing.
[10] We can allow meta-preferences to determine a lot more of their own synthesis if we find an appropriate method that a) always reaches a synthesis, and b) doesn’t artificially boost some preferences through a feedback effect.
