Resolving human values, completely and adequately

In previous posts, I’ve been assuming that human values are complete and consistent, but, finally, we are ready to deal with actual human values/​preferences/​rewards—the whole contradictory, messy, incoherent lot of them.

Define “completely resolving human values” as an AI extracting a consistent human reward function from the inconsistent data on the preferences and values of a single human (leaving aside the easier case of resolving conflicts between different humans). This post will look at how such a resolution could be done—or at least propose an initial attempt, to be improved upon.

Adequate versus elegant

Part of the problem with resolving human values is that people have been looking to do it too well and too elegantly. This results either in complete resolutions that ignore vast parts of the values (eg hedonic utilitarianism), or in thorough analyses of a tiny part of the problem (eg all the papers published on the trolley problem).

Incomplete resolutions are not sufficient to guide an AI, and elegant complete resolutions seem to be like most utopias: not any good for real humans.

Much better to aim for an adequate complete resolution. Adequacy means two things here:

  • It doesn’t lead to disastrous outcomes, according to the human’s current values.

  • If a human has a strong value or meta-value, that will strongly influence the ultimate human reward function, unless their other values point strongly in the opposite direction.

Aiming for adequacy is quite freeing, allowing you to go ahead and construct a resolution, which can then be tweaked and improved upon. It also opens up a whole new space of possible solutions. And, last but not least, any attempt to formalise and write down a solution gives a much better understanding of the problem.

Basic framework, then modifications

This post is a first attempt at constructing such an adequate complete resolution. Some of the details remain to be filled in, others will doubtless be changed; nevertheless, this first attempt should be instructive.

The resolution will be built in three steps:

  • a) It will provide a basic framework for resolving low level values, or meta-values of the same “level”.

  • b) It will extend this framework to account for some types of meta-values applying to lower level values.

  • c) It will then allow some meta-values to modify the whole framework.

Finally, the post will conclude with some types of meta-values that are hard to integrate into this framework.

1 Terminology and basic concepts

Let $H$ be a human, whose “true” values we are trying to elucidate. Let $\mathcal{E}$ be the set of possible environments (each including its transition rules), with $e \in \mathcal{E}$ the actual environment. And let $\mathcal{H}$ be the set of future histories that the human may encounter, from time $t$ onward (the human’s past history is seen as part of the environment).

Let $\mathcal{R}$ be a set of rewards. We’ll assume that $\mathcal{R}$ is closed under many operations—affine transformations (including negation), adding two rewards together, multiplying them together, and so on. For simplicity, assume that $\mathcal{R}$ is a real vector space, generated by a finite number of basis rewards.

Then define $V$ to be a set of potential values of $H$. This is defined to be all the value/​preference/​reward statements that $H$ might agree to, more or less strongly.

1.1 The role of the AI

The AI’s role is to elucidate how much the human actually accepts statements in $V$ (see for instance here and here). For any given $v \in V$, it will compute $w(v)$, the weight of the value $v$. For mental calibration purposes, assume that $w(v)$ is in the $0$ to $1$ range, and that if the human has no current opinion on $v$, then $w(v)$ is zero (the converse is not true: $w(v)$ could be zero because the human has carefully analysed $v$ but found it to be irrelevant or negative).

The AI will also compute $v(C)$, the endorsement that $v$ gives to any $C \in \mathcal{R} \cup V$. This measures the extent to which $v$ ‘approves’ or ‘disapproves’ of a certain reward or value (there is a reward normalisation issue which I’ll elide for now).

Object level values are those which are non-zero only on rewards; ie the $v$ for which $v(v') = 0$ for all $v' \in V$. To avoid the most obvious self-referential problem, any value’s self-endorsement is assumed to be zero (so $v(v) = 0$). As we will see below, positively endorsing a negative reward is not the same as negatively endorsing a positive reward: $v(-R) > 0$ does not mean the same thing as $v(R) < 0$.

Then this post will attempt to define the resolution function $\Theta$, which maps weights, endorsements, and the environment to a single reward function. So if $\mathcal{D}$ is the cross product of all possible weight functions, endorsement functions, and environments:

$$\Theta : \mathcal{D} \to \mathcal{R}.$$

In the following, we’ll also have need for a more general $\Theta$, and for special distributions over $\mathcal{H}$ dependent on a given $v \in V$; but we’ll define them as and when they are needed.
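To make this bookkeeping concrete, here is a minimal Python sketch (not from the original post) of the data the AI might track for each potential value statement: a weight, plus endorsements of rewards and of other values. The class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Value:
    """One potential value statement v in V, as tracked by the AI (illustrative only)."""
    name: str                          # eg "I like eating bacon"
    weight: float = 0.0                # w(v); zero if H has no current opinion
    reward_endorsements: Dict[str, float] = field(default_factory=dict)  # v(R) for rewards R
    value_endorsements: Dict[str, float] = field(default_factory=dict)   # v(v') for other values

    def is_object_level(self) -> bool:
        # Object-level values endorse only rewards, never other values.
        return all(e == 0.0 for e in self.value_endorsements.values())

v_B = Value("I like eating bacon", weight=0.6, reward_endorsements={"R_B": 1.0})
print(v_B.is_object_level())  # True: it endorses no other values
```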

2 The basic framework

In this section, we’ll introduce a basic framework for resolving rewards. This will involve making a certain number of arbitrary choices, choices that may then get modified in the next section.

This section will deal with the problems with human values being contradictory, underdefined, changeable, and manipulable. As a side effect, this will also deal with the fact that humans can make moral errors (and end up feeling their previous values were ‘wrong’), and that they can derive insights from philosophical thought experiments.

As an example, we’ll use a classic modern dilemma: whether to indulge in bacon or to keep slim.

So let there be two rewards, $R_B$, the bacon reward, and $R_S$, the slimness reward. Assume that if $H$ always indulges, $R_B = 1$ and $R_S = 0$, while if they never indulge, $R_B = 0$ and $R_S = 1$. There are various tradeoffs and gains from trade for intermediate levels of indulgence, the details of which are not relevant here.

Then define $\mathcal{R}$ to be the real vector space generated by $R_B$ and $R_S$.

2.1 Contradictory values

Define $v_B =$ {I like eating bacon}, and $v_S =$ {I want to keep slim}. Given the right normative assumptions, the AI can easily establish that $w(v_B)$ and $w(v_S)$ are both greater than zero. For example, it can note that the human sometimes does indulge, or desires to do so; but also the human feels sad about gaining weight, shame about their lack of discipline, and sometimes engages in anti-bacon precommitment activities.

The natural thing here is to weight $R_B$ by the weight of $v_B$ and the endorsement that $v_B$ gives to $R_B$ (and similarly with $v_S$ and $R_S$). This means that

$$\Theta = w(v_B)\, v_B(R_B)\, R_B + w(v_S)\, v_S(R_S)\, R_S.$$

Or, for the more general formula, with implicit uncurrying so as to write $\Theta$ as a function of two variables:

$$\Theta(w, V) = \sum_{v \in V} \sum_{R \in \mathcal{R}} w(v)\, v(R)\, R.$$

For this post, I’ll ignore the issue of whether that sum always converges (which it would almost certainly do, in practice).
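As a concrete illustration, here is a minimal Python sketch of this basic resolution for the bacon/slimness example. The numerical weights and endorsements are invented, and representing each reward by its coefficients over the basis rewards is a simplification of my own.

```python
# Theta = sum over values v and basis rewards R of w(v) * v(R) * R,
# with rewards represented as coefficient vectors over (R_B, R_S).

basis = ["R_B", "R_S"]

w = {"v_B": 0.6, "v_S": 0.8}                      # hypothetical weights w(v)
endorsement = {
    "v_B": {"R_B": 1.0, "R_S": 0.0},              # "I like eating bacon"
    "v_S": {"R_B": 0.0, "R_S": 1.0},              # "I want to keep slim"
}

def resolve(w, endorsement, basis):
    """Return Theta as a dict of coefficients over the basis rewards."""
    return {R: sum(w[v] * endorsement[v][R] for v in w) for R in basis}

print(resolve(w, endorsement, basis))             # {'R_B': 0.6, 'R_S': 0.8}
```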

2.2 Unendorsing rewards

I said there was a difference between a negative endorsement of $R$, and a positive endorsement of $-R$. A positive endorsement of $-R$ is just a value judgement that sees $-R$ as good, while a negative endorsement of $R$ just doesn’t want $R$ to appear at all.

For example, consider $v_N =$ {I’m not worried about my weight}. Obviously this has a negative endorsement of $R_S$, but it doesn’t have a positive endorsement of $-R_S$: it explicitly doesn’t have a desire to be fat, either. So the weight and endorsement of $v_N$ are fine when it comes to reducing the positive weight of $R_S$, but not when making a zero or negative weight more negative. To capture that, rewrite $\Theta$ as:

$$\Theta = \sum_{R \in \mathcal{R}} \max\Big(0, \sum_{v \in V} w(v)\, v(R)\Big)\, R.$$

Then the AI, to maximise $H$’s rewards, simply needs to follow the policy that maximises that reward.
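Continuing the same sketch, the asymmetry above can be captured by flooring each reward’s summed coefficient at zero, so that a value like {I’m not worried about my weight} can cancel the weight on $R_S$ without turning it into a desire to gain weight. The numbers, including the unendorsement of $-2$, are again purely illustrative.

```python
basis = ["R_B", "R_S"]

w = {"v_B": 0.6, "v_S": 0.8, "v_N": 0.5}
endorsement = {
    "v_B": {"R_B": 1.0, "R_S": 0.0},
    "v_S": {"R_B": 0.0, "R_S": 1.0},
    "v_N": {"R_B": 0.0, "R_S": -2.0},  # "I'm not worried about my weight" unendorses R_S
}

def resolve_clipped(w, endorsement, basis):
    """Each reward's summed coefficient is floored at zero before being used."""
    return {R: max(0.0, sum(w[v] * endorsement[v][R] for v in w)) for R in basis}

print(resolve_clipped(w, endorsement, basis))  # {'R_B': 0.6, 'R_S': 0.0}
```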

2.3 Underdefined rewards

Let’s now look at the problem of underdefined values. To illustrate, add the option of liposuction to the main model. If $H$ indulges in bacon, and undergoes liposuction, then both $R_B$ and $R_S$ can be set to $1$.

But $H$ might not want to undergo liposuction (assumed, in this model, to be costless). Let $R_L$ be the reward for no liposuction, $1$ if liposuction is avoided, and $0$ if it happens, and let $v_L =$ {I want to avoid liposuction}. Extend $\mathcal{R}$ to the vector space generated by $R_B$, $R_S$, and $R_L$.

Because $H$ hasn’t thought much about liposuction, they currently have $w(v_L) = 0$. But it’s possible they would have firm views on it after some reflection. If so, it would be good to use those views now. When humans haven’t thought about values, there are many ways they can develop them, depending on how the issue is presented to them and how it interacts with their categories, social circles, moral instincts, and world models.

For example, assume that the AI can figure out that, if $H$ is given a description of liposuction that starts with “lazy people can cheat by...”, then they will be against it: $w(v_L)$ will be greater than zero. However, if they are given a description that starts with “efficient people can optimise by...”, then they will be in favour of it, and $w(v_L)$ will be zero.

If $w_t(v_L)$ is the weight of $v_L$ at future time $t$, given the future history $h$, define the discounted future weight as

$$w_h(v_L) = (1 - \gamma) \sum_{t} \gamma^{t}\, w_t(v_L),$$

for, say, a discount rate $\gamma$ close to $1$ if $t$ is denominated in days. If $h$ is the history with the “lazy” description, this will be greater than zero. If it’s the history with the “efficient” description, it will be close to zero.

We’d like to use the expected value of $w_h(v_L)$, but there are two problems with that. The first is that many possible futures might involve no reflection about $v_L$ on the part of $H$. We don’t care about these futures. The other is that these futures depend on the actions of the AI, so that it can manipulate the human’s future values.

So define $\mathcal{H}_{v_L}$, a subset of the set of histories $\mathcal{H}$. This subset is defined firstly so that $H$ will have relevant opinions about $v_L$: they won’t be indifferent to it. Secondly, these are futures on which the human is allowed to develop their values ‘naturally’, without undue rigging and influence on the part of the AI (see this for an example of such a distribution). Note that these need not be histories which will actually happen, just future histories which the AI can estimate. Let $P_{v_L}$ be the probability distribution over future histories, restricted to $\mathcal{H}_{v_L}$ (this requires that the AI pick a sensible probability distribution over its own future policies, at least for the purpose of computing this probability distribution).

Note that the exact definitions of $\mathcal{H}_{v}$ and $P_{v}$ are vitally important and still need to be fully established. That is a critical problem I’ll be returning to in the future.

Laying that aside for the moment, we can define the expected relevant weight:

$$\hat{w}(v_L) = \mathbb{E}_{h \sim P_{v_L}}\big[w_h(v_L)\big].$$

Then the formula for $\Theta$ becomes:

$$\Theta = \sum_{R \in \mathcal{R}} \max\Big(0, \sum_{v \in V} \hat{w}(v)\, v(R)\Big)\, R,$$

using $\hat{w}$ instead of $w$.
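Here is a minimal sketch of how the discounted future weight and the expected relevant weight might be computed, assuming each allowed history in $\mathcal{H}_{v_L}$ comes with a probability under $P_{v_L}$ and a per-day weight trajectory. The discount factor of 0.99 and the example trajectories are assumptions made purely for illustration.

```python
GAMMA = 0.99  # assumed discount factor, with t denominated in days

def discounted_weight(weight_trajectory, gamma=GAMMA):
    """w_h(v): discounted, (1 - gamma)-normalised sum of v's weight along one history h."""
    return (1 - gamma) * sum(gamma ** t * w_t for t, w_t in enumerate(weight_trajectory))

def expected_relevant_weight(allowed_futures):
    """hat-w(v): expectation of w_h(v) over (probability, trajectory) pairs from P_v."""
    return sum(p * discounted_weight(traj) for p, traj in allowed_futures)

# Two hypothetical allowed futures for v_L = {I want to avoid liposuction}:
lazy_future = (0.5, [0.0] * 10 + [0.8] * 990)   # hears the "lazy" framing, turns against it
efficient_future = (0.5, [0.0] * 1000)          # hears the "efficient" framing, stays indifferent

print(expected_relevant_weight([lazy_future, efficient_future]))  # roughly 0.36
```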

2.4 Moral errors and moral learning

The above was designed to address underdefined values, but it actually does much more than that. It deals with changeable values, and addresses moral errors and moral learning.

An example of moral error is thinking that you want something, but, upon achieving it, finding that you don’t. Let us examine $v_S$, the desire to be slim. People don’t generally have a strong intrinsic desire for slimness just for the sake of it; instead, they might strive for this because they think it will make them healthier, happier, might increase their future status, might increase their self-discipline in general, or something similar.

So we could replace $v_S$ with $v_X =$ {I desire X}, where X is something that $H$ believes will come out of slimming.

When computing $\hat{w}(v_S)$ and $\hat{w}(v_X)$, the AI will test how $H$ reacts to achieving slimness, or achieving X, and ultimately compute a low $\hat{w}(v_S)$ but a high $\hat{w}(v_X)$. This would be even more the case if $\mathcal{H}_{v_S}$ is allowed to contain impossible future histories, such as hypotheticals where the human miraculously slims without achieving X, or vice-versa.

The use of $\hat{w}$ also picks up systematic, predictable moral change. For example, the human may be currently committed to a narrative that sees themselves as a disciplined, stereotypically rational being that will overcome their short-term weaknesses. Their weight $w(v_S)$ is high. However, the AI knows that trying to slim will be unpleasant for $H$, and that they will soon give up as the pain mounts, and change their narrative to one where they accept and balance their own foibles. So the expected $\hat{w}(v_S)$ is low, under most reasonable futures where humans cannot control their own value changes (this has obvious analogies with major life changes, such as loss of faith or changes in political outlook).

Then there is the third case, where strongly held values may end up being incoherent (as I argued is the case for the ‘purity’ moral foundation). Suppose the human deeply believes $v_P =$ {Humans have souls and pigs don’t, so it’s ok to eat pigs, but not ok to defile the human form with liposuction}. This value would thus endorse $R_B$ and $R_L$. But it’s also based on false facts.

There seem to be three standard ways to resolve this. Replacing “soul” with, say, “mind capable of complex thought and ability to suffer”, they may shift to $v_{P1} =$ {I should not eat pigs}. Or, if they go for “humans have no souls, so ‘defilement’ makes no sense”, they may embrace $v_{P2} =$ {All human enhancements are fine}. Or, as happens often in the real world when people can’t justify their values, they may shift their justification but preserve the basic value: $v_{P3} =$ {It is natural and traditional and therefore good to eat pig, and avoid liposuction}.

Now, I feel $v_{P3}$ is probably incoherent as well, but there is no lack of coherent-but-arbitrary reasons to eat pigs and avoid liposuction, so some value set similar to that is plausible.

Then a suitably defined $\mathcal{H}_{v}$ would allow the AI to figure out which way the human wants to update their values for $v_P$, $v_{P1}$, $v_{P2}$, and $v_{P3}$, as the human moves away from the incorrect initial value to one of the other alternatives.

2.5 Automated philosophy and CEV

The use of $\mathcal{H}_{v}$ also allows one to introduce philosophy to the mix. One simply needs to include in $\mathcal{H}_{v}$ the presentation of philosophical thought experiments to $H$, and $H$’s reaction and updating. Similarly, one can do the initial steps of coherent extrapolated volition, by including futures where $H$ changes themselves in the desired direction. This can be seen as automating some of philosophy (this approach has nothing to say about epistemology and ontology, for instance).

Indeed, you could define philosophers as people with particularly strong philosophical meta-values: that is, putting a high premium on philosophical consistency, simplicity, and logic.

The more weight is given to philosophy or to frameworks like CEV, the more elegant and coherent the resulting resolution is, but the higher the risk of it going disastrously wrong by losing key parts of human values—we risk running into the problems detailed here and here.

2.6 Meta-values

We’ll conclude this section by looking at how one can apply the above framework to meta-values. These are values that have non-zero endorsements of other values, ie $v(v') \neq 0$ for some $v' \in V$.

The previous $v_{P2} =$ {All human enhancements are fine} could be seen as a meta-value, one that unendorses the anti-liposuction value $v_L$, so $v_{P2}(v_L) < 0$. Or we might have one that unendorses short-term values: $v_{lt} =$ {Short-term values are less important}, with $v_{lt}(v_B) < 0$.

The problem with these comes when values start referring to values that refer back to them. This allows indirect self-reference, with all the trouble that that brings.

Now, there are various tools for dealing with self-reference or circular reasoning—Paul Christiano’s probabilistic self-reference, and Scott Aaronson’s Eigenmorality are obvious candidates.

But in the spirit of adequacy, I’ll directly define a method that can take all these possibly self-referential values and resolve them. Those who are not interested in the maths can skip to the next section; there is no real insight here.

Let $n = |V|$, and let $\sigma$ be an ordering (or a permutation) of $V$, ie a bijective map from $\{1, 2, \ldots, n\}$ to $V$. Then recursively define $w_\sigma$ by $w_\sigma(\sigma(1)) = \hat{w}(\sigma(1))$, and

$$w_\sigma(\sigma(k)) = \max\Big(0,\ \hat{w}(\sigma(k)) + \sum_{j < k} w_\sigma(\sigma(j))\, \big[\sigma(j)\big](\sigma(k))\Big).$$

Thus each $w_\sigma(\sigma(k))$ is the sum of the actual weight $\hat{w}(\sigma(k))$, plus the $w_\sigma$-adjusted endorsements of the values preceding it (in the ordering), with the zero lower bound. By averaging across the set $S_V$ of all permutations of $V$, we can then define:

$$w_V(v) = \frac{1}{n!} \sum_{\sigma \in S_V} w_\sigma(v).$$

Then, finally, for resolving the reward, we can use these weights in the standard reward function:

$$\Theta = \sum_{R \in \mathcal{R}} \max\Big(0, \sum_{v \in V} w_V(v)\, v(R)\Big)\, R.$$
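This construction is straightforward to implement when $V$ is small. Below is a minimal Python sketch: for each ordering of the values it builds $w_\sigma$ recursively, then averages over all orderings. The example weights and endorsements are hypothetical, with $v_{P2}$ unendorsing the anti-liposuction value $v_L$.

```python
from itertools import permutations

hat_w = {"v_B": 0.6, "v_L": 0.4, "v_P2": 0.3}
endorse = {                                          # endorse[u][v] is u's endorsement of v
    "v_B":  {"v_B": 0.0, "v_L": 0.0, "v_P2": 0.0},   # object-level
    "v_L":  {"v_B": 0.0, "v_L": 0.0, "v_P2": 0.0},   # object-level
    "v_P2": {"v_B": 0.0, "v_L": -1.0, "v_P2": 0.0},  # "All human enhancements are fine"
}

def weights_for_ordering(ordering):
    """w_sigma: each value gets its own hat-w plus the w_sigma-adjusted endorsements
    of the values preceding it in the ordering, floored at zero."""
    w_sigma = {}
    for k, v in enumerate(ordering):
        adjustment = sum(w_sigma[u] * endorse[u][v] for u in ordering[:k])
        w_sigma[v] = max(0.0, hat_w[v] + adjustment)
    return w_sigma

def averaged_weights(values):
    """w_V: average of w_sigma over all orderings of V."""
    perms = list(permutations(values))
    return {v: sum(weights_for_ordering(p)[v] for p in perms) / len(perms) for v in values}

print(averaged_weights(list(hat_w)))
```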

3 The “wrong” $\Theta$: meta-values for the resolution process

The $\Theta$ of the previous section is sufficient to resolve the values of an $H$ which has no strong feelings on how those values should be resolved.

But many may find it inadequate: filled with arbitrary choices, doing too much by hand/​fiat, or doing too little. So the next step is to let $H$’s values affect how $\Theta$ itself works.

Define $\Theta_0$ as the framework constructed in the previous section. And let $\mathcal{P}$ be the set of all such possible resolution frameworks. We now extend the endorsements so that each $v$ can endorse or unendorse not only elements of $\mathcal{R}$ and $V$, but also of $\mathcal{P}$.

Then we can define

$$\Theta_{\mathcal{P}} = \sum_{\theta \in \mathcal{P}} \max\Big(0, \sum_{v \in V} w_V(v)\, v(\theta)\Big)\, \theta,$$

and define $\Theta$ itself as

$$\Theta = \Theta_0 + \Theta_{\mathcal{P}}.$$

These formulas make sense, since the various elements of $\mathcal{P}$ take values in $\mathcal{R}$, which can be summed. Also, because we can multiply a reward by a positive scalar, there is no need for renormalising or weighting in these summing formulas.
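Under the reading above, a minimal sketch of this extension might look as follows: each candidate framework maps the weights and endorsements to a reward, and the final $\Theta$ adds $\Theta_0$ to the positively-weighted frameworks that the values endorse. The alternative framework theta_strict, and all of the numbers, are hypothetical, introduced only to show the mechanism.

```python
def theta_0(hat_w, endorse, basis):
    """The section-2 framework: clipped, weighted endorsements of each basis reward."""
    return {R: max(0.0, sum(hat_w[v] * endorse[v].get(R, 0.0) for v in hat_w))
            for R in basis}

def theta_strict(hat_w, endorse, basis):
    """A hypothetical alternative framework that ignores weakly-held values."""
    return {R: max(0.0, sum(hat_w[v] * endorse[v].get(R, 0.0)
                            for v in hat_w if hat_w[v] >= 0.5))
            for R in basis}

frameworks = {"theta_0": theta_0, "theta_strict": theta_strict}

def resolve_with_framework_endorsements(hat_w, endorse, framework_endorse, basis):
    """Theta = Theta_0 plus each endorsed framework, weighted by its (clipped) endorsement."""
    total = theta_0(hat_w, endorse, basis)
    for name, theta in frameworks.items():
        weight = max(0.0, sum(hat_w[v] * framework_endorse.get(v, {}).get(name, 0.0)
                              for v in hat_w))
        if weight > 0.0:
            contribution = theta(hat_w, endorse, basis)
            total = {R: total[R] + weight * contribution[R] for R in basis}
    return total

# Hypothetical example: a value that mildly endorses the "strict" framework.
hat_w = {"v_B": 0.6, "v_S": 0.3, "v_meta": 0.5}
endorse = {"v_B": {"R_B": 1.0}, "v_S": {"R_S": 1.0}, "v_meta": {}}
framework_endorse = {"v_meta": {"theta_strict": 1.0}}
print(resolve_with_framework_endorsements(hat_w, endorse, framework_endorse, ["R_B", "R_S"]))
```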

Now, this is not a complete transformation of $\Theta$ according to $H$’s values—for example, there is no place for these values to change the computation of the weights, which are computed according to the previously defined procedure for $\Theta_0$.

But I won’t worry about that for the moment, though I’ll undoubtedly return to it later. First of all, I very much doubt that many humans have strong intuitions about the correct method for resolving contradictions among the different ways of designing a resolution system for mapping most values and meta-values to a reward. And if someone does have such a meta-value, I’d wager it’ll be mostly to benefit a specific object level value or reward, so it’s more instructive to look at the object level.

But the real reason I won’t dig too much into those issues for the moment is that the next section demonstrates that there are problems with fully self-referential ways of resolving values. I’d like to understand and solve those before getting too meta on the resolution process.

4 Problems with self-referential Θ

Here I’ll look at some of the problems that can occur with fully self-referential Θ and/​or v. The presentation will be more informal, since I haven’t defined the language or the formalism to allow such formulation yet.

4.1 All-or-nothing values, and personal identity

Some values put a high premium on simplicity, or on defining the whole of the relevant part of Θ. For example, the paper “An impossibility theorem in population axiology...” argues that total utilitarianism is the only theory that avoids a series of counter-intuitive problems.

Now, I’ve disagreed that these problems are actually problems. But some people’s intuitions strongly disagree with me, and feel that total utilitarianism is justified by these arguments. Indeed, I get the impression that, for some people, even a small derogation from total utilitarianism is bad: they strongly prefer 100% total utilitarianism to 99.99% total utilitarianism + 0.01% something else.

This could be encoded as a value {I value having a simple population ethics}. This would provide a bonus based on the overall simplicity of the image of Θ. To do this, we have introduced personal identity (an issue which I’ve argued is unresolved in terms of reward functions), as well as values about the image of Θ.

Population ethics feels like an abstract high-level concept, but here is a much more down-to-earth version. When the AI looks forwards, it extrapolates the weight of certain values based on the expected weight in the future. What if the AI extrapolates that the weight of some value $v$ will be either $0$ or $1$ in the future, with equal probability? It then reasonably sets $\hat{w}(v)$ to $1/2$.

But the human will live in one of those futures. The AI will be maximising their ‘true goals’, which include $v$ at weight $1/2$, while $H$ is forced into extreme values of $w(v)$ ($0$ or $1$) which do not correspond to the value the AI is currently maximising. So {I want to agree with the reward that Θ computes} is a reasonable meta-value, one that would reward closeness between expected future values and actual future values.

In that case, one thing the AI would be motivated to do is to manipulate $H$ so that they have the ‘right’ weights in the future. But this might not always be possible. And $H$ might see that as a dubious thing to do.

Note here that this is not a problem of desiring personal moral growth in the future. Assuming that such growth can be defined, the AI can then grant it. The problem would be wanting personal moral growth and wanting the AI to follow the values that emerge from this growth.

4.2 You’re not the boss of me!

For self-reference, we don’t need Gödel or Russell. There is a much simpler, more natural self-reference paradox lurking here, one that is very common in humans: the urge to not be told what to do.

If the AI computes Θ, there are many humans who would, on principle, declare and decide that their reward was something other than Θ. This could be a value {Θ incorrectly computes my values}. I’m not sure how to resolve this problem, or even if it’s much of a problem (if the human will disagree equally no matter what, then we may as well ignore that disagreement; and if they disagree to different degrees in different circumstances, this gives something to minimise and trade off against other values). But I’d like to understand and formalise this better.

5 Conclusion: much more work

I hope this post demonstrates what I am hoping to achieve, and how we might start going about it. Combining this resolution project with the means of extracting human values would then allow the Inverse Reinforcement Learning project to succeed in full generality: we could then have the AI deduce human values from observation, and then follow them. This seems like a potential recipe for a Friendly-ish AI.