Resolving human values, completely and adequately

In previous posts, I’ve been assuming that human values are complete and consistent, but, finally, we are ready to deal with actual human values/preferences/rewards—the whole contradictory, messy, incoherent mess of them.
Define “completely resolving human values” as an AI extracting a consistent human reward function from the inconsistent data on the preferences and values of a single human (leaving aside the easier case of resolving conflicts between different humans). This post will look at how such resolutions could be done—or at least propose an initial attempt, to be improved upon.
Adequate versus elegant
Part of the problem in resolving human values is that people have been trying to do it too well and too elegantly. This results either in complete resolutions that ignore vast parts of the values (eg hedonic utilitarianism), or in thorough analyses of a tiny part of the problem (eg all the papers published on the trolley problem).
Incomplete resolutions are not sufficient to guide an AI, and elegant complete resolutions seem to be like most utopias: not any good for real humans.
Much better to aim for an adequate complete resolution. Adequacy means two things here:
It doesn’t lead to disastrous outcomes, according to the human’s current values.
If a human has a strong value or meta-value, that will strongly influence the ultimate human reward function, unless their other values point strongly in the opposite direction.
Aiming for adequacy is quite freeing, allowing you to go ahead and construct a resolution, which can then be tweaked and improved upon. It also opens up a whole new space of possible solutions. And, last but not least, any attempt to formalise and write a solution gives a much better understanding of the problem.
Basic framework, then modifications
This post is a first attempt at constructing such an adequate complete resolution. Some of the details will remain to be filled in, others will doubtlessly be changed; nevertheless, this first attempt should be instructive.
The resolution will be built in three steps:
a) It will provide a basic framework for resolving low level values, or meta-values of the same “level”.
b) It will extend this framework to account for some types of meta-values applying to lower level values.
c) It will then allow some meta-values to modify the whole framework.
Finally, the post will conclude with some types of meta-values that are hard to integrate into this framework.
1 Terminology and basic concepts
Let H be a human, whose “true” values we are trying to elucidate. Let M be the set of possible environments (including their transition rules), with μ∈M the actual environment. And let H be the set of future histories that the human may encounter, from time t=0 onward (the human’s past history is seen as part of the environment).
Let R={R:H→R} be a set of rewards. We’ll assume that R is closed under many operations—affine transformations (including negation), adding two rewards together, multiplying them together, and so on. For simplicity, assume that R is a real vector space, generated by a finite number of basis rewards.
Then define V to be a set of potential values of H. This is defined to be all the value/preference/reward statements that H might agree to, more or less strongly.
1.1 The role of the AI
The AI’s role is to elucidate how much the human actually accepts statements in V (see for instance here and here). For any given v∈V, it will compute wH(v)≥0, the weight of the value v. For mental calibration purposes, assume that wH(v) lies in the 0 to 100 range, and that if the human has no current opinion on v, then wH(v) is zero (the converse is not true: wH(v) could be zero because the human has carefully analysed v but found it to be irrelevant or negative).
The AI will also compute θ(v)∈[−1,1]^(R∪V)={f:R∪V→[−1,1]}, the endorsement of v. This measures the extent to which v ‘approves’ or ‘disapproves’ of a certain reward or value (there is a reward normalisation issue which I’ll elide for now).
Object level values are those which are non-zero only on rewards; ie the v∈V for which θ(v)(v′)=0 for all v′∈V. To avoid the most obvious self-referential problem, any value’s self-endorsement is assumed to be zero (so θ(v)(v)=0). As we will see below, positively endorsing a negative reward is not the same as negatively endorsing a positive reward: θ(v)(−R)=a does not mean the same thing as θ(v)(R)=−a.
Then this post will attempt to define the resolution function Θ, which maps weights, endorsements, and the environment to a single reward function. So if O=[0,100]^V×[−1,1]^(R∪V)×M is the cross product of all possible weight functions, endorsement functions, and environments:
Θ:O→R.
In the following, we’ll also have need for a more general θ, and for special distributions over H dependent on a given v∈V; but we’ll define them as and when they are needed.
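To make the objects above concrete, here is a minimal, purely illustrative sketch of how wH and θ might be represented in code; the class and field names are my own assumptions, not part of the formalism:

```python
# A minimal sketch of the w_H and theta objects from section 1.1.
# A value carries a weight in [0, 100] (0 = no current opinion) and an
# endorsement function over rewards and other values, taking values in [-1, 1].

from dataclasses import dataclass, field

@dataclass
class Value:
    name: str
    weight: float = 0.0                               # w_H(v)
    endorsements: dict = field(default_factory=dict)  # theta(v): target name -> [-1, 1]

    def endorse(self, target_name: str) -> float:
        # theta(v)(target); self-endorsement is forced to zero (section 1.1)
        if target_name == self.name:
            return 0.0
        return self.endorsements.get(target_name, 0.0)

v1 = Value("likes_bacon", weight=60.0, endorsements={"R_bacon": 1.0})
assert v1.endorse("R_bacon") == 1.0
assert v1.endorse("likes_bacon") == 0.0   # no self-endorsement
```

Object-level values are then simply those whose `endorsements` dictionary mentions only rewards, never other values.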
2 The basic framework
In this section, we’ll introduce a basic framework for resolving rewards. This will involve making a certain number of arbitrary choices, choices that may then get modified in the next section.
This section will deal with the problems with human values being contradictory, underdefined, changeable, and manipulable. As a side effect, this will also deal with the fact that humans can make moral errors (and end up feeling their previous values were ‘wrong’), and that they can derive insights from philosophical thought experiments.
As an example, we’ll use a classic modern dilemma: whether to indulge in bacon or to keep slim.
So let there be two rewards, Rb the bacon reward, and Rs, the slimness reward. Assume that if H always indulges, Rb=1 and Rs=0, while if they never indulge, Rb=0 and Rs=1. There are various tradeoffs and gains from trade at intermediate levels of indulgence, the details of which are not relevant here.
Then define R={aRb+bRs|a,b∈R}.
2.1 Contradictory values
Define v1,v2∈V by v1= {I like eating bacon}, and v2= {I want to keep slim}. Given the right normative assumptions, the AI can easily establish that wH(v1) and wH(v2) are both greater than zero. For example, it can note that the human sometimes does indulge, or desires to do so; but also the human feels sad about gaining weight, shame about their lack of discipline, and sometimes engages in anti-bacon precommitment activities.
The natural thing here is to weight Rb by the weight of v1 and the endorsement that v1 gives to Rb (and similarly with v2 and Rs). This means that
Θ(wH,θ,μ)=wH(v1)θ(v1)(Rb)Rb+wH(v2)θ(v2)(Rs)Rs.
Or, for the more general formula, with implicit uncurrying so as to write θ as a function of two variables:
Θ(wH,θ,μ)=∑R∈R,v∈VwH(v)θ(v,R)R.
For this post, I’ll ignore the issue of whether that sum always converges (which it would almost certainly do, in practice).
2.2 Unendorsing rewards
I said there was a difference between a negative endorsement of R, and a positive endorsement of −R. A positive endorsement is just a value judgement that sees −R as good, while the negative endorsement just doesn’t want R to appear at all.
For example, consider v3= {I’m not worried about my weight}. Obviously this has a negative endorsement of Rs, but it doesn’t have a positive endorsement of −Rs: it doesn’t express a desire to be fat, either. So the weight and endorsement of v3 are fine when it comes to reducing the positive weight of Rs, but not when it comes to making a zero or negative weight more negative. To capture that, rewrite Θ as:
Θ(wH,θ,μ)=∑R∈Rmax(0,∑v∈VwH(v)θ(v,R))R.
Then the AI, to maximise H’s rewards, simply needs to follow the policy that maximises that reward.
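As a hedged sketch of this clipped resolution formula, using the bacon/slimness example (the weight and endorsement numbers here are invented for illustration):

```python
# Theta(w_H, theta, mu) = sum over rewards R of max(0, sum_v w_H(v) * theta(v, R)) * R.
# The resolved reward is represented by its coefficients over the basis rewards.

def resolve(values, reward_names):
    """Return the coefficient of each basis reward in the resolved reward."""
    coeffs = {}
    for r in reward_names:
        total = sum(w * endorsements.get(r, 0.0) for (w, endorsements) in values)
        coeffs[r] = max(0.0, total)  # clipping: unendorsement can cancel a reward, not reverse it
    return coeffs

# (weight, endorsement-dict) pairs for v1, v2, v3; numbers illustrative only
values = [
    (60.0, {"R_b": 1.0}),    # v1: I like eating bacon
    (40.0, {"R_s": 1.0}),    # v2: I want to keep slim
    (50.0, {"R_s": -1.0}),   # v3: I'm not worried about my weight (unendorses R_s)
]
print(resolve(values, ["R_b", "R_s"]))  # {'R_b': 60.0, 'R_s': 0.0}
```

Without the `max(0, ...)` clipping, v3 would push the Rs coefficient to −10, ie an active desire to gain weight, which is exactly what the text above rules out.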
2.3 Underdefined rewards
Let’s now look at the problem of underdefined values. To illustrate, add the option of liposuction to the main model. If H indulges in bacon, and undergoes liposuction, then both Rb and Rs can be set to 1.
But H might not want to undergo liposuction (assumed, in this model, to be costless). Let Rn be the reward for no liposuction, Rn=0 if liposuction is avoided, and Rn=−1 if it happens, and let v4= {I want to avoid liposuction}. Extend R to {aRb+bRs+cRn|a,b,c∈R}.
Because H hasn’t thought much about liposuction, they currently have wH(v4)=0. But it’s possible they may have firm views on it after some reflection. If so, it would be good to use those views now. When humans haven’t thought about values, there are many ways they can develop them, depending on how the issue is presented to them and how it interacts with their categories, social circles, moral instincts, and world models.
For example, assume that the AI can figure out that, if H is given a description of liposuction that starts with “lazy people can cheat by...”, then they will be against it: wH(v4) will be greater than zero. However, if they are given a description that starts with “efficient people can optimise by...”, then they will be in favour of it, and wH(v4) will be zero.
If wH(v4)(t,h) is the weight of v4 at future time t, given the future history h∈H, define the discounted future weight as
w∞H(v4,h)=(∫∞0 γ^t wH(v4)(t,h) dt)/(∫∞0 γ^t dt),
for, say, γ=999/1000 if t is denominated in days. If h is the history with the “lazy” description, this will be greater than zero. If it’s the history with the “efficient” description, it will be close to zero.
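A small sketch of this discounted weight, approximating the integrals as daily sums; the weight trajectories below are invented for illustration:

```python
# Discounted future weight: w_inf(v, h) = integral(gamma^t * w(t,h) dt) / integral(gamma^t dt),
# approximated here as a discrete daily sum over a finite horizon (an assumption of this sketch).

def discounted_weight(daily_weights, gamma=0.999):
    num = sum((gamma ** t) * w for t, w in enumerate(daily_weights))
    den = sum(gamma ** t for t in range(len(daily_weights)))
    return num / den

# "Lazy" framing: H settles on an anti-liposuction weight of 30 after ten days of reflection
lazy = [0.0] * 10 + [30.0] * 990
# "Efficient" framing: H never develops the anti-liposuction value
efficient = [0.0] * 1000

assert discounted_weight(lazy) > 25.0          # close to the settled weight of 30
assert discounted_weight(efficient) == 0.0
```

With γ=0.999 and a daily timestep, the first ten weightless days barely dilute the settled weight, so the "lazy" history resolves to nearly 30 while the "efficient" one stays at zero.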
We’d like to use the expected value of w∞H(v4), but there are two problems with that. The first is that many possible futures might involve no reflection about v4 on the part of H; we don’t care about these futures. The second is that these futures depend on the actions of the AI, which could therefore manipulate the human’s future values.
So define Hv4, a subset of the set of histories H. This subset is defined firstly so that H will have relevant opinions about v4: they won’t be indifferent to it. Secondly, these are futures in which the human is allowed to develop their values ‘naturally’, without undue rigging and influence on the part of the AI (see this for an example of such a distribution). Note that these need not be histories which will actually happen, just future histories which the AI can estimate. Let pv4 be the probability distribution over future histories, restricted to Hv4 (this requires that the AI pick a sensible probability distribution over its own future policy, at least for the purpose of computing this probability distribution).
Note that the exact definitions of Hv4 and pv4 are vitally important and still need to be fully established. That is a critical problem I’ll be returning to in the future.
Laying that aside for the moment, we can define the expected relevant weight:
WH(v4)=Epv4[w∞H(v4,h)]=∑h∈Hv4 pv4(h) (∫∞0 γ^t wH(v4)(t,h) dt)/(∫∞0 γ^t dt).
Then the formula for Θ becomes:
Θ(wH,θ,μ)=∑R∈R max(0, ∑v∈V WH(v)θ(v,R)) R,
using WH instead of wH.
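The expected relevant weight can be sketched as a Monte Carlo average over sampled relevant histories. The sampling distribution below is a pure stand-in, since defining Hv4 and pv4 properly is exactly the open problem flagged above:

```python
# W_H(v) = E_{p_v}[w_inf(v, h)], estimated by averaging the discounted weight
# over histories sampled from a stand-in for the distribution p_v.

import random

def discounted_weight(weights_over_time, gamma=0.999):
    num = sum((gamma ** t) * w for t, w in enumerate(weights_over_time))
    den = sum(gamma ** t for t in range(len(weights_over_time)))
    return num / den

def expected_relevant_weight(sample_history, n_samples=200, seed=0):
    """Monte Carlo estimate of W_H(v) over sampled relevant histories."""
    rng = random.Random(seed)
    samples = [discounted_weight(sample_history(rng)) for _ in range(n_samples)]
    return sum(samples) / len(samples)

def sample_history(rng):
    # Stand-in for p_{v4}: half the relevant histories use the "lazy" framing
    # (weight settles at 30), half the "efficient" framing (weight stays 0).
    final = 30.0 if rng.random() < 0.5 else 0.0
    return [0.0] * 10 + [final] * 990

W = expected_relevant_weight(sample_history)
assert 8.0 < W < 22.0  # roughly half of the settled weight of ~30
```

Note that the estimate lands near 15: the AI is averaging over two framings that each produce an extreme weight, which is exactly the tension section 4.1 returns to.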
2.4 Moral errors and moral learning
The above was designed to address underdefined values, but it actually does much more than that. It deals with changeable values, and addresses moral errors and moral learning.
An example of moral error is thinking that you want something, but, upon achieving it, you find that you don’t. Let us examine v2, the desire to be slim. People don’t generally have a strong intrinsic desire for slimness just for the sake of it; instead, they might strive for this because they think it will make them healthier, happier, might increase their future status, might increase their self-discipline in general, or something similar.
So we could replace v2 with v′2= {I desire X}, where X is something that H believes will come out of slimming.
When computing WH(v2) and WH(v′2), the AI will test how H reacts to achieving slimness, or achieving X, and ultimately compute a low WH(v2) but a high WH(v′2). This would be even more the case if Hv2 is allowed to contain impossible future histories, such as hypotheticals where the human miraculously slims without achieving X, or vice-versa.
The use of WH also picks up systematic, predictable moral change. For example, the human may currently be committed to a narrative that sees themselves as a disciplined, stereotypically-rational being that will overcome their short-term weaknesses. Their weight wH(v2) is high. However, the AI knows that trying to slim will be unpleasant for H, and that they will soon give up as the pain mounts, and change their narrative to one where they accept and balance their own foibles. So the expected WH(v2) is low, under most reasonable futures where humans cannot control their own value changes (this has obvious analogies with major life changes, such as loss of faith or changes in political outlook).
Then there is a third case, where strongly held values may end up being incoherent (as I argued is the case for the ‘purity’ moral foundation). Suppose the human deeply believes that v5= {Humans have souls and pigs don’t, so it’s ok to eat pigs, but not ok to defile the human form with liposuction}. This value would thus endorse Rb and Rn. But it’s also based on false facts.
There seem to be three standard ways to resolve this. Replacing “soul” with, say, “mind capable of complex thought and ability to suffer”, they may shift v5 to v6= {I should not eat pigs}. Or if they go for “humans have no souls, so ‘defilement’ makes no sense”, they may embrace v7= {All human enhancements are fine}. Or, as happens often in the real world when people can’t justify their values, they may shift their justification but preserve the basic value: v8= {It is natural and traditional, and therefore good, to eat pig and avoid liposuction}.
Now, I feel v8 is probably incoherent as well, but there is no lack of coherent-but-arbitrary reasons to eat pigs and avoid liposuction, so some value set similar to that is plausible.
Then suitably defined Hv would allow the AI to figure out which way the human wants to update their values among v5, v6, v7, and v8, as the human moves away from the incorrect initial value to one of the alternatives.
2.5 Automated philosophy and CEV
The use of WH also allows one to introduce philosophy to the mix. One simply needs to include in Hv the presentation of philosophical thought experiments to H, and H’s reaction and updating. Similarly, one can do the initial steps of coherent extrapolated volition, by including futures where H changes themselves in the desired direction. This can be seen as automating some of philosophy (this approach has nothing to say about epistemology and ontology, for instance).
Indeed, you could define philosophers as people with particularly strong philosophical meta-values: that is, putting a high premium on philosophical consistency, simplicity, and logic.
The more weight is given to philosophy or to frameworks like CEV, the more elegant and coherent the resulting resolution is, but the higher the risk of it going disastrously wrong by losing key parts of human values—we risk running into the problems detailed here and here.
2.6 Meta-values
We’ll conclude this section by looking at how one can apply the above framework to meta-values. These are values that have non-zero endorsements of other values, ie θ(v,v′)≠0 for some v′∈V.
The previous v7= {All human enhancements are fine} could be seen as a meta-value, one that unendorses the anti-liposuction value v4, so θ(v7,v4)=−1. Or we might have one that unendorses short-term values: v9= {Short-term values are less important}, with θ(v9,v1)=−1.
The problem comes when values refer to values that themselves refer to other values. This allows indirect self-reference, with all the trouble that brings.
Now, there are various tools for dealing with self-reference or circular reasoning—Paul Christiano’s probabilistic self-reference, and Scott Aaronson’s Eigenmorality are obvious candidates.
But in the spirit of adequacy, I’ll directly define a method that can take all these possibly self-referential values and resolve them. Those who are not interested in the maths here can skip to the next section; there is no real insight here.
Let nV=||V||, and let σ be an ordering (or a permutation) of V, ie a bijective map from {1,2,…,nV} to V. Then recursively define WσH by WσH(σ(1))=WH(σ(1)), and
WσH(σ(n))=max(0, WH(σ(n)) + ∑m<n WσH(σ(m))θ(σ(m),σ(n))).
Thus each WσH is the actual weight WH, plus the σ-adjusted endorsements of the values preceding it (in the σ ordering), with the zero lower bound. By averaging across the set P(V) of all permutations of V, we can then define:
ˆWH(v)=(1/nV!) ∑σ∈P(V) WσH(v).
Then, finally, for resolving the reward, we can use these weights in the standard reward function:
Θ(wH,θ,μ)=∑R∈R max(0, ∑v∈V ˆWH(v)θ(v,R)) R.
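In code, the permutation-averaged weights might be sketched as follows; this brute-forces all permutations, so it is only feasible for tiny V, and the illustrative numbers (a short-term value v1 unendorsed by the meta-value v9) are my own:

```python
# Sketch of the permutation-averaged weights of section 2.6: W^sigma feeds each
# value the (clipped) endorsements of its predecessors in the ordering sigma,
# and hat-W_H averages W^sigma over all orderings.

from itertools import permutations

def sigma_weights(order, W, theta):
    """W^sigma: recursively adjusted weights along the ordering `order`.
    theta is a dict mapping (endorser, target) to an endorsement in [-1, 1]."""
    Ws = {}
    for i, v in enumerate(order):
        adj = W[v] + sum(Ws[u] * theta.get((u, v), 0.0) for u in order[:i])
        Ws[v] = max(0.0, adj)  # zero lower bound
    return Ws

def averaged_weights(values, W, theta):
    """hat-W_H: average of W^sigma over all permutations sigma of V."""
    perms = list(permutations(values))
    total = {v: 0.0 for v in values}
    for order in perms:
        Ws = sigma_weights(order, W, theta)
        for v in values:
            total[v] += Ws[v]
    return {v: total[v] / len(perms) for v in values}

W = {"v1": 60.0, "v9": 20.0}
theta = {("v9", "v1"): -1.0}  # v9 = {Short-term values are less important}
print(averaged_weights(["v1", "v9"], W, theta))  # {'v1': 50.0, 'v9': 20.0}
```

In the ordering where v9 comes first, v1's weight drops from 60 to 40; in the other ordering it is untouched; averaging the two gives 50. No self-reference paradox arises because each ordering only lets earlier values adjust later ones.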
3 The “wrong” Θ: meta-values for the resolution process
The Θ of the previous section is sufficient to resolve the values of an H which has no strong feelings on how those values should be resolved.
But many H may find it inadequate: filled with arbitrary choices, doing too much by hand/fiat, or doing too little. So the next step is to let H’s values affect how Θ itself works.
Define Θ0 as the framework constructed in the previous section. And let Ω be the set of all such possible resolution frameworks. We now extend θ so that θ(v) can endorse or unendorse not only elements of V and R, but also of Ω.
Then we can define
Θθ(wH,θ,μ)=∑Θ′∈Ω max(0, ∑v∈V ˆWH(v)θ(v,Θ′)) Θ′(wH,θ,μ),
and define Θ itself as
Θ(wH,θ,μ)=Θ0(wH,θ,μ)+Θθ(wH,θ,μ).
These formulas make sense, since the various elements of Ω take values in R, which can be summed. Also, because we can multiply a reward by a positive scalar, there is no need for renormalising or weighting in these summing formulas.
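A sketch of this meta-level combination, treating each framework in Ω as a function from the data to a reward vector; the alternative framework and all its endorsement numbers below are hypothetical:

```python
# Sketch of section 3: Theta_theta sums the candidate frameworks, each scaled by
# the clipped total endorsement it receives from the (hat-W-weighted) values.

def theta_theta(frameworks, value_weights, framework_endorsements, data):
    """Sum of frameworks Theta', each weighted by max(0, sum_v hatW(v)*theta(v, Theta'))."""
    out = {}
    for name, fw in frameworks.items():
        scale = max(0.0, sum(value_weights[v] * framework_endorsements.get((v, name), 0.0)
                             for v in value_weights))
        for r, coeff in fw(data).items():
            out[r] = out.get(r, 0.0) + scale * coeff
    return out

# Hypothetical alternative framework that only counts the slimness reward
frameworks = {"slim_only": lambda data: {"R_s": data["R_s"]}}
value_weights = {"v2": 40.0}                  # hat-W weight of v2
endorsements = {("v2", "slim_only"): 0.5}     # v2 mildly endorses the alternative

print(theta_theta(frameworks, value_weights, endorsements, {"R_b": 1.0, "R_s": 1.0}))
# {'R_s': 20.0}: the alternative framework contributes with weight 40 * 0.5
```

The final Θ would then just add this output, coefficient by coefficient, to the reward produced by the default framework Θ0.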
Now, this is not a complete transformation of Θ according to H’s values—for example, there is no place for these values to change the computation of Θθ, which is computed according to the ˆWH previously defined for Θ0.
But I won’t worry about that for the moment, though I’ll undoubtedly return to it later. For one thing, I very much doubt that many humans have strong intuitions about the correct method for resolving contradictions among the different ways of designing a resolution system that maps values and meta-values to a reward. And if someone does have such a meta-value, I’d wager it’s mostly there to benefit a specific object-level value or reward, so it’s more instructive to look at the object level.
But the real reason I won’t dig too much into those issues for the moment is that the next section demonstrates that there are problems with fully self-referential ways of resolving values. I’d like to understand and solve those before getting too meta about the resolution process.
4 Problems with self-referential Θ
Here I’ll look at some of the problems that can occur with a fully self-referential Θ and/or v. The presentation will be more informal, since I haven’t yet defined the language or formalism to express these issues precisely.
4.1 All-or-nothing values, and personal identity
Some values put a high premium on simplicity, or on defining the whole of the relevant part of Θ. For example, the paper “An impossibility theorem in population axiology...” argues that total utilitarianism is the only theory that avoids a series of counter-intuitive problems.
Now, I’ve disagreed that these problems are actually problems. But some people’s intuitions strongly disagree with me, and feel that total utilitarianism is justified by these arguments. Indeed, I get the impression that, for some people, even a small derogation to total utilitarianism is bad: they strongly prefer 100% total utilitarianism to 99.99% total utilitarianism + 0.01% something else.
This could be encoded as a value v10= {I value having a simple population ethics}. This would provide a bonus based on the overall simplicity of the image of Θ. But in doing this, we have introduced issues of personal identity (which I’ve argued is unresolved in terms of reward functions), as well as conditions on the image of Θ.
Population ethics feels like an abstract, high-level concern, but here is a much more down-to-earth version. When the AI looks forward, it extrapolates the weight of certain values based on their expected weight in the future. What if the AI extrapolates that w∞H(vi) will be either 0 or 10 in the future, with equal probability? It then reasonably sets WH(vi) to 5.
But the human will live in one of those futures. The AI will be maximising their ‘true goals’ which include WH(vi)=5, while H is forced into extreme values of wH (0 or 10) which do not correspond to the value the AI is currently maximising. So v11= {I want to agree with the reward that Θ computes} is a reasonable meta-value, that would reward closeness between expected future values and actual future values.
In that case, one thing the AI would be motivated to do is manipulate H so that they have the ‘right’ weights in the future. But this might not always be possible. And H might see that as a dubious thing to do anyway.
Note here that this is not a problem of desiring personal moral growth in the future. Assuming that can be defined, the AI can then grant it. The problem would be wanting personal moral growth and wanting the AI to follow the values that emerge from this growth.
4.2 You’re not the boss of me!
For self-reference, we don’t need Gödel or Russell. There is a much simpler, more natural self-reference paradox lurking here, one that is very common in humans: the urge to not be told what to do.
If the AI computes Θ(wH,θ,μ)=R, there are many humans who would, on principle, declare and decide that their reward was something other than R. This could be a value v12= {Θ incorrectly computes my values}. I’m not sure how to resolve this problem, or even if it’s much of a problem (if the human will disagree equally no matter what, then we may as well ignore that disagreement; and if they disagree to different degrees in different circumstances, this gives something to minimise and trade-off against other values). But I’d like to understand and formalise this better.
5 Conclusion: much more work
I hope this post demonstrates what I am hoping to achieve, and how we might start going about it. Combining this resolution project with the means of extracting human values would then allow the Inverse Reinforcement Learning project to succeed in full generality: the AI could deduce human values from observation, and then follow them. This seems like a potential recipe for a Friendly-ish AI.
Resolving human values, completely and adequately
In previous posts, I’ve been assuming that human values are complete and consistent, but, finally, we are ready to deal with actual human values/preferences/rewards—the whole contradictory, messy, incoherent mess of them.
Define “completely resolving human values”, as an AI extracting a consistent human reward function from the inconsistent data on the preference and values of a single human (leaving aside the easier case of resolving conflicts between different humans). This post will look at how such resolutions could be done—or at least propose an initial attempt, to be improved upon.
EDIT: There is a problem with rendering some of the LaTeX, which I don’t understand. The draft rendered it fine, but not the published version. So I’ve replaced some LaTeX with unicode or image files; it generally works, but there are oversized images in section 3.
Adequate versus elegant
Part of the problem is resolving human values, is that people have been looking to do it too well and too elegantly. This results in either complete resolutions that ignore vast parts of the values (eg hedonic utilitarianism), or in thorough analyses of a tiny part of the problem (eg all the papers published on the trolley problem).
Incomplete resolutions are not sufficient to guide an AI, and elegant complete resolutions seem to be like most utopias: not any good for real humans.
Much better to aim for an adequate complete resolution. Adequacy means two things here:
It doesn’t lead to disastrous outcomes, according to the human’s current values.
If a human has a strong value or meta-value, that will strongly influence the ultimate human reward function, unless their other values point strongly in the opposite direction.
Aiming for adequacy is quite freeing, allowing you to go ahead and construct a resolution, which can then be tweaked and improved upon. It also opens up a whole new space of possible solutions. And, last but not least, any attempt to formalise and write a solution gives a much better understanding of the problem.
Basic framework, then modifications
This post is a first attempt at constructing such an adequate complete resolution. Some of the details will remain to be filled in, others will doubtlessly be changed; nevertheless, this first attempt should be instructive.
The resolution will be built in three steps:
a) It will provide a basic framework for resolving low level values, or meta-values of the same “level”.
b) It will extend this framework to account for some types of meta-values applying to lower level values.
c) It will then allow some meta-values to modify the whole framework.
Finally, the post will conclude with some types of meta-values that are hard to integrate into this framework.
1 Terminology and basic concepts
Let H be a human, whose “true” values we are trying elucidate. Let M be the possible environments (including its transition rules), with μ∈M the actual environment. And let H be the set of future histories that the human may encounter, from time t=0 onward (the human’s past history is seen as part of the environment).
Let R={R:H→R} be a set of rewards. We’ll assume that R is closed under many operations—affine transformations (including negation), adding two rewards together, multiplying them together, and so on. For simplicity, assume that R is a real vector space, generated by a finite number of basis rewards.
Then define V to be a set of potential values of H. This is defined to be all the value/preference/reward statements that H might agree to, more or less strongly.
1.1 The role of the AI
The AI’s role is elucidate how much the human actually accepts statements in V (see for instance here and here). For any given v∈V, it will compute wH(v)≥0, the weight of the value v. For mental calibration purposes, assume that wH(v) is in the 0 to 100 range, and that if the human has no current opinion on v, then wH(v) is zero (the converse is not true: wH(v) could be zero because the human has carefully analysed v but found it to be irrelevant or negative).
The AI will also compute θ(v)∈[−1,1]R∪V={f:R∪V→[−1,1]}, the endorsement of v. This measures the extent to which v ‘approves’ or ‘disapproves’ of a certain reward or value (there is a reward normalisation issue which I’ll elide for now).
Object level values are those which are non-zero only on rewards; ie the v∈V for which θ(v)(v′)=0 for all v′∈V. To avoid the most obvious self-referential problem, any value’s self-endorsement is assumed to be zero (so θ(v)(v)=0). As we will see below, positively endorsing a negative reward is not the same as negatively endorsing a positive reward: θ(v)(−R)=a does not mean the same thing as θ(v)(R)=−a.
Then this post will attempt to define the resolution function Θ, which maps weights, endorsements, and the environment to a single reward function. So if O=[0,100]V×[−1,1]R∪V×M is the cross product of all possible weight functions, endorsement functions, and environments:
In the following, we’ll also have need for a more general θ, and for special distributions over H dependent on a given v∈V; but we’ll define them as and when they are needed.
2 The basic framework
In this section, we’ll introduce a basic framework for resolving rewards. This will involve making a certain number of arbitrary choices, choices that may then get modified in the next section.
This section will deal with the problems with human values being contradictory, underdefined, changeable, and manipulable. As a side effect, this will also deal with the fact that humans can make moral errors (and end up feeling their previous values were ‘wrong’), and that they can derive insights from philosophical thought experiments.
As an example, we’ll use a classic modern dilemma: whether to indulge in bacon or to keep slim.
So let there be two rewards, Rb the bacon reward, and Rs, the slimness reward. Assume that if H always indulges, Rb=1 and Rs=0, while if they never indulge, Rb=0 and Rs=1. There are various tradeoff and gains from trade for intermediate levels of indulgence, the details of which are not relevant here.
Then define R={aRb+bRs|a,b∈R}.
2.1 Contradictory values
Define v1,v2∈V by v1= {I like eating bacon}, and v2= {I want to keep slim}. Given the right normative assumptions, the AI can easily establish that wH(v1) and wH(v2) are both greater than zero. For example, it can note that the human sometimes does indulge, or desires to do so; but also the human feels sad about gaining weight, shame about their lack of discipline, and sometimes engages in anti-bacon precommitment activities.
The natural thing here is to weight Rb by the weight of v1 and the endorsement that v1 gives to Rb (and similarly with v2 and Rs). This means that
Or, for the more general formula, with implicit uncurrying so as to write θ as a function of two variables:
For this post, I’ll ignore the issue of whether that sum always converges (which it would almost certainly do, in practice).
2.2 Unendorsing rewards
I said there was a difference between a negative endorsement of R, and a positive endorsement of −R. A positive endorsement is just a value judgement that sees −R as good, while the negative endorsement just doesn’t want R to appear at all.
For example, consider v3= {I’m not worried about my weight}. Obviously this has a negative endorsement of Rs, but it doesn’t have a positive endorsement of −Rs - it explicitly doesn’t have a desire to be fat, either. So the weight and endorsement of v3 are fine when it comes to reducing the positive weight of Rs, but not when making a zero or negative weight more negative. To capture that, rewrite Θ as:
Then the AI, to maximise H’s rewards, simply needs to follow the policy that maximises that reward.
2.3 Underdefined rewards
Let’s now look at the problem of underdefined values. To illustrate, add the option of liposuction to the main model. If H indulges in bacon, and undergoes liposuction, then both Rb and Rs can be set to 1.
But H might not want to undergo liposuction (assumed, in this model, to be costless). Let Rn be the reward for no liposuction, Rn=0 if liposuction is avoided, and Rn=−1 if it happens, and let v4= {I want to avoid liposuction}. Extend R to {aRb+bRs+cRn|a,b,c∈R}.
Because H hasn’t thought much about liposuction, they currently have wH(v4)=0. But it’s possible they may have firm views on it, after some reflection. If so, it would be good to use those views now. When humans haven’t thought about values, there are many ways they can develop them, depending on how the issue is presented to them and how it interacts with their categories, social circles, moral instincts, and world models.
For example, assume that the AI can figure out that, if H is given a description of liposuction that starts with “lazy people can cheat by...”, then they will be against it: wH(v4) will be greater than zero. However, if they are given a description that starts with “efficient people can optimise by...”, then they will be in favour of it, and wH(v4) will be zero.
If wH(v4)(t,h) is the weight of v4 at future time t, given the future history h∈H, define the discounted future weight as
for, say, γ=999/1000 if t is denominated in days. If h is the history with the “lazy” description, this will be greater than zero. If it’s the history with the “efficient” description, it will be close to zero.
We’d like to use the expected value of w∞H(v4), but there are two problems with that. The first is that many possible futures might involve no reflection about v4 on the part of H. We don’t care about these futures. The other is that these futures depend on the actions of the AI, so that it can manipulate the human’s future values.
So define Hv4, a subset of the set of histories H. This subset is defined firstly so that the H will have relevant opinions about v4: they won’t be indifferent to it. Secondly, these are future on which the human is allowed to develop their values ‘naturally’, without undue rigging and influence on the part of the AI (see this for an example of such a distribution). Note that these need not be histories which will actually happen, just future histories which the AI can estimate. Let pv4 be the probability distribution of future histories, restricted to Hv4 (this requires that the AI pick a sensible probability distribution over its own future policy, at least for the purpose of computing this probability distribution).
Note that the exact definitions of Hv4 and pv4 are vitally important and still need to be fully established. That is a critical problem I’ll be returning to in the future.
Laying that aside for the moment, we can define the expected relevant weight:
$$W_H(v_4) = \mathbb{E}_{p_{v_4}}\!\left[w^\infty_H(v_4)(h)\right] = \sum_{h\in H_{v_4}} p_{v_4}(h)\, \frac{\int_0^\infty \gamma^t\, w_H(v_4)(t,h)\, dt}{\int_0^\infty \gamma^t\, dt}.$$
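The discounting-then-averaging above might be sketched in code as follows. This is a minimal discrete-time Python sketch; the two weight trajectories, their probabilities, and the ten-day delay are invented purely for illustration.

```python
GAMMA = 999 / 1000  # daily discount factor gamma

def discounted_weight(weights, gamma=GAMMA):
    """Approximate w^infinity_H(v)(h): the discounted average of a value's
    weight along one history, where weights[t] = w_H(v)(t, h) on day t
    (a finite sum standing in for the integral ratio)."""
    num = sum(gamma**t * w for t, w in enumerate(weights))
    den = sum(gamma**t for t in range(len(weights)))
    return num / den

def expected_relevant_weight(histories):
    """W_H(v): expectation of the discounted weight over the restricted
    distribution p_v, given as (probability, weight-trajectory) pairs."""
    return sum(p * discounted_weight(ws) for p, ws in histories)

# Hypothetical numbers: with probability 0.5, H hears the "lazy" framing
# and settles on weight 1 for the anti-liposuction value v4 after ten
# days; with probability 0.5 they hear the "efficient" framing and stay at 0.
lazy = [0.0] * 10 + [1.0] * 990
efficient = [0.0] * 1000
W = expected_relevant_weight([(0.5, lazy), (0.5, efficient)])
# W comes out a little under 0.5: the "lazy" branch contributes slightly
# less than half, since its first (undecided) ten days are discounted least.
```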
The formula for Θ then becomes the same as before, using WH in place of wH.
2.4 Moral errors and moral learning
The above was designed to address underdefined values, but it actually does much more than that. It deals with changeable values, and addresses moral errors and moral learning.
An example of moral error is thinking that you want something, but, upon achieving it, you find that you don’t. Let us examine v2, the desire to be slim. People don’t generally have a strong intrinsic desire for slimness just for the sake of it; instead, they might strive for this because they think it will make them healthier, happier, might increase their future status, might increase their self-discipline in general, or something similar.
So we could replace v2 with v′2= {I desire X}, where X is something that H believes will come out of slimming.
When computing WH(v2) and WH(v′2), the AI will test how H reacts to achieving slimness, or achieving X, and ultimately compute a low WH(v2) but a high WH(v′2). This would be even more the case if Hv2 is allowed to contain impossible future histories, such as hypotheticals where the human miraculously slims without achieving X, or vice-versa.
The use of WH also picks up systematic, predictable moral change. For example, the human may currently be committed to a narrative that sees themselves as a disciplined, stereotypically rational being who will overcome their short-term weaknesses. Their weight wH(v2) is high. However, the AI knows that trying to slim will be unpleasant for H, that they will soon give up as the pain mounts, and that they will change their narrative to one where they accept and balance their own foibles. So the expected WH(v2) is low, under most reasonable futures where humans cannot control their own value changes (this has obvious analogies with major life changes, such as loss of faith or shifts in political outlook).
Then there is a third case, where strongly held values may end up being incoherent (as I argued is the case for the ‘purity’ moral foundation). Suppose the human deeply believes that v5= {Humans have souls and pigs don’t, so it’s ok to eat pigs, but not ok to defile the human form with liposuction}. This value would thus endorse Rb and Rn. But it’s also based on false facts.
There seem to be three standard ways to resolve this. Replacing “soul” with, say, “mind capable of complex thought and ability to suffer”, they may shift v5 to v6= {I should not eat pigs}. Or, if they go for “humans have no souls, so ‘defilement’ makes no sense”, they may embrace v7= {All human enhancements are fine}. Or, as happens often in the real world when people can’t justify their values, they may shift their justification but preserve the basic value: v8= {It is natural and traditional, and therefore good, to eat pigs and avoid liposuction}.
Now, I feel v8 is probably incoherent as well, but there is no lack of coherent-but-arbitrary reasons to eat pigs and avoid liposuction, so some similar value set is plausible.
A suitably defined Hv would then allow the AI to figure out which way the human wants to update their values across v5, v6, v7, and v8, as the human moves away from the factually incorrect initial value towards one of the alternatives.
2.5 Automated philosophy and CEV
The use of WH also allows one to introduce philosophy to the mix. One simply needs to include in Hv the presentation of philosophical thought experiments to H, and H’s reaction and updating. Similarly, one can do the initial steps of coherent extrapolated volition, by including futures where H changes themselves in the desired direction. This can be seen as automating some of philosophy (this approach has nothing to say about epistemology and ontology, for instance).
Indeed, you could define philosophers as people with particularly strong philosophical meta-values: that is, putting a high premium on philosophical consistency, simplicity, and logic.
The more weight is given to philosophy or to frameworks like CEV, the more elegant and coherent the resulting resolution is, but the higher the risk of it going disastrously wrong by losing key parts of human values—we risk running into the problems detailed here and here.
2.6 Meta-values
We’ll conclude this section by looking at how one can apply the above framework to meta-values. These are values that have non-zero endorsements of other values, ie θ(v,w)≠0.
The previous v7= {All human enhancements are fine} could be seen as a meta-value, one that unendorses the anti-liposuction value v4, so θ(v7,v4)=−1. Or we might have one that unendorses short-term values: v9= {Short-term values are less important}, with θ(v9,v1)=−1.
The problem comes when values refer to other values that refer back to them. This allows indirect self-reference, with all the trouble that that brings.
Now, there are various tools for dealing with self-reference or circular reasoning—Paul Christiano’s probabilistic self-reference, and Scott Aaronson’s Eigenmorality are obvious candidates.
But in the spirit of adequacy, I’ll directly define a method that can take all these possibly self-referential values and resolve them. Those who are not interested in the maths here can skip to the next section; there is no real insight here.
Let nV=|V|, and let σ be an ordering (or a permutation) of V, ie a bijective map from {1,2,…,nV} to V. Then recursively define WσH by WσH(σ(1))=WH(σ(1)), and

$$W^\sigma_H(\sigma(n)) = \max\left(0,\; W_H(\sigma(n)) + \sum_{m=1}^{n-1} W^\sigma_H(\sigma(m))\,\theta\big(\sigma(m), \sigma(n)\big)\right).$$
Thus each WσH is the actual weight WH, plus the σ-adjusted endorsements of the values preceding it (in the σ ordering), with a zero lower bound. By averaging across the set P(V) of all permutations of V, we can then define:

$$\hat{W}_H(v) = \frac{1}{n_V!} \sum_{\sigma \in P(V)} W^\sigma_H(v).$$
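As a sanity check on this construction, here is a small Python sketch of the permutation-averaging: each ordering adjusts weights by the endorsements of earlier values, floored at zero, and the results are averaged over all orderings. The values, base weights, and endorsements in the example are illustrative.

```python
from itertools import permutations

def sigma_weights(base_W, theta, order):
    """W^sigma_H: endorsement-adjusted weights under one ordering sigma.
    Each value's weight is its base weight plus the theta-endorsements
    from values earlier in the ordering (scaled by their already-adjusted
    weights), floored at zero."""
    Ws = {}
    for i, v in enumerate(order):
        adj = base_W[v] + sum(Ws[u] * theta.get((u, v), 0.0) for u in order[:i])
        Ws[v] = max(0.0, adj)
    return Ws

def averaged_weights(base_W, theta):
    """hat-W_H: the average of W^sigma_H over all n_V! orderings of V."""
    values = list(base_W)
    perms = list(permutations(values))
    return {
        v: sum(sigma_weights(base_W, theta, order)[v] for order in perms) / len(perms)
        for v in values
    }

# Illustrative example: v7 (weight 1) unendorses v4 (weight 1).
base_W = {"v4": 1.0, "v7": 1.0}
theta = {("v7", "v4"): -1.0}
hat_W = averaged_weights(base_W, theta)
# In the ordering (v4, v7), v4 keeps weight 1; in (v7, v4), v7's
# unendorsement drives v4 to 0. Averaging gives v4 a weight of 0.5.
```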
Then, finally, to resolve the reward, we can use these weights ŴH in place of wH in the standard reward function.
3. The “wrong” Θ: meta-values for the resolution process
The Θ of the previous section is sufficient to resolve the values of an H which has no strong feelings on how those values should be resolved.
But many H may find it inadequate: filled with arbitrary choices, doing too much by hand/fiat, or doing too little. So the next step is to let H’s values affect how Θ itself works.
Define Θ0 as the framework constructed in the previous section. And let Ω be the set of all such possible resolution frameworks. We now extend θ so that θ(v) can endorse or unendorse not only elements of V and R, but also of Ω.
Then we can define

$$\Theta^\theta(w_H,\theta,\mu) = \sum_{\Theta' \in \Omega} \left( \sum_{v\in V} \hat{W}_H(v)\,\theta(v,\Theta') \right) \Theta'(w_H,\theta,\mu),$$
and define Θ itself as

$$\Theta(w_H,\theta,\mu) = \Theta_0(w_H,\theta,\mu) + \Theta^\theta(w_H,\theta,\mu).$$
These formulas make sense, since the various elements of Ω take values in R, which can be summed. Also, because we can multiply a reward by a positive scalar, there is no need for renormalising or weighting in these sums.
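Concretely, if we represent rewards as functions of histories, this summing can be sketched in Python. This is a hypothetical sketch, assuming the contribution of each alternative framework Θ′ is scaled by its net value-weighted endorsement; the names and toy numbers are invented.

```python
def combined_reward(theta0, frameworks, hat_W, theta, history):
    """Theta = Theta_0 + Theta^theta: the default resolution plus each
    alternative framework Theta', scaled by its net endorsement
    sum_v hat_W(v) * theta(v, Theta').  Since every element of Omega
    returns a reward (a real number on each history), the scaled
    terms can simply be added together."""
    total = theta0(history)
    for name, reward_fn in frameworks.items():
        endorsement = sum(w * theta.get((v, name), 0.0) for v, w in hat_W.items())
        total += endorsement * reward_fn(history)
    return total

# Toy numbers: one alternative framework, half-endorsed by a single value v7.
base = lambda h: 1.0
alt = lambda h: 10.0
R = combined_reward(base, {"alt": alt}, {"v7": 1.0}, {("v7", "alt"): 0.5}, None)
# R = 1.0 + 0.5 * 10.0 = 6.0
```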
Now, this is not a complete transformation of Θ according to H’s values—for example, there is no place for these values to change the computation of Θθ, which is computed according to the ˆWH previously defined for Θ0. (Note: Those are where the LaTeX errors used to be, and now there are oversized image files which I can’t reduce, sorry!)
But I won’t worry about that for the moment, though I’ll undoubtedly return to it later. First of all, I very much doubt that many humans have strong intuitions about the correct method for resolving contradictions between the different ways of designing a system that maps values and meta-values to a reward. And if someone does have such a meta-value, I’d wager it mostly serves a specific object-level value or reward, so it’s more instructive to look at the object level.
But the real reason I won’t dig too much into these issues for the moment is that the next section demonstrates that there are problems with fully self-referential ways of resolving values. I’d like to understand and solve those before getting too meta about the resolution process.
4 Problems with self-referential Θ
Here I’ll look at some of the problems that can occur with fully self-referential Θ and/or v. The presentation will be more informal, since I haven’t yet defined the language or the formalism to allow such a formulation.
4.1 All-or-nothing values, and personal identity
Some values put a high premium on simplicity, or on defining the whole of the relevant part of Θ. For example, the paper “An impossibility theorem in population axiology...” argues that total utilitarianism is the only theory that avoids a series of counter-intuitive problems.
Now, I’ve disagreed that these problems are actually problems. But some people’s intuitions strongly disagree with me, and feel that total utilitarianism is justified by these arguments. Indeed, I get the impression that, for some people, even a small derogation to total utilitarianism is bad: they strongly prefer 100% total utilitarianism to 99.99% total utilitarianism + 0.01% something else.
This could be encoded as a value v10= {I value having a simple population ethics}. This would provide a bonus based on the overall simplicity of the image of Θ. But in doing so, we have introduced personal identity (an issue which I’ve argued is unresolved in terms of reward functions), as well as facts about the image of Θ.
Population ethics feels like an abstract high-level concept, but here is a much more down-to-earth version. When the AI looks forwards, it extrapolates the weight of certain values based on the expected weight in the future. What if the AI extrapolates that w∞H(vi) will be either 0 or 10 in the future, with equal probability? It then reasonably sets WH(vi) to 5.
But the human will live in one of those futures. The AI will be maximising their ‘true goals’ which include WH(vi)=5, while H is forced into extreme values of wH (0 or 10) which do not correspond to the value the AI is currently maximising. So v11= {I want to agree with the reward that Θ computes} is a reasonable meta-value, that would reward closeness between expected future values and actual future values.
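The arithmetic here, plus one possible form for v11, can be made concrete in a few lines of Python. The penalty function is purely illustrative: one hypothetical way of rewarding closeness between the weight Θ uses and the weight H actually ends up with.

```python
def expected_weight(outcomes):
    """W_H(v_i): probability-weighted expectation of the future weight
    w^infinity_H(v_i), over (probability, weight) outcomes."""
    return sum(p * w for p, w in outcomes)

# The AI foresees w = 0 or w = 10 with equal probability, so it
# maximises against the expectation W = 5, which H never actually holds.
W = expected_weight([(0.5, 0.0), (0.5, 10.0)])

def v11_penalty(realised_w, expected_W):
    """A hypothetical encoding of v11: penalise the gap between the
    weight Theta used and the weight H actually ends up with."""
    return -abs(realised_w - expected_W)

# Either way the future goes, the penalty is -5: the mismatch is
# unavoidable unless the AI manipulates H's future weights.
```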
In that case, one thing the AI would be motivated to do is manipulate H so that they have the ‘right’ weights in the future. But this might not always be possible. And H might see it as a dubious thing to do.
Note here that this is not a problem of desiring personal moral growth in the future. Assuming that can be defined, the AI can then grant it. The problem would be wanting personal moral growth and wanting the AI to follow the values that emerge from this growth.
4.2 You’re not the boss of me!
For self-reference, we don’t need Gödel or Russell. There is a much simpler, more natural self-reference paradox lurking here, one that is very common in humans: the urge to not be told what to do.
If the AI computes Θ(wH,θ,μ)=R, there are many humans who would, on principle, declare and decide that their reward was something other than R. This could be a value v12= {Θ incorrectly computes my values}. I’m not sure how to resolve this problem, or even if it’s much of a problem (if the human will disagree equally no matter what, then we may as well ignore that disagreement; and if they disagree to different degrees in different circumstances, this gives something to minimise and trade-off against other values). But I’d like to understand and formalise this better.
5 Conclusion: much more work
I hope this post demonstrates what I am hoping to achieve, and how we might start going about it. Combining this resolution project with a means of extracting human values would then allow the Inverse Reinforcement Learning project to succeed in full generality: the AI could deduce human values from observation, and then follow them. This seems like a potential recipe for a Friendly-ish AI.