A sketch of a value-learning sovereign

In the previous post, I discussed three preference frameworks for goal-directed agents. In this post, I will discuss the value-learning sovereign in more detail.
Eliezer Yudkowsky has suggested that people only confront many important problems in value alignment when they are thinking about Sovereigns, but that at the same time, Sovereigns may be impossibly hard in practice. Yudkowsky advocates that people think about Sovereigns first and list out all the associated issues before stepping down their thinking to Genies, because thinking about Genies may result in premature pruning, while thinking about Sovereigns is more likely to generate a complete list of problems that can then be checked against particular Genie approaches to see if they have become any easier.
To this end, I think it is quite useful to discuss how to create a
value-learning sovereign, even if it is not a good idea to actually create one. I should be explicit about the fact that the concrete models in this post are almost certainly wrong (even conditioning on the fact that we have to build a value-learning sovereign); they’re meant to represent the best concrete illustration of value learning that I can currently write down.
Values and ontologies
We want the AI to learn human values from human behavior. Usually, values are represented as a utility function. If the type of the world history is Ω, then a utility function over Ω is of type Ω→[0,1].
To learn such a utility function U, we must first have some Ω in mind—but what could this Ω be? There are 2 plausible candidates:
The human’s ontology, ΩH. I have some way of mentally representing world states. My ontology contains concepts such as “human” and “happiness”. I can express values, such as caring about human happiness, in this ontology. If the AI has a representation of ΩH, then it may be able to learn the human utility function UH:ΩH→[0,1].
The AI’s ontology, ΩAI. The AI will also model the world somehow.
Probably, its model will be at least partially learned by induction. It will probably make different predictions from me, due to the fact that it might be able to discover physics that I don’t know about (or otherwise model the world differently). Despite
the differences between the AI’s world model and my own, it is quite likely that
my terminal values could be specified well enough in the ontology of a
strongly superintelligent AI, since this ontology is likely to be finer than my own.
How might we more formally represent the ontology? A simple environment model
for talking about ontologies is the partially observable Markov decision
process (POMDP).
A POMDP consists of a number of
iterations. In each iteration, the agent first takes an action (which causes
the state to change stochastically), and then receives an observation of
the next state. First we must define the set of actions A and
observations O. These sets apply both to the human and AI.
Unlike in a standard POMDP, here the agent’s utility function is over
the world history rather than the observed reward.
Now let’s formally define an ontology. An ontology consists of:
a type of world states, S
the distribution over the initial state, s0:ΔS
the stochastic state transition function, st:(S,A)→ΔS, which specifies what distribution of states results starting from a given state if the agent takes a certain action.
the stochastic observation function, o:S→ΔO, which specifies what distribution of observations the agent receives in a given state.
By abuse of notation, let Ω stand both for the type of world histories (list of S values), and the ontology itself.
Note that this model is Cartesian in much the same way as AIXI is, and therefore
faces similar problems. See the paper on realistic world
models for more
details. It is also unrealistic in that it has no explicit “multi-level” structure;
we would expect human and AI concepts to have something like this.
The analysis in the rest of the post will be limited by these problems,
but I think it will still be useful to analyze an incorrect concrete model.
Each stochastic function in the ontology could be represented by a
probabilistic program. For example, consider the following ontology, modelled
after the vacuum cleaner example in Artificial Intelligence: A Modern
Approach:
# action set A = ['left', 'right', 'suck']
# observation set O = ['clean', 'dirty']
# S consists of (vacuum cleaner location, cleanliness) tuples
# where location is between 0 and 9, and cleanliness is a list of 10 booleans
# indicating whether each square is clean.
import random

def s0():
    # start at a random location. Each room is clean with 50% probability.
    return (random.randrange(10), [random.random() < 0.5 for i in range(10)])

def st(s, a):
    loc = s[0]
    cleanliness = s[1][:]
    if a == 'left':
        # move left
        loc = max(0, loc - 1)
    if a == 'right':
        # move right
        loc = min(9, loc + 1)
    if a == 'suck':
        # probably suck dirt from current square
        if random.random() < 0.9:
            cleanliness[loc] = True
    return (loc, cleanliness)

def o(s):
    # observe cleanliness of current square
    if s[1][s[0]]:
        return 'clean'
    else:
        return 'dirty'
With the ontology in place, we can consider some utility function over it (in this case, discounted cleanliness over time):
def U(state_seq):
    util = 0
    discount = 1.0
    for s in state_seq:
        # count the number of clean squares in this state
        clean = 0
        for c in s[1]:
            if c:
                clean += 1
        util += discount * clean
        discount *= 0.99
    return util
Since U discounts exponentially, the sum converges (it is bounded by 10/(1−0.99) = 1000), so it can easily be extended to a utility function over infinite state sequences and rescaled to fit in [0,1].
Planning using an ontology
If the AI already has some ontology Ω and some utility function over the
ontology U, then it is possible for it to search for utility-maximizing
policies. A policy could be represented as a stochastic function π:List(A×O)→ΔA, which specifies what action the agent takes given the action/observation sequence so far.
Essentially, a policy is a stochastic infinite decision tree. π could
be chosen to maximize E[U(ω)|π], where ω:Ω is
the state sequence and the expectation is with respect to the distribution
defined by the ontology Ω.
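As a rough illustration of what this means computationally, here is a minimal sketch that reuses the s0, st, o, and U functions from the vacuum cleaner example. It approximates E[U(ω)|π] by Monte Carlo rollouts truncated at a finite horizon, and "searches" over an explicit list of candidate policies; candidate_policies, horizon, and num_rollouts are hypothetical parameters, and a real planner would need a far better search than enumeration.

def rollout(policy, horizon):
    # sample one state sequence by running the policy in the ontology
    # defined by s0, st, and o above; a policy maps the list of
    # (action, observation) pairs so far to the next action
    s = s0()
    history = []
    states = [s]
    for _ in range(horizon):
        a = policy(history)
        s = st(s, a)
        history.append((a, o(s)))
        states.append(s)
    return states

def estimate_value(policy, horizon=50, num_rollouts=100):
    # Monte Carlo estimate of E[U(omega) | policy]
    return sum(U(rollout(policy, horizon)) for _ in range(num_rollouts)) / num_rollouts

def best_policy(candidate_policies, **kwargs):
    # pick the candidate policy with the highest estimated expected utility
    return max(candidate_policies, key=lambda pi: estimate_value(pi, **kwargs))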
Learning setups
With the ontology machinery in place, we have the tools to look at a couple proposals for how to learn the human’s values.
We assume that the AI has access to the observations and actions of the human.
For example, in the vacuum cleaner example, perhaps the AI sees the sequence
['suck', 'clean', 'right', 'clean', 'right', 'dirty', 'suck', 'clean'],
meaning that the human controlling the vacuum cleaner first decided to suck dirt,
observed that the square was now clean, went right, observed that this square
was also clean, etc. From a long enough sequence like this, the AI should
approximately learn the human’s values.
In practice, it might be necessary for the AI to locate humans in the environment.
This adds some additional complexity, but for now I will ignore this.
Learn the human’s utility function expressed in the human ontology, from human behavior
The AI could assume that the human’s values are expressed over the (unknown) human ontology ΩH. The AI has a joint prior over ΩH and a utility function UH:ΩH→[0,1]. Additionally, the AI needs to predict how the human behaves given their ontology and utility function. One approach, common in economics, is to assume that the human maximizes expected utility. However, this model is quite unrealistic, and alternative models have been explored in the field of cognitive science.
We could represent an alternative behavioral model as a policy π:List(A×O)→ΔA, similar to a planning model. The behavioral model should depend on ΩH and UH. While it is possible to set the behavioral model to maximize expected utility, this is psychologically unrealistic, so the behavioral model should allow the human to sometimes take suboptimal actions.
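As one concrete possibility (an assumption on my part, not something the proposal pins down), the behavioral model could be a noisily-rational "softmax" chooser: the human takes each action with probability proportional to exp(β·Q), where Q is the human's own estimate, under ΩH and UH, of how good the action is. A minimal sketch, with q_estimate as a hypothetical stand-in for that estimate:

import math, random

def softmax_policy(q_estimate, actions, beta=1.0):
    # q_estimate(history, a): the human's estimate (under Omega_H and U_H) of
    # the expected utility of taking action a after the given history --
    # a hypothetical helper standing in for however the human actually plans.
    # Higher beta approaches expected-utility maximization; lower beta gives
    # noisier, more suboptimal behavior.
    def act(history):
        weights = [math.exp(beta * q_estimate(history, a)) for a in actions]
        return random.choices(actions, weights=weights)[0]
    return act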
The AI’s planning, once it has the human ontology and utility function
Suppose the AI has inferred the human ontology ΩH, utility function UH, and behavior model. How could it make plans? There are 3 immediate candidates for planning algorithms:
Mimic the human using the behavioral model.
Select a plan that achieves high expected utility according to ΩH and UH.
Select a plan that achieves high expected utility according to ΩAI and UAI, where UAI:ΩAI→[0,1] is a version of UH that has been translated to be over ΩAI rather than ΩH.
Planning algorithm 1 is not too interesting by itself; it is probably more useful as an input to other AI control methods, such as approval-directed agents.
Planning algorithm 2 is more interesting. It selects a plan that looks good according to the human’s ontology. This does not take advantage of the AI’s ability to make better predictions than the human, but it does take advantage of the AI’s better ability to search for plans. For example, if I were trying to solve a boolean satisfiability problem, an AI using this algorithm could suggest a solution, because my ontology predicts that this solution works, even though I can’t find the solution myself. In this way, an agent using this planning algorithm is similar to an approval-directed agent. The main difference is that it selects a plan (i.e. an infinite stochastic decision tree) that maximizes how good the human expects the results to be, rather than an action. Otherwise, it is quite similar.
Planning algorithm 3 uses the full power of the AI, including the AI’s ability to make better predictions than the human.
It requires deriving a utility function UAI from the inferred human utility function UH.
If we have UH:ΩH→[0,1], and we want to create UAI:ΩAI→[0,1], one way to get this is to create a probabilistic
ontology mapping function ϕ:ΩAI→ΔΩH, and then define
UAI(ωAI)=E[UH(ϕ(ωAI))]. The ontology mapping function is meant to say which histories in ΩH best represent
some history in ΩAI. Probably, it is more intuitive to map states rather than world histories, but it shouldn’t matter much.
The paper Ontological Crises in Artificial Agents’ Value Systems discusses an ontology mapping method.
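For concreteness, here is a minimal sketch of this translation, assuming we already have a sampler phi_sample that maps an ΩAI history to a random ΩH history (phi_sample is hypothetical; constructing it is exactly the hard part of ontology mapping):

def U_AI(omega_AI, phi_sample, U_H, num_samples=100):
    # Monte Carlo estimate of U_AI(omega_AI) = E[U_H(phi(omega_AI))]:
    # map the AI-ontology history into the human ontology several times and
    # average the human utility over the mapped histories.
    total = 0.0
    for _ in range(num_samples):
        omega_H = phi_sample(omega_AI)  # sample a human-ontology history
        total += U_H(omega_H)
    return total / num_samples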
While it would be interesting to look more closely at the relation between
planning algorithm 2 and approval-directed agents at some point, I’ll focus on
planning algorithm 3 for the rest of the post. Planning algorithm 3
has multiple problems:
Unless we have a good theory of cognitive science, it is likely that the true human ontology and utility function will have a very low or zero prior probability.
Human values seem underdetermined by the observation/action data. For a given observation/action sequence, there may be many triples of (ontology, utility function, behavior model) leading to this behavior. The AI must have some way of acting appropriately under this uncertainty.
The ontology mapping seems difficult. I’ll say more about just how hard the ontology mapping problem is in a bit.
Learn the human’s utility function expressed in the AI’s ontology, from human behavior
As an alternative to learning the human’s ontology and utility function
expressed in this ontology, the AI could assume that the human’s values are
expressed over the (known) AI ontology ΩAI. The AI has a joint prior
over the utility function UAI:ΩAI→[0,1] and the
behavior model as before. The assumption is that the human plans using the AI’s
ontology, rather than a different human ontology. Current value-learning
algorithms, such as inverse reinforcement
learning and inverse
planning, work this way
because they do not distinguish between the AI’s ontology and the human’s
ontology.
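For concreteness, here is a crude sketch of what inference in this proposal might look like, done by importance weighting over hypotheses sampled from the prior; prior_sample and likelihood are hypothetical stand-ins for the actual prior and behavior model, and real inverse reinforcement learning algorithms are much more sophisticated.

import random

def infer_values(prior_sample, likelihood, human_data, num_particles=1000):
    # human_data: list of (history, action) pairs observed from the human.
    # prior_sample(): draws a hypothesis, here a (U_AI, behavior_model) pair.
    # likelihood(behavior_model, history, action): probability the behavior
    # model assigns to the human taking that action after that history.
    particles, weights = [], []
    for _ in range(num_particles):
        U, model = prior_sample()
        w = 1.0
        for history, action in human_data:
            w *= likelihood(model, history, action)
        particles.append((U, model))
        weights.append(w)
    # resample one hypothesis in proportion to its posterior weight
    return random.choices(particles, weights=weights)[0]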
Unfortunately, this model is psychologically implausible. We do not think of the human’s preferences over the AI’s ontology as a cause of human behavior; rather, it is the human’s preferences over the human’s ontology that cause human behavior. One place where this shows up is when the
human takes an action that would be irrational if the human were using the AI’s
ontology (for example, the human calculates something incorrectly
because they do not know about quantum physics, which the AI knows about).
The AI has no choice but to either believe
that the human’s utility function considers making this incorrect calculation
to be optimal, or to explain it as an error according to the behavior model.
For the second option to produce reasonable results, the behavior model must be
quite complex; it will probably talk about the human’s ontology and how the
human’s goals in this ontology relate to UAI, much like proposal 1.
Overall, I do not find this proposal promising. Either the behavioral model
suffices to explain correlated human errors due to the human having an incorrect
ontology (in which case the behavioral model contains all the complexity
of proposal 1), or it does not (in which case the AI will learn the wrong
values). Therefore, I will talk about the first proposal in the rest of this
post.
Instrumental or terminal goals?
Paul Christiano has previously written about the distinction between learning
terminal and instrumental goals in his post, Ambitious vs. narrow value
learning.
It is possible to explore this distinction in proposal 1. Since the human’s
utility function is relative to the human’s ontology, it is not possible for it
to express truly terminal goals. Determining humans’ reflective preferences
about states of the universe requires some kind of philosophical extrapolation
process, in which humans clarify their concepts and develop preferences about
their new concepts.
However, by varying the behavioral model, it is possible to learn either
higher-level instrumental goals (for example, getting a job), or lower-level
instrumental goals (for example, filling out a particular job application). If
the behavior model states that the human behaves by finding subgoals of UH
and then optimizing for them (as we would expect if UH were a high-level
goal), then it is more likely to detect high-level goals. On the other hand, if
the behavior model states that the human optimizes for UH more directly (as
we would expect a human to optimize for a low-level goal), then it is more
likely to detect low-level goals.
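To illustrate the distinction, here is a minimal sketch of the two behavioral-model variants, reusing the hypothetical softmax_policy and q_estimate helpers from proposal 1 (with q_estimate here taking the goal being optimized as an extra argument, and find_subgoal as another hypothetical helper):

def direct_behavior_model(U_H, q_estimate, actions, beta=1.0):
    # the human noisily optimizes U_H itself -- the better fit when U_H is a
    # low-level goal like filling out a particular job application
    return softmax_policy(lambda h, a: q_estimate(U_H, h, a), actions, beta)

def subgoal_behavior_model(U_H, find_subgoal, q_estimate, actions, beta=1.0):
    # the human first picks a subgoal serving U_H, then noisily optimizes that
    # subgoal -- the better fit when U_H is a high-level goal like getting a job
    def act(history):
        subgoal = find_subgoal(U_H, history)  # hypothetical subgoal finder
        return softmax_policy(lambda h, a: q_estimate(subgoal, h, a),
                              actions, beta)(history)
    return act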
Note that, since instrumental goals change over time, we would also
need to have UH change over time. This is a simple modification to the original
model. Obviously, the AI’s goal should be set so it has no incentive to change
the human’s goals to make them easier to optimize. Perhaps its goal at time t
is to maximize expected utility of whatever UH is at time t.
Naively, if the AI’s utility function changes over time, then it will be
dynamically inconsistent. The AI at an earlier time will have an incentive to
lock the current value of UH in place, so that future versions of the AI will
optimize for this UH instead of whatever UH is estimated to be in the
future. This would lead to a system that determines what my instrumental
preferences are, and then continues to optimize for these even as my
instrumental preferences change.
An instrumental preference for autonomy
It seems that a system that locks my “object-level” instrumental goals (such
as filling out a job application) in place would be acting against some of my
other instrumental goals: specifically, my instrumental goal of preserving my
autonomy. Paul discusses this preference in his post:
Humans have many clear instrumental goals like “remaining in effective control of the AI systems I deploy,” “acquiring resources and other influence in the world,” or “better understanding the world and what I want.” A value learner may be able to learn robust preferences like these and pursue those instrumental goals using all of its ingenuity.
In general, I will prefer plans that maximize my autonomy, so I could consider
autonomy-maximization to be one of my instrumental goals. This preference
could explain my desire to study moral philosophy, even when this might cause my
moral opinions to change (and therefore be bad according to my current
object-level moral views). By caring about my autonomy, I can mostly preserve
dynamic consistency even as my goals change over time.
More concretely, suppose each state in SH, the state type of the human’s ontology, contains a field for “autonomy”, indicating how much autonomy I have in that state.
We would hope that state sequences in ΩAI
in which the human has low autonomy get mapped to sequences of
states in SH that have a low autonomy field. For example, state
sequences in which the AI manipulates the human should be mapped to states with
a low autonomy field.
Of course, it would be imprudent to assume that proposal 1 will correctly
do all this. “Human autonomy” seems to be a complex concept, so it would be
difficult to learn. To reduce confusion, it would be a good idea to create more
explicit models of this instrumental preference for autonomy. This seems
related to the hard problem of
corrigibility: the human’s desire
for AIs to be corrigible is really a reflection of the human’s preference for
autonomy. This seems somewhat related to hierarchical planning, so maybe I will
have better models of this preference after understanding hierarchical planning
better.
If a model like this works, then we can ground human values in something
other than terminal goals: specifically, systems of instrumental goals at each
time step that chain together in a tiling fashion, with each instrumental goal
system trusting the next under normal circumstances. I think this is a
promising alternative way to look at human values, though I still lack concrete
models for this.
Acting under uncertainty
The system should have uncertainty about the correct values. In both proposals,
the human utility function is underdetermined by the data. In proposal 1, the
human ontology is underdetermined by the data, and additionally any uncertainty
about the correct ontology mapping method propagates into uncertainty about the
correct utility function.
Under uncertainty about the correct utility function, it is not straightforward
to simply maximize expected utility. This is because the
“loudness” of different
possible preferences matters. Given this, there are 2 clear ways to act under
uncertainty:
The system can use a voting system to select actions, with each possible
human utility function gaining votes proportional to its posterior probability.
Unfortunately, this leads to undesirable results when the majority of
the posterior probability mass is on the wrong preferences. Roughly, we
should only expect this to work when the posterior distribution over preferences
is “centered around” an acceptable preference to optimize.
The system can use minimax
to select a policy that does decently according to all possible utility functions.
In particular, the policy should be at least as good as shutting down according to
all possible utility functions. This method of handling uncertainty has
problems; see the “Combining minimax with value learning” section for
details.
I think it’s plausible that some variant of minimax works for conservatively
optimizing values under uncertainty, so more research in this area
could be useful.
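A minimal sketch of the minimax option, assuming a finite menu of candidate policies, one estimate_value-style evaluator per possible utility function, and an explicit shutdown policy (all hypothetical):

def minimax_policy(candidate_policies, utility_estimators, shutdown_policy):
    # utility_estimators: one function per possible U_H, each mapping a policy
    # to its estimated expected utility under that U_H.
    def worst_case(pi):
        return min(estimate(pi) for estimate in utility_estimators)
    # only consider policies that are at least as good as shutting down
    # according to every candidate utility function
    acceptable = [pi for pi in candidate_policies
                  if all(est(pi) >= est(shutdown_policy)
                         for est in utility_estimators)]
    if not acceptable:
        return shutdown_policy
    # among those, pick the policy with the best worst case
    return max(acceptable, key=worst_case)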
The necessity of overpowered ontology mapping
I claim that, for proposal 1 to work, the ontology mapper needs to be very
powerful and reliable. This is because:
It needs to correctly map abstract concepts. For example, state sequences
in ΩAI in which humans have lost autonomy should get
mapped to state sequences in ΩH that have the “autonomy” field set to a
low number. This seems far less straightforward than, say, recognizing
diamonds in an ontology. This is made even more difficult by the fact that
some important human concepts (including autonomy) are value-laden and might not correspond
to useful predictive concepts.
Since the AI is optimizing over state sequences in ΩAI,
the ontology mapper must work correctly across nearly all state sequences
in ΩAI. Even if there is just one state sequence in ΩAI that
humans would consider bad upon reflection, but which gets mapped
to a good-looking state sequence in ΩH, this may be sufficient
for the AI to select a plan leading to this state sequence.
These problems make me quite pessimistic about this proposal. More research
into ontology identification might yield insights about just how hard
these problems are to solve.
Human understanding of plans
Suppose the AI has created a plan π. Humans could examine this plan by
seeing what state sequences in ΩH result from this plan (assuming humans
understand the ΩH ontology). There are 2 obvious ways to do this:
Use the human ontology ΩH to predict the state sequence resulting
from π. This may fail to predict important consequences of the AI’s plan.
For example, if the AI used nanotechnology to solve some problem, and ΩH
does not predict that this nanotechnology will do anything, then it will predict that the AI’s plan will not do much.
Use the AI’s ontology ΩAI to predict the state sequence resulting
from π, and then map this state sequence back to ΩH using ontology
mapping. This will likely predict the consequences of π more accurately
than ΩH. Possibly, this could help to catch errors that result
when the AI accurately infers ΩH and maps between the ontologies
correctly, but incorrectly infers UH. This does not seem like the most likely form of failure to me; errors in ontology mapping seem more likely.
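A minimal sketch of the second option, reusing the hypothetical rollout and phi_sample helpers from earlier sections (here rollout is assumed to simulate the plan under the AI’s ontology ΩAI):

def preview_plan_for_human(policy, phi_sample, horizon=50, num_samples=10):
    # predict consequences in the AI's ontology, then translate each predicted
    # state sequence back into the human ontology for the human to inspect
    predicted_AI_histories = [rollout(policy, horizon) for _ in range(num_samples)]
    return [phi_sample(omega_AI) for omega_AI in predicted_AI_histories]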
Conclusion
I don’t think any of the models I have described will do anything useful
with superhuman intelligence. The most potentially powerful models require essentially solving cognitive science (to get the behavioral model), and creating an overpowered ontology mapper.
Still, I have identified concrete areas for
further research, which might turn up results useful for both value-learning
sovereigns and other agents. Specifically, further research into value-learning
sovereigns could look at:
Clarifying what the instrumental preference for autonomy looks like. I
would like to see a concrete example (either in the mathematical framework
described in this post, or a different mathematical framework) of an AI
representing (and perhaps also learning) the human’s instrumental preference
for autonomy.
Developing a better understanding of ontology identification. I think that
framing ontology identification as mapping states between ontologies (as
in the paper on ontological crises) has
some theoretical problems, which I hope to discuss in a future post.
Looking more closely at the spectrum between mimicking humans and learning
and optimizing for humans’ terminal goals. Many of the proposals in this
post fall somewhere in the middle of these two possibilities, but I don’t
think I have exhausted all the options.
Studying ways of conservatively maximizing under uncertainty about the right
values, similar to minimax.