What do we need value learning for?

I will be writing a sequence of posts about value learning. The purpose of these posts is to create more explicit models of some value learning ideas, such as those discussed in The Value Learning Problem. Although these explicit models are unlikely to capture the complexity of real value learning systems, it is at least helpful to have some explicit model of value learning in mind when thinking about problems such as corrigibility.

This came up because I was discussing value learning with some people at MIRI and FHI. There were disagreements about some aspects of the problem, such as whether a value-learning AI could automatically learn how to be corrigible. I realized that my thinking about value learning was somewhat confused. Making concrete models will clarify my thinking and also provide common models that people can discuss.

A value learning model is an algorithm that observes human behaviors and determines what values humans have. Roughly, the model consists of:

  1. a type of values V,

  2. a prior P(v) over values v in V,

  3. a conditional distribution P(b | v, o) of human behavior b given values v and observations o.

Of course, this is very simplified: a real model must also account for beliefs, memory, and so on. Such a model can be used for multiple purposes, each of which requires different things from it. It is important to keep these applications in mind when constructing value learning models, so that it is clear what target we are shooting for.
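
To pin down what these three components could look like, here is a minimal sketch in Python of a discrete Bayesian value learning model. Everything in it (the candidate value functions, the `behavior_likelihood` helper, the softmax choice rule) is an illustrative assumption of mine, not something specified by the model above:

```python
import numpy as np

# 1. A type of values: each candidate v maps a world state to a utility.
#    All names here (CANDIDATE_VALUES, behavior_likelihood, the toy states)
#    are invented for illustration.
CANDIDATE_VALUES = [
    lambda state: state["apples"],                      # v0: only cares about apples
    lambda state: state["oranges"],                     # v1: only cares about oranges
    lambda state: state["apples"] + state["oranges"],   # v2: cares about both
]

# 2. A prior P(v) over values: uniform over the candidates.
prior = np.ones(len(CANDIDATE_VALUES)) / len(CANDIDATE_VALUES)

def behavior_likelihood(behavior, v, observation):
    """3. P(b | v, o): a noisily rational (softmax) choice among the actions
    available in observation o, each of which leads to a known outcome state."""
    actions = observation["actions"]
    utilities = np.array([v(observation["outcomes"][a]) for a in actions])
    probs = np.exp(utilities - utilities.max())
    probs /= probs.sum()
    return probs[actions.index(behavior)]

def posterior_over_values(prior, data):
    """Bayesian update: P(v | data) is proportional to P(v) * prod_i P(b_i | v, o_i)."""
    posterior = prior.copy()
    for observation, behavior in data:
        posterior *= np.array([behavior_likelihood(behavior, v, observation)
                               for v in CANDIDATE_VALUES])
    return posterior / posterior.sum()
```

Here an observation is a dict such as {"actions": [...], "outcomes": {action: state}} mapping each available action to the state it produces, and the posterior over CANDIDATE_VALUES is the model's guess at which values the human has.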

Creating systems that predict human behavior

Some proposals for safe AI systems require predicting human behavior. These include both approval-directed agents and mimicry-based systems. Additionally, quantilizers benefit from having a distribution over actions that assigns reasonable probability to good actions, such as an approximation of the distribution of actions a human might take. These systems become more useful as their predictions of human behavior become more accurate. To the extent that knowing about human values helps a system predict human behavior, value learning models should make these systems more accurate. Value learning models are also likely to be easier for humans to understand than models created by “black-box” supervised learning methods, such as neural networks.

It is notable that the behavior model here is only used for its distribution of actions P(b | o), not its internal representation of values v. Since it is possible to produce training data for human behavior, it is possible to use supervised learning systems to create these models (though note that supervised learning systems may run into some additional problems as they become superintelligent, such as simulation warfare). Models used for this application may use any internal representation, so long as that representation helps to predict behavior accurately. Previous work in this area includes apprenticeship learning.
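
Continuing the toy sketch above (with the same invented helpers), the prediction-only use of the model simply marginalizes out the values to get a distribution over actions:

```python
def predict_behavior(posterior, observation):
    """P(b | o) = sum_v P(v | data) * P(b | v, o): the action distribution that
    approval-directed, mimicry-based, or quantilizer-style systems would consume.
    The inferred values enter only through this marginalization."""
    actions = observation["actions"]
    action_probs = np.zeros(len(actions))
    for p_v, v in zip(posterior, CANDIDATE_VALUES):
        action_probs += p_v * np.array(
            [behavior_likelihood(a, v, observation) for a in actions])
    return dict(zip(actions, action_probs))
```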

Creating goal-directed value-learning agents

If we want to create a goal-directed agent that pursues a goal compatible with human values, it will be useful for the system to learn what human values are. Here, predicting human behavior is not enough. The internal representation of values, v, is quite important: after learning v, the system must know whether its plans do well according to v. Learning v appears to be a more difficult induction problem than learning P(b | o), since we can’t directly provide training data for v (we’d need to know our actual values to do that).
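
In the same toy setting, a goal-directed agent would instead score its own plans using the inferred values. Here `outcome_of_plan` is an invented stand-in for the agent’s world model:

```python
def plan_value(plan, posterior, outcome_of_plan):
    """Expected utility of a plan under the posterior over values:
    sum_v P(v | data) * v(outcome_of_plan(plan)).
    Unlike the prediction-only use above, this needs v itself, not just P(b | o)."""
    outcome = outcome_of_plan(plan)
    return sum(p_v * v(outcome) for p_v, v in zip(posterior, CANDIDATE_VALUES))

def choose_plan(plans, posterior, outcome_of_plan):
    """Pick the plan that does best according to the learned values."""
    return max(plans, key=lambda plan: plan_value(plan, posterior, outcome_of_plan))
```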

Obviously, a value-learning sovereign agent falls in this category. Additionally, an agent that attempts to accomplish a goal conservatively (in other words, without stepping on anything humans care about) will benefit from having a rough idea of what humans care about. See the Arbital article on corrigibility for some discussion of conservative agents. Regardless of which kind of agent we are discussing, we must decide whether v represents the human’s instrumental or terminal values.

Existing models that actually attempt to learn v (rather than just P(b | o)) include inverse reinforcement learning and inverse planning. Neither of these systems has the AI learn its world model by induction. We will find that, when the AI does learn its world model by induction, the problem becomes more difficult, and some solutions require ontology identification.
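
To make the world-model point concrete, here is a rough sketch of an inverse-planning-style likelihood (not a faithful reproduction of any particular IRL algorithm): the candidate values are reward functions, and behavior is scored against Q-values computed under a transition model T that the designer supplies by hand, rather than one the AI induces from observation.

```python
import numpy as np

def q_values(reward, T, gamma=0.95, iterations=200):
    """Value iteration for a small MDP. The transition model T[s, a, s'] is
    hand-specified by the designer; it is not learned by induction."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    for _ in range(iterations):
        expected_V = T.reshape(n_states * n_actions, n_states).dot(V)
        Q = reward[:, None] + gamma * expected_V.reshape(n_states, n_actions)
        V = Q.max(axis=1)
    return Q

def inverse_planning_likelihood(state, action, reward, T, beta=1.0):
    """P(a | s, reward): Boltzmann-rational action choice given Q-values computed
    under the supplied world model T. Bayesian IRL and inverse planning score
    candidate reward functions with a likelihood of roughly this shape."""
    logits = beta * q_values(reward, T)[state]
    probs = np.exp(logits - logits.max())
    return (probs / probs.sum())[action]
```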

I will focus on this application for the rest of the posts in the series.

Helping humans understand human values

We could also use a value learning model to learn about the structure of human values. Perhaps, in the course of defining such a model, we learn something important about human values. For example, if we try to formally specify human values, we will find that our specification has to allow humans to have preferences about alternative physics; otherwise, humans would become indifferent between all universe states upon discovering that our current understanding of physics is incorrect. This tells us something important about how our values work: they are based on some multi-level representation of the world.

It is also possible that if we create a value learning model and run it on actual behavior, we will learn something useful about human values. For example, if the system learns the wrong values, this could indicate that the model’s hypothesis class does not contain our actual values. These insights are plausibly useful for understanding how to create value-aligned AIs.
