Deconfusing Human Values Research Agenda v1
On Friday I attended the 2020 Foresight AGI Strategy Meeting. Eventually a report will come out summarizing some of what was talked about, but for now I want to focus on what I talked about in my session on deconfusing human values. For that session I wrote up some notes summarizing what I’ve been working on and thinking about. None of it is new, but it is newly condensed in one place and in convenient list form, and it provides a decent summary of the current state of my research agenda for building beneficial superintelligent AI; a version 1 of my agenda, if you will. Thus, I hope this will be helpful in making it a bit clearer what it is I’m working on, why I’m working on it, and what direction my thinking is moving in. As always, if you’re interested in collaborating on things, whether that be discussing ideas or something more, please reach out.
I think we’re confused about what we really mean when we talk about human values.
This is a problem because:
building aligned AI likely requires a mathematically precise understanding of the structure of human values, though not necessarily the content of human values;
we can’t trust AI to discover that structure for us because we would need to understand it well enough to verify the result, and I think we’re so confused about what human values are that we couldn’t do that without a high risk of error.
What are values?
We don’t have an agreed upon precise definition, but loosely it’s “stuff people care about”.
When I talk about “values” I mean the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology.
Importantly, what people care about is used to make decisions, and this has had implications for existing approaches to understanding values.
Much research on values tries to understand the content of human values or why humans value what they value, but not what the structure of human values is such that we could use it to model arbitrary values. This research unfortunately does not appear very useful to this project.
The best attempts we have right now are based on the theory of preferences.
In this model a preference is a statement located within a (weak, partial, total, etc.) order, often written like A > B > C to mean A is preferred to B, which is preferred to C.
Goodhart effects are robust, and preferences in formal models are measures (proxies), not the thing we care about itself.
Stated vs. revealed preferences: we generally favor revealed preferences, but this approach has some problems.
General vs. specific preferences: do we look for context-independent preferences (“essential” values) or context-dependent preferences?
generalized preferences, e.g. “I like cake better than cookies”, can lead to irrational preferences (e.g. non-transitive preferences; see the sketch after this list)
contextualized preferences, e.g. “I like cake better than cookies at this precise moment”, limit our ability to reason about what someone would prefer in new situations
See Stuart Armstrong’s work for an attempt to address these issues so we can turn preferences into utility functions.
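To make the non-transitivity worry concrete, here is a minimal sketch (my own illustration, not part of the original notes; the names and example preferences are made up): preferences represented as a binary relation, with a check for whether they can be arranged into a consistent order.

```python
# Minimal sketch: preferences as a binary relation, plus a transitivity check.
# All names and the example preferences below are invented for illustration.
from itertools import permutations

def is_transitive(prefers):
    """True if there is no chain a > b > c without a > c."""
    items = {x for pair in prefers for x in pair}
    return all(
        (a, c) in prefers
        for a, b, c in permutations(items, 3)
        if (a, b) in prefers and (b, c) in prefers
    )

# Each pairwise report sounds reasonable on its own...
prefers = {("cake", "cookies"), ("cookies", "pie"), ("pie", "cake")}
print(is_transitive(prefers))  # False: cake > cookies > pie > cake is a cycle
```

Each pairwise statement seems fine in isolation, but jointly there is no ordering, and hence no utility function, that represents them; this is exactly the failure mode generalized, context-collapsed preferences invite.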
Preference-based models look to me like they are trying to specify human values at the wrong level of abstraction. But what would the right level of abstraction be?
What follows is a summary of what I so far think moves us closer to less confusion about human values. I hope that by the end of the discussion I’ll think some of this is wrong or insufficient!
Agents have fuzzy but definable boundaries.
Everything in every moment causes everything in every next moment up to the limit of the speed of light, but we can find clusters of stuff that interact with themselves in ways that are “aligned” such that the stuff in a cluster makes sense to model as an agent separate from the stuff not in the cluster.
Humans (and other agents) cause events. We call this acting.
The process that leads to taking one action rather than another possible action is deciding.
Decisions are made by some decision generation process.
Values are the inputs to the decision generation process that determine its decisions and hence actions.
Preferences and meta-preferences are statistical regularities we can observe over the actions of an agent (see the toy sketch after this list).
Important differences from preference models:
Preferences are causally after, not causally before, decisions, contrary to the standard preference model.
This is not 100% true: self-aware agents, like humans, can observe their own preferences, and those observations then feed back into the decision generation process.
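Here is a toy sketch of the distinction above (again my own illustration; the value weights, options, and noise level are all invented): values are the inputs that drive a decision generation process, while the “preferences” we report are summaries computed afterward from the stream of observed actions.

```python
# Toy sketch: hidden "values" feed a decision generation process; "preferences"
# are read off afterward as statistical regularities over the observed actions.
import random
from collections import Counter

values = {"cake": 0.9, "cookies": 0.6, "pie": 0.3}  # hidden inputs to decisions

def decide(options):
    """Decision generation process: noisy selection driven by the value inputs."""
    return max(options, key=lambda o: values[o] + random.gauss(0, 0.2))

# An outside observer only sees the actions...
observed = Counter(decide(["cake", "cookies", "pie"]) for _ in range(1000))

# ...and the "preference ordering" is a regularity summarized from them,
# i.e. it sits causally downstream of the decisions rather than upstream.
preference_order = [option for option, _ in observed.most_common()]
print(preference_order)  # most likely ['cake', 'cookies', 'pie']
```

The observer never sees the value weights directly, only the regularities they produce, which is the sense in which preferences come causally after decisions rather than before them.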
So then what are values? The inputs to the decision generation process?
My best guess: valence
My best best guess: valence as modeled by minimization of prediction error
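As a rough illustration of that guess (a toy of my own, not a claim about actual neuroscience; all numbers are invented), treat valence as the negative of prediction error, so that updating a model to minimize prediction error is the same thing as pushing valence toward its maximum:

```python
# Toy sketch: valence modeled as negative prediction error. All numbers are
# invented; this illustrates the framing, not the actual neuroscience.
observations = [0.9, 0.8, 1.0, 0.85, 0.9]  # some sensory signal over time
prediction = 0.0                            # the agent's running prediction

for obs in observations:
    error = obs - prediction
    valence = -abs(error)          # valence := negative prediction error
    prediction += 0.5 * error      # update the model to shrink future error
    print(f"obs={obs:.2f}  error={error:+.2f}  valence={valence:+.2f}")

# As the prediction improves, error shrinks and valence climbs toward zero,
# which is the sense in which "minimize prediction error" can serve as the
# value signal feeding the decision generation process.
```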
This leaves us with new problems. Now rather than trying to infer preferences from observations of behavior, we need to understand the decision generation process and valence in humans, i.e. this is now a neuroscience problem.
underdetermination due to noise; many models are consistent with the same data
this makes it easy for us to get confused, even when we’re trying to deconfuse ourselves
this makes it hard to know if our model is right since we’re often in the situation of explaining rather than predicting
is this a descriptive or causal model?
both. descriptive of what we see, but trying to find the causal mechanism of what we reify as “values” at the human level in terms of “gears” at the neuron level
what is valence?
complexities of going from neurons to human level notions of values
there are a lot of layers of different systems interacting on the way from neurons to values, and we don’t understand enough about almost any of them, or even know for sure which systems are in the causal chain
Thanks to Dan Elton, De Kai, Sai Joseph, and several other anonymous participants of the session for their attention, comments, questions, and insights.