[Question] How path-dependent are human values?

In the Pointers Problem, John Wentworth points out that human values are a function of latent variables in the world models of humans.

My question is about a specific kind of latent variable, one that’s restricted to be causally downstream of what we can observe. Suppose we take a world model in the form of a causal network and split its variables into those that are upstream of observable variables (in the sense that there’s some directed path going from the variable to something we can observe) and those that aren’t. Say that the variables that have no causal impact on what is observed are “latent” for the purposes of this post. In other words, latent variables are in some sense “epiphenomenal”. This definition of “latent” is narrower than the usual one, but I think the distinction between causally relevant and causally irrelevant hidden variables is quite important, and I’ll only be focusing on the latter for this question.

In principle, we can always unroll any latent variable model into a path-dependent model with no hidden variables. For example, if we have an (inverted) hidden Markov model with one observable $X_t$ and one hidden state $Y_t$ (subscripts denote time), we can draw a causal graph like this for the “true model” of the world (not the human’s world model, but the “correct model” which characterizes what happens to the variables the human can observe):

graph_1

Here the hidden states $Y_t$ are causally irrelevant latent variables: they have no impact on the state of the world, but for some reason or another humans care about what they are. For example, if a sufficiently high capacity model renders “pain” a causally obsolete concept, then pain would qualify as a latent variable in the context of this model.

The latent variable $Y_t$ at time $t$ depends directly on both the observable $X_t$ and the previous latent $Y_{t-1}$, so to accurately figure out the probability distribution of $Y_t$ we need to know the whole trajectory of the world from the initial time: $X_1, X_2, \ldots, X_t$.
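To spell out the unrolling step, here is a sketch of the marginalization, writing $X_t$ for the observable and $Y_t$ for the latent variable at time $t$. It assumes the latent takes discrete values and has no causal effect on any observable; the symbol names and the base case $P(Y_1 \mid X_1)$ are assumptions of this sketch:

$$
P(Y_t \mid X_1, \ldots, X_t) = \sum_{y} P(Y_t \mid X_t, Y_{t-1} = y) \, P(Y_{t-1} = y \mid X_1, \ldots, X_{t-1})
$$

Applying this recursively down to the base case eliminates every hidden variable, at the cost of making $P(Y_t \mid X_1, \ldots, X_t)$ depend on the entire observed history rather than on $X_t$ alone.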

We can imagine, however, that even if human values depend on latent variables, these variables don’t feed back into each other. In this case, how much we value some state of the world would just be a function of that state of the world itself: we’d only need to know the current observable state $X_t$ to figure out what the latent $Y_t$ is. This naturally raises the question I ask in the title: empirically, what do we know about the role of path-dependence in human values?
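As a rough illustration of the difference between the two cases, here is a minimal Python sketch of a toy model with a binary observable and a binary latent variable. Everything in it is hypothetical: the function names, the probability values, and the uniform prior on the initial latent are made up for the example and are not taken from any real model of human values.

```python
# Toy model: binary observable X_t and binary latent Y_t.
# All probabilities are invented numbers, chosen only for illustration.

def p_y_given_x(x):
    """P(Y_t = 1 | X_t = x) when latents do NOT feed back into each other."""
    return 0.9 if x == 1 else 0.2

def p_y_given_x_and_prev(x, y_prev):
    """P(Y_t = 1 | X_t = x, Y_{t-1} = y_prev) when latents DO feed back."""
    base = 0.6 if x == 1 else 0.1
    return base + 0.3 * y_prev  # a previous Y = 1 makes Y = 1 more likely

def latent_with_feedback(xs, prior=0.5):
    """P(Y_t = 1 | X_1..X_t), marginalizing over the previous latent at each step.

    `prior` is the assumed probability that the latent was 1 before the
    first observation.
    """
    p_prev = prior
    for x in xs:
        p_prev = (p_y_given_x_and_prev(x, 1) * p_prev
                  + p_y_given_x_and_prev(x, 0) * (1.0 - p_prev))
    return p_prev

def latent_without_feedback(xs):
    """P(Y_t = 1 | X_1..X_t) when only the current observable matters."""
    return p_y_given_x(xs[-1])

# Two histories that end in the same present state X_t = 1.
history_a = [1, 1, 1]
history_b = [0, 0, 1]

print(latent_with_feedback(history_a), latent_with_feedback(history_b))
print(latent_without_feedback(history_a), latent_without_feedback(history_b))
```

With feedback, the two histories give different distributions over the present latent even though the present observable is identical; without feedback, the answers coincide. The empirical question is which of these two regimes is closer to how human values actually behave.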

I think path-dependence comes up often in how humans handle the problem of identity. For example, if it were possible to clone a person perfectly and then remove the original from existence through whatever means, even if the resulting states of the world were identical, humans who have different trajectories of how we got there in their mental model could evaluate questions of identity in the present differently. Whether I’m me or not depends on more information than my current physical state or even the world’s current physical state.

This looks like it’s important for purposes of alignment because there’s a natural sense in which path-dependence is an undesirable property to have in your model of the world. If an AI doesn’t have that as an internal concept, it could be simpler for it to learn a strategy of “trick the people who believe in path-dependence into thinking the history that got us here was good” rather than “actually try to optimize for whatever their values are”.

With all that said, I’m interested in what other people think about this question. To what extent are human values path-dependent, and to what extent do you think they should be path-dependent? Both general thoughts & comments and concrete examples of situations where humans care about path-dependence are welcome.
