[Question] How path-dependent are human values?

Ege Erdil 15 Apr 2022 9:34 UTC

13 points

Human Values World Optimization World Modeling

In the Pointers Problem, John Wentworth points out that human values are a function of latent variables in the world models of humans.

My question is about a specific kind of latent variable, one that’s restricted to be downstream of what we can observe in terms of causality. Suppose we take a world model in the form of a causal network and we split variables into ones that are upstream of observable variables (in the sense that there’s some directed path going from the variable to something we can observe) and ones that aren’t. Say that the variables that have no causal impact on what is observed are “latent” for the purposes of this post. In other words, latent variables are in some sense “epiphenomenal”. This definition of “latent” is narrower than the usual one, but I think the distinction between causally relevant and causally irrelevant hidden variables is quite important, and I’ll only be focusing on the latter for this question.
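As a rough sketch of this split, the snippet below walks a causal graph backwards from the observables and labels everything it cannot reach as latent. The function name `split_variables` and the node names are hypothetical, chosen only for illustration; the graph is assumed to be given as a list of directed `(parent, child)` edges.

```python
from collections import deque


def split_variables(edges, observables):
    """Split nodes into (causally_relevant, latent).

    A node is causally relevant if it is observable or has a directed
    path to some observable; every other node is latent in the narrow
    sense used in this post.
    """
    parents = {}
    nodes = set(observables)
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
        nodes.update((src, dst))

    # Breadth-first search backwards along the arrows, starting at the observables.
    relevant = set(observables)
    queue = deque(observables)
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in relevant:
                relevant.add(parent)
                queue.append(parent)
    return relevant, nodes - relevant


# Hypothetical toy graph: a hidden cause feeds an observable,
# while an epiphenomenal variable only receives arrows.
edges = [
    ("hidden_cause", "obs_2"),
    ("obs_1", "obs_2"),
    ("obs_2", "epiphenomenon"),
]
relevant, latent = split_variables(edges, {"obs_1", "obs_2"})
```

Here `hidden_cause` is hidden but still causally relevant, while `epiphenomenon` is the only variable that counts as latent under this definition.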

In principle, we can always unroll any latent variable model into a path-dependent model with no hidden variables. For example, if we have an (inverted) hidden Markov model with one observable $x_{k}$ and one hidden state $h_{k}$ (subscripts denote time), we can draw a causal graph like this for the “true model” of the world (not the human’s world model, but the “correct model” which characterizes what happens to the variables the human can observe):

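[Figure: causal graph of the inverted hidden Markov model, with arrows from $x_{k}$ and $h_{k - 1}$ into $h_{k}$]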

Here the $h_{k}$ are causally irrelevant latent variables—they have no impact on the state of the world $x_{k}$ but for some reason or another humans care about what they are. For example, if a sufficiently high capacity model renders “pain” a causally obsolete concept, then pain would qualify as a latent variable in the context of this model.

The latent variable $h_{k}$ at time $k$ depends directly on both $x_{k}$ and $h_{k - 1}$ , so to accurately figure out the probability distribution of $h_{k}$ we need to know the whole trajectory of the world from the initial time: $x_{1}, x_{2}, \dots, x_{k}$ .
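One way to make this explicit, assuming the structure just described (each $h_{k}$ has parents $x_{k}$ and $h_{k - 1}$, no $h$ feeds back into any $x$, and $h_{1}$ depends only on $x_{1}$), is the following recursion for the distribution of the latent:

$$
P(h_{k} \mid x_{1}, \dots, x_{k}) = \sum_{h_{k - 1}} P(h_{k} \mid x_{k}, h_{k - 1}) \, P(h_{k - 1} \mid x_{1}, \dots, x_{k - 1})
$$

Unrolling the recursion down to $P(h_{1} \mid x_{1})$ brings in every $x_{j}$ with $j \le k$, which is where the dependence on the whole trajectory comes from.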

We can imagine, however, that even if human values depend on latent variables, these variables don’t feed back into each other. In this case, how much we value some state of the world would just be a function of that state of the world itself—we’d only need to know $x_{k}$ to figure out what $h_{k}$ is. This naturally raises the question I ask in the title: empirically, what do we know about the role of path-dependence in human values?
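The contrast between the two cases can be illustrated with a deterministic toy example. The update rules below are made up purely for illustration: in the first, each latent feeds into the next; in the second, the latent is a function of the current state alone.

```python
def path_dependent_latent(trajectory):
    """Toy rule h_k = h_{k-1} + x_k, so each latent feeds into the next."""
    h = 0
    for x in trajectory:
        h = h + x  # depends on the previous latent and the current state
    return h


def path_independent_latent(trajectory):
    """Toy rule h_k = 2 * x_k, so only the current state matters."""
    return 2 * trajectory[-1]


# Two hypothetical histories that end in the same state x_3 = 3.
traj_a = [0, 1, 3]
traj_b = [2, 2, 3]

dependent_a = path_dependent_latent(traj_a)
dependent_b = path_dependent_latent(traj_b)
independent_a = path_independent_latent(traj_a)
independent_b = path_independent_latent(traj_b)
```

The two histories agree on the path-independent latent but disagree on the path-dependent one, even though the final state of the world is identical.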

I think path-dependence comes up often in how humans handle the problem of identity. For example, if it were possible to clone a person perfectly and then remove the original from existence through whatever means, even if the resulting states of the world were identical, humans who have different trajectories of how we got there in their mental model could evaluate questions of identity in the present differently. Whether I’m me or not depends on more information than my current physical state or even the world’s current physical state.

This looks like it’s important for purposes of alignment because there’s a natural sense in which path-dependence is an undesirable property to have in your model of the world. If an AI doesn’t have that as an internal concept, it could be simpler for it to learn a strategy of “trick the people who believe in path-dependence into thinking the history that got us here was good” rather than “actually try to optimize for whatever their values are”.

With all that said, I’m interested in what other people think about this question. To what extent are human values path-dependent, and to what extent do you think they should be path-dependent? Both general thoughts & comments and concrete examples of situations where humans care about path-dependence are welcome.


tailcalled 15 Apr 2022 10:53 UTC
7 points
I think it’s important to remind yourself what the latent vs observable variables represent.
The observable variables are, well, the observables: sights, sounds, smells, touches, etc. Meanwhile, the latents are all concepts we have that aren’t directly observable, which yes includes intangibles like friendship, but also includes high-level objects like chairs and apples, or low-level objects like atoms.
One reason to mention this is because it has implications for your graphs. The causal graph would look more like this:
(As the graphs for HMMs tend to look. Arguably the sideways arrows for the xs are not needed, but I put them in anyway.)
Of course, that’s not to say that you can’t factor the probability distribution as you did, it just seems more accurate to call it something other than a causal graph. Maybe inferential graph? (I suppose you could call your original graph a causal graph of people’s psychology. But then the hs would only represent people’s estimates of their latent variables, which they would claim could differ from their “actual” latent variables, if e.g. they were mistaken.)
Anyway, much more importantly, I think this distinction also answers your question about path-dependence. There’d be lots of path-dependence, and it would not be undesirable to have the path-dependence. For example:
- If you observe that you get put into a simulation, but the simulation otherwise appears realistic, then you have path-dependence because you still know that there is an “outside the simulation”.
- If you observe an apple in your house, then you update your estimate of h to contain that apple. If you then leave your house, then x no longer shows the apple, but you keep believing in it, even though you wouldn’t believe it if your original observation of your house had not found an apple.
- If you get told that someone is celebrating their birthday tomorrow, then tomorrow you will believe that they are celebrating their birthday, even if you aren’t present there.
- Ege Erdil 15 Apr 2022 11:02 UTC
  1 point
  Parent
  I think you misunderstood my graph—the way I drew it was intentional, not a mistake. Probably I wasn’t explicit enough about how I was splitting the variables and what I do is somewhat different from what johnswentworth does, so let me explain.
  
  Some latent variables could have causal explanatory power, but I’m focusing on ones that don’t seem to have any such power because they are the ones human values depend on most strongly. For example, anything to do with qualia is not going to have any causal arrows going from it to what we can observe, but nevertheless we make inferences about people’s internal state of mind from what we externally observe of their behavior.
  
  As for my questions about path-dependence, I think your responses don’t address the question I meant to ask. For example,
  
  If you observe an apple in your house, then you update your estimate of h to contain that apple. If you then leave your house, then x no longer shows the apple, but you keep believing in it, even though you wouldn’t believe it if your original observation of your house had not found an apple.
  
  This is not a property of path-dependence in the sense I’m talking about it, because for me anything that has causal explanatory power goes into the state $x_{t}$ . This would include whether there actually is an apple in your house or not, even if your current sensory inputs show no evidence of an apple.
  
  EDIT: I notice now that there’s a central question here about to what extent the latent variables human values are defined over are causally relevant vs causally irrelevant. I assumed that states of mind wouldn’t be relevant but actually they could be causally relevant in the world model of the human even if they wouldn’t be in the “true model”, whatever that means.
  
  I think in this case I still want to say that human values are path-dependent. This is because I care more about whether the values end up being path-dependent in the “true model” and not in the human’s world model (which is imperfect), because a sufficiently powerful AGI would pick up the true model and then try to map its states to the latent variables that the human seems to care about. In other words, for it the latent variables could end up being causally irrelevant, even if for the human they aren’t. I’ve edited the post to reflect this.
  - tailcalled 15 Apr 2022 11:48 UTC
    2 points
    Parent
    I’m still not entirely sure how you classify variables as latent vs observed. Could you classify these as “latent”, “observed” or “ambiguous”?
    The light patterns that hit the photoreceptors in your eyes
    The inferences output by your visual cortex
    A person, in the broad sense including e.g. ems
    A human, in the narrow sense of a biological being
    An apple
    A chair
    An atom
    A lie
    A friendship
    - tailcalled 15 Apr 2022 11:56 UTC
      3 points
      Parent
      Wait I guess I’m dumb, this is explained in the OP.
      - Ege Erdil 15 Apr 2022 12:06 UTC
        1 point
        Parent
        I’ve edited the post after the fact to clarify what I meant, so I don’t think you’re dumb (in the sense that I don’t think you missed something that was there). I was just not clear enough the first time around.
        tailcalled 15 Apr 2022 12:07 UTC
        3 points
        Parent
        Ah ok, didn’t realize it was edited.
        Ege Erdil 15 Apr 2022 12:11 UTC
        1 point
        Parent
        I posted this question less than 20 minutes after the thought occurred to me, so I didn’t understand what was going on well enough & couldn’t express my thoughts properly as a consequence. Your answer helped clarify my thoughts, so thanks for that!
    - Ege Erdil 15 Apr 2022 12:01 UTC
      1 point
      Parent
      Whether particular variables are latent or not is a property relative to what the “correct model” ends up being. Given our current understanding of physics, I’d classify your examples like this:
      
      The light patterns that hit the photoreceptors in your eyes: Observed
      The inferences output by your visual cortex: Ambiguous
      A person, in the broad sense including e.g. ems: Latent
      A human, in the narrow sense of a biological being: Latent
      An apple: Latent
      A chair: Latent
      An atom: Ambiguous
      A lie: Latent
      A friendship: Latent
      
      With visual cortex inferences and atoms, I think the distinction is fuzzy enough that you have to specify exactly what you mean.
      
      It’s important to notice that atoms are “latent” in both chemistry and quantum field theory in the usual sense, but they are causally relevant in chemistry while they probably aren’t in quantum field theory, so in the context of my question I’d say atoms are observed in chemistry and latent in QFT.
      
      The realization I had while responding to your answer was that I really care about the model that an AGI would learn and not the models that humans use right now, and whether a particular variable is downstream or upstream of observed variables (so, whether they are latent or not in the sense I’ve been using the word here) depends on what the world model you’re using actually is.
tailcalled 15 Apr 2022 12:24 UTC
3 points
Here’s a partial answer to the question:
It seems like one common and useful type of abstraction is aggregating together distinct things with similar effects. Some examples:
- “Heat” can be seen as the aggregation of molecular motion in all sorts of directions; because of chaos, the different directions and different molecules don’t really matter, and therefore we can usefully just add up all their kinetic energies into a variable called “heat”.
- A species like “humans” can be seen as the aggregation (though disjunction rather than sum) of many distinct genetic patterns. However, ultimately the genetic patterns are simple enough that they all code for basically the same thing.
- A person like me can be seen as the aggregation of my state through my entire life trajectory. (Again unlike heat, this would be disjunction rather than sum.) A major part of why the abstraction of “tailcalled” makes sense is that I am causally somewhat consistent across my life trajectory.
An abstraction that aggregates distinct things with similar effects seems like it has a reasonably good chance to be un-path-dependent. However, it’s not quite guaranteed, which you can see by e.g. the third example. While I will have broadly similar effects through my life trajectory, the effects I will have will change over time, and the way they change may depend on what happens to me. For instance if my brain got destructively scanned and uploaded while my body was left behind, then my effects would be “split”, with my psychology continuing into the upload while my appearance stayed with my dead body (until it decayed).
- Ege Erdil 15 Apr 2022 13:42 UTC
  1 point
  Parent
  
  “Heat” can be seen as the aggregation of molecular motion in all sorts of directions; because of chaos, the different directions and different molecules don’t really matter, and therefore we can usefully just add up all their kinetic energies into a variable called “heat”.
  
  Nitpick: this is not strictly correct. This would be the internal energy of a thermodynamic system, but “heat” in thermodynamics refers to energy that’s exchanged between systems, not energy that’s in a system.
  
  Aside from the nitpick, however, point taken.
  
  An abstraction that aggregates distinct things with similar effects seems like it has a reasonably good chance to be un-path-dependent. However, it’s not quite guaranteed, which you can see by e.g. the third example. While I will have broadly similar effects through my life trajectory, the effects I will have will change over time, and the way they change may depend on what happens to me. For instance if my brain got destructively scanned and uploaded while my body was left behind, then my effects would be “split”, with my psychology continuing into the upload while my appearance stayed with my dead body (until it decayed).
  
  I think there is a general problem with these path-dependent concepts in that the ideal version of the concept might be path-dependent, but in practice we can only work within the physical state to keep track of what the path used to be. It’s analogous to how an idealized version of personal identity might require a continuous stream of gradually changing agents and so on, but in practice all we have to go on is what memories people have about how things used to be.
  
  For example, in Lockean property rights theory, “who is the rightful owner of a house” is a path-dependent question. You need to trace the entire history of the house in order to figure out who should own it right now. However, in practice we have to implement property rights by storing some information about the ownership of the house in the current physical state.
  
  If you then train an AI to understand the ownership relation and it learns the relation that we have actually implemented rather than the idealized version we have in mind, it can think that what we really care about is who is “recorded” as the owner of a house in the current physical state rather than who is “legitimately” the owner of the house, and in the extreme cases that can lead it to take some bizarre actions when you ask it to optimize something that has to do with the concept of property rights.
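The gap between the two notions of ownership can be sketched in a few lines. The rules below are a deliberately crude toy, not a faithful rendering of Lockean theory, and the event format, the `registry` dictionary, and both function names are hypothetical.

```python
def legitimate_owner(history):
    """Replay the full history: only homesteading and sales by the
    current rightful owner change who legitimately owns the house."""
    owner = None
    for event in history:
        if event["kind"] == "homestead" and owner is None:
            owner = event["to"]
        elif event["kind"] == "sale" and event["from"] == owner:
            owner = event["to"]
        # Any other event (e.g. a theft) leaves the rightful owner unchanged.
    return owner


def recorded_owner(registry, house):
    """Look up the owner stored in the current physical state."""
    return registry[house]


# Hypothetical history in which the registry ends up reflecting a theft.
history = [
    {"kind": "homestead", "to": "alice"},
    {"kind": "sale", "from": "alice", "to": "bob"},
    {"kind": "theft", "from": "bob", "to": "carol"},
]
registry = {"house": "carol"}

rightful = legitimate_owner(history)
on_record = recorded_owner(registry, "house")
```

An optimizer that only ever sees `registry` has no way to tell this world apart from one in which the house was sold to the recorded owner, which is the failure mode described above.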
  
  In the end, I think it comes down to which way of doing it requires less complexity, or fewer bits of information, in whatever representation the AI is using to encode these relations. If path-dependent concepts are naturally more complicated for the AI to wrap its head around, SGD can find something that’s path-independent and that fits the training data perfectly, and then you could be in trouble. This is a general story with alignment failure but if we decide we really care about path-dependence then it’s also a concept we’ll want to get the AI to care about somehow.
  - tailcalled 15 Apr 2022 20:47 UTC
    3 points
    Parent
    For example, in Lockean property rights theory, “who is the rightful owner of a house” is a path-dependent question.
    Ah, I think this is a fundamentally different kind of abstraction than the “aggregating together distinct things with similar effects” type of abstraction I mentioned. To distinguish, I suggest we use the name “causal abstraction” for the kind I mentioned, and the name “protocol abstraction” (or something else) for this concept. So:
    Causal abstraction: aggregating together distinct phenomena that have similar causal relations into a lumpy concept that can be modelled as having the same causal relations as its constituents
    Protocol abstraction: extending your ontology with new “epiphenomenal” variables that follow certain made-up rules (primarily for the use in social coordination, so that there is a ground truth even with deception? - but can also be used on an individual level, in values)
    It’s analogous to how an idealized version of personal identity might require a continuous stream of gradually changing agents and so on, but in practice all we have to go on is what memories people have about how things used to be.
    I feel like personal identity has both elements of causal abstraction and of protocol abstraction. E.g. social relationships like debts seem to be strongly tied to protocol abstraction, but there’s also lots of social behavior that only relies on causal abstraction.
    If you then train an AI to understand the ownership relation and it learns the relation that we have actually implemented rather than the idealized version we have in mind, it can think that what we really care about is who is “recorded” as the owner of a house in the current physical state rather than who is “legitimately” the owner of the house, and in the extreme cases that can lead it to take some bizarre actions when you ask it to optimize something that has to do with the concept of property rights.
    I agree.
    Coming up with a normative theory of agency in the case of protocol abstraction actually sounds like a fairly important task. I have some ideas about how to address causal abstraction, but I haven’t really thought much about protocol abstraction before.
    - Ege Erdil 15 Apr 2022 21:54 UTC
      1 point
      Parent
      I think your distinction between causal and protocol abstractions makes sense and it’s related to my distinction between causally relevant vs causally irrelevant latent variables. It’s not quite the same, because abstractions which are rendered causally irrelevant in some world model can still be causal in the sense of aggregating together a bunch of things with similar causal properties.
      
      I feel like personal identity has both elements of causal abstraction and of protocol abstraction. E.g. social relationships like debts seem to be strongly tied to protocol abstraction, but there’s also lots of social behavior that only relies on causal abstraction.
      
      I agree.
      
      Coming up with a normative theory of agency in the case of protocol abstraction actually sounds like a fairly important task. I have some ideas about how to address causal abstraction, but I haven’t really thought much about protocol abstraction before.
      
      Can you clarify what you mean by a “normative theory of agency”? I don’t think I’ve ever seen this phrase before.
      - tailcalled 15 Apr 2022 22:04 UTC
        2 points
        Parent
        Can you clarify what you mean by a “normative theory of agency”? I don’t think I’ve ever seen this phrase before.
        What I mean is stuff like decision theory/selection theorems/rationality; studies of what kinds of ways agents normatively should act.
        Usually such theories do not take abstractions into account. I have some ideas for how to take causal abstractions into account, but I don’t think I’ve seen protocol abstractions investigated much.
        In a sense, they could technically be handled by just having utility functions over universe trajectories rather than universe states, but there are some things about this that seem unnatural (e.g. for the purpose of Alex Turner’s power-seeking theorems, utility functions over trajectories may be extraordinarily power-seeking, and so if we could find a narrower class of utility functions, that would be useful).
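As a minimal sketch of why trajectory utilities are the strictly larger class: every utility over states can be lifted to one over trajectories by looking only at the final state, but not every trajectory utility arises this way. The state labels and function names here are hypothetical.

```python
def lift_state_utility(state_utility):
    """Embed a utility over states as a utility over trajectories
    that depends only on the final state."""
    def trajectory_utility(trajectory):
        return state_utility(trajectory[-1])
    return trajectory_utility


def history_sensitive_utility(trajectory):
    """A trajectory utility that penalizes any theft along the way."""
    return -1.0 if "stolen" in trajectory else 1.0


# Two hypothetical trajectories ending in the same state.
bought = ["owned_by_alice", "sold", "owned_by_bob"]
stolen = ["owned_by_alice", "stolen", "owned_by_bob"]

lifted = lift_state_utility(lambda state: 1.0 if state == "owned_by_bob" else 0.0)
```

Any lifted utility must assign `bought` and `stolen` the same value because they share a final state, whereas `history_sensitive_utility` separates them, so it cannot be expressed as a utility over states.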
