Richard_Ngo comments on Value systematization: how values become coherent (and misaligned)

Richard_Ngo 12 Jan 2024 2:24 UTC
LW: 3 AF: 2
Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn’t forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e.g., the gear’s role in the whole mechanism)?
I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating, and it’s fine if the specific gear gets swapped out for a copy. I’m not assuming there’s a mistake anywhere, I’m just assuming you switch from caring about one type of property it has (physical) to another (functional).
In general, in the higher-abstraction model each component will acquire new relational/functional properties which may end up being prioritized over the physical properties it had in the lower-abstraction model.
I picture you saying “well, you could just not prioritize them”. But in some cases this adds a bunch of complexity. E.g. suppose that you start off by valuing “this particular gear”, but you realize that atoms are constantly being removed and new ones added (implausibly, but let’s assume it’s a self-repairing gear) and so there’s no clear line between this gear and some other gear. By contrast, if we assume that there is a clear, simple definition of “the gear that moves the piston”, then valuing that could be much simpler.
Zooming out: previously you said
I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that’s a separate topic. I think that, if you have some value $v(x)$, and it doesn’t run into direct conflict with any other values you have, and your model of $x$ isn’t wrong at the abstraction level it’s defined at, you’ll never want to change $v(x)$.
I’m worried that we’re just talking about different things here, because I totally agree with what you’re saying. My main claims are twofold. First, insofar as you value simplicity (which I think most agents strongly do), you’re going to systematize your values. Second, insofar as you have an incomplete ontology (which every agent does) and you value having well-defined preferences over a wide range of situations, you’re going to systematize your values.
Separately, if you have neither of these things, you might find yourself identifying instrumental strategies that are very abstract (or very concrete). That seems fine, no objections there. If you then cache these instrumental strategies, and forget to update them, then that might look very similar to value systematization or concretization. But it could also look very different—e.g. the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations. So I think there are two separate things going on here.
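To make the physical-versus-functional distinction in the gear example concrete, here is a minimal Python sketch. Every name in it (`Gear`, `values_this_gear`, `values_piston_gear`, the serial numbers) is a hypothetical illustration rather than anything from the post. It only shows that the two valuations agree before the gear is swapped and come apart afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gear:
    serial: str      # physical identity (hypothetical label)
    rotating: bool


def values_this_gear(mechanism: dict, original_serial: str) -> bool:
    """Physical valuation: satisfied iff the original gear itself is rotating."""
    return any(g.serial == original_serial and g.rotating
               for g in mechanism.values())


def values_piston_gear(mechanism: dict) -> bool:
    """Functional valuation: satisfied iff whatever gear fills the piston slot is rotating."""
    return mechanism["piston"].rotating


# The mechanism maps a functional slot to the gear currently filling it.
before = {"piston": Gear("A-001", rotating=True)}
# Swap in a copy: same role, different physical object.
after = {"piston": Gear("A-002", rotating=True)}

physical_before = values_this_gear(before, "A-001")
physical_after = values_this_gear(after, "A-001")
functional_before = values_piston_gear(before)
functional_after = values_piston_gear(after)
```

Note that the functional valuation never needs to track which physical object is in the slot, which is one way of reading the claim that it can be simpler to specify when gear identity is fuzzy.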
- Thane Ruthenis 12 Jan 2024 3:07 UTC
  LW: 2 AF: 1
  E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating
  That seems more like value reflection, rather than a value change?
  The way I’d model it is: you have some value $v(x)$, whose implementations you can’t inspect directly, and some guess about what it is, $P(v(x))$. (That’s how it often works in humans: we don’t have direct knowledge of how some of our values are implemented.) Before you were introduced to the question $Q$ of “what if we swap the gear for a different one: which one would you care about then?”, your model of that value put the majority of probability mass on $v_{1}(x)$, which was “I value this particular gear”. But upon considering $Q$, your probability distribution over $v(x)$ changed, and now it puts most probability on $v_{2}(x)$, defined as “I care about whatever gear is moving the piston”.
  Importantly, that example doesn’t seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly.
  (Notably, that’s only possible in scenarios where you don’t have direct access to your values! Where they’re black-boxed, and you have to infer their internals from the outside.)
  the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations
  Sounds right, yep. I’d argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments.
  insofar as you value simplicity (which I think most agents strongly do) then you’re going to systematize your values
  Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and the further it wants to shift them, the stronger it needs to be. The simplest values are no values, after all.
  I suppose I see what you’re getting at here, and I agree that it’s a real dynamic. But I think it’s less important/load-bearing to how agents work than the basic “value translation in a hierarchical world-model” dynamic I’d outlined. Mainly because it routes through the additional assumption of the agent having a strong preference for simplicity.
  And I think it’s not even particularly strong in humans? “I stopped caring about that person because they were too temperamental and hard-to-please; instead, I found a new partner who’s easier to get along with” is something that definitely happens. But most instances of value extrapolation aren’t like this.
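The "value reflection" model described in the reply above (a guess $P(v(x))$ over candidate values that shifts from $v_{1}$ to $v_{2}$ once the question $Q$ is considered) can be sketched as a Bayesian update over hypotheses about one's own value. This is only an illustrative sketch: the prior and likelihood numbers and the `update` helper are assumptions made up for the example, not part of the original discussion.

```python
def update(prior: dict, likelihood: dict) -> dict:
    """Bayes' rule over a finite set of hypotheses about one's own value."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypotheses about what v(x) actually is (it is a black box to the agent).
#   v1: "I value this particular gear"
#   v2: "I care about whatever gear is moving the piston"
prior = {"v1": 0.8, "v2": 0.2}  # assumed: most mass on v1 before considering Q

# Evidence from considering Q: the agent notices it would be fine with a
# swapped-in copy. Assumed probability of that reaction under each hypothesis.
likelihood = {"v1": 0.1, "v2": 0.9}

posterior = update(prior, likelihood)
```

Nothing in this sketch touches the agent's object-level model of the mechanism. Only the distribution over what the value is changes, which is the sense in which the reply calls it learning about one's values rather than changing them.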