Thane Ruthenis comments on Value systematization: how values become coherent (and misaligned)

Thane Ruthenis 12 Jan 2024 3:07 UTC
LW: 2 AF: 1
0
AF
E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating
That seems more like value reflection, rather than a value change?
The way I’d model it is: you have some value $v (x)$ , whose implementations you can’t inspect directly, and some guess about what it is $P (v (x))$ . (That’s how it often works in humans: we don’t have direct knowledge of how some of our values are implemented.) Before you were introduced to the question $Q$ of “what if we swap the gear for a different one: which one would you care about then?”, your model of that value put the majority of probability mass on $v_{1} (x)$ , which was “I value this particular gear”. But upon considering $Q$ , your PD over $v (x)$ changed, and now it puts most probability on $v_{2} (x)$ , defined as “I care about whatever gear is moving the piston”.
Importantly, that example doesn’t seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly.
(Notably, that’s only possible in scenarios where you don’t have direct access to your values! Where they’re black-boxed, and you have to infer their internals from the outside.)
the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations
Sounds right, yep. I’d argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments.
insofar as you value simplicity (which I think most agents strongly do) then you’re going to systematize your values
Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and it needs to be stronger than them the more it wants to shift them. The simplest values are no values, after all.
I suppose I see what you’re getting at here, and I agree that it’s a real dynamic. But I think it’s less important/load-bearing to how agents work than the basic “value translation in a hierarchical world-model” dynamic I’d outlined. Mainly because it routes through the additional assumption of the agent having a strong preference for simplicity.
And I think it’s not even particularly strong in humans? “I stopped caring about that person because they were too temperamental and hard-to-please; instead, I found a new partner who’s easier to get along with” is something that definitely happens. But most instances of value extrapolation aren’t like this.