In particular, the brain tries to compress the reward stream by modeling it as some (noisy) signal generated from value-assignments to patterns in the brain’s environment. So e.g. the brain might notice a pattern-in-the-environment which we label “sports car”, and if the reward stream tends to spit out positive signals around sports cars (which aren’t already accounted for by the brain’s existing value-assignments to other things), then the brain will (marginally) compress that reward stream by modeling it as (partially) generated from a high value-assignment to sports cars. See the linked posts for a less compressed explanation and various subtleties.
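To make the compression picture a bit more concrete, here’s a toy sketch of my own (not from the linked posts; the patterns, numbers, and noise model are all made up): treat each reward sample as a noisy sum of value-assignments to whichever patterns are present at that moment, and recover the value-assignments that best explain the observed reward stream.

```python
import numpy as np

# Toy illustration: model each reward sample as a noisy sum of
# value-assignments to the patterns present in that moment.
rng = np.random.default_rng(0)

patterns = ["sports car", "friend smiling", "paperwork"]
true_values = np.array([2.0, 3.0, -1.0])  # hypothetical "true" value-assignments

# Each row records which patterns the brain noticed in that moment (1 = present).
present = rng.integers(0, 2, size=(500, len(patterns)))
rewards = present @ true_values + rng.normal(0.0, 0.5, size=500)  # noisy reward stream

# "Compressing" the reward stream = finding value-assignments that explain it.
estimated_values, *_ = np.linalg.lstsq(present, rewards, rcond=None)
for name, v in zip(patterns, estimated_values):
    print(f"{name}: estimated value ≈ {v:+.2f}")
```

The recovered assignments land near the ones used to generate the stream, which is the sense in which a few value-assignments “compress” many reward samples.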
I’m not sure why we can’t just go with an explanation like… imagine a human with zero neuroplasticity, something like a weight-frozen LLM. Its behaviors will still tend to place certain attractor states into whatever larger systems it’s embedded within, and we can call those states the values of one aspect of the system. Unfreeze the brain, though, resuming the RL, and now the set of attractor states the human embeds in whatever its surroundings are will change. You just won’t be able to extract as much info about what the overall unfrozen system’s values are: you can’t just ask the current human what it would do in some hypothetical situation and get decent answers (modulo self-deception etc.), because the RL could change what would be the frozen-human’s values ~arbitrarily between now and when the situation you’re describing comes to pass.
Uh, I’m not sure if that makes what I have in mind sufficiently obvious, but I don’t personally feel very confused about this question; if that explanation leaves something to be desired, lmk and I can take another crack at it.
Wording that the way I’d normally think of it: roughly speaking, a human has well-defined values at any given time, and RL changes those values over time. Let’s call that the change-over-time model.
One potentially confusing terminological point: the thing that the change-over-time model calls “values” is, I think, the thing I’d call “human’s current estimate of their values” under my model.
The main shortcoming I see in the change-over-time model is closely related to that terminological point. Under my model, there’s this thing which represents the human’s current estimate of their own values. And crucially, that thing is going to change over time in mostly the sort of way that beliefs change over time. (TBC, I’m not saying peoples’ underlying values never change; they do. But I claim that most changes in practice are in estimates-of-values, not in actual values.) On the other hand, the change-over-time model is much more agnostic about how values change over time—it predicts that RL will change them somehow, but the specifics mostly depend on those reward signals. So these models make different predictions: my model makes a narrower prediction about how things change over time—e.g. I predict that “I thought I valued X but realized I didn’t really” is the typical case, while “I used to value X but now I don’t” is an atypical case. Of course this is difficult to check because most people don’t carefully track the distinction between those two (so just asking them probably won’t measure what we want), but it’s in-principle testable.
There are probably other substantive differences, but that’s the one which jumps out to me right now.
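To spell out the contrast in predictions, here’s a toy sketch of my own (the dynamics and numbers are invented purely for illustration): in one case the underlying value of X is fixed and only the person’s estimate of it updates, belief-style; in the other, the underlying value itself drifts with reward signals.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200

# Estimate-changes model: the underlying value of X is fixed; the reported
# value is a belief-like estimate that updates toward it as evidence accumulates.
true_value = 0.2          # hypothetical fixed underlying value of X
estimate = 1.0            # starts out thinking they value X highly
for _ in range(T):
    evidence = true_value + rng.normal(0.0, 0.3)
    estimate += 0.05 * (evidence - estimate)   # belief-style update

# Change-over-time model: the underlying value itself drifts with reward signals.
value = 1.0
for _ in range(T):
    value += rng.normal(-0.01, 0.05)           # arbitrary reward-driven drift

print(f"Estimate-changes model: report {estimate:.2f}, underlying value still {true_value:.2f}")
print(f"Change-over-time model: underlying value now {value:.2f} (started at 1.0)")
```

In the first case the report converges on a fixed target (“I thought I valued X but realized I didn’t really”); in the second the target itself has wandered (“I used to value X but now I don’t”).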