To word that the way I’d normally think of it: roughly speaking, a human has well-defined values at any given time, and RL changes those values over time. Let’s call that the change-over-time model.
One potentially confusing terminological point: the thing that the change-over-time model calls “values” is, I think, the thing I’d call “human’s current estimate of their values” under my model.
The main shortcoming I see in the change-over-time model is closely related to that terminological point. Under my model, there’s this thing which represents the human’s current estimate of their own values, and crucially, that thing changes over time in mostly the sort of way beliefs change over time. (TBC, I’m not saying people’s underlying values never change; they do. But I claim that most changes in practice are in estimates-of-values, not in actual values.) The change-over-time model, on the other hand, is much more agnostic about how values change over time: it predicts that RL will change them somehow, but the specifics mostly depend on the reward signals. So the two models make different predictions. Mine makes a narrower prediction about how things change over time; e.g., I predict that “I thought I valued X but realized I didn’t really” is the typical case, and “I used to value X but now I don’t” is an atypical case. Of course this is difficult to check, because most people don’t carefully track the distinction between those two (so just asking them probably won’t measure what we want), but it’s in-principle testable.
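To make the contrast concrete, here’s a toy sketch (entirely my own illustration; the function names, dynamics, and numbers are made-up assumptions, not anything either model commits to): under the estimate model, a fixed latent value is inferred from noisy evidence, so the estimate converges toward something; under the change-over-time model, the value itself is pushed around by whatever reward signals arrive, so there’s no fixed target to recover.

```python
import random

def estimate_model(true_value=1.0, steps=50, noise=0.5):
    """Latent value is fixed; only the agent's *estimate* moves (a Kalman-style belief update)."""
    est, var = 0.0, 1.0  # prior mean and variance over the latent value
    for _ in range(steps):
        obs = true_value + random.gauss(0, noise)  # noisy introspective evidence
        k = var / (var + noise**2)                 # Kalman gain
        est += k * (obs - est)                     # estimate converges toward true_value
        var *= (1 - k)
    return est  # "I thought I valued X, but realized I didn't really"

def change_over_time_model(value=1.0, steps=50, lr=0.1):
    """The value itself is rewritten by whatever reward signals happen to arrive."""
    for _ in range(steps):
        reward = random.gauss(0, 1)  # arbitrary reward signal
        value += lr * reward         # value drifts; no fixed target to recover
    return value  # "I used to value X, but now I don't"
```

The point of the sketch is just the shape of the predictions: the first dynamic produces convergence (changes look like corrections to an estimate), the second produces a drift whose direction depends entirely on the reward stream.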
There are probably other substantive differences, but that’s the one which jumps out to me right now.