Sounds backwards to me. It seems more like “our values are those things that we anticipate will bring us reward” than that rewards are what tell us about our values.
When you say “I thought I wanted X, but then I tried it and it was pretty meh,” that seems wrong to me. You really DID want X: you valued it because you thought it would bring you reward, and you just happened to be wrong. It’s fine to be wrong about your anticipations, but it’s strange to say you were wrong about your values. Saying that your values changed feels like a cop-out, and it certainly isn’t helpful when considering AI alignment: it suggests we can never truly know our values, only say “not that” whenever we encounter counterevidence. Our rewards seem much more real and stable.