Reading this post, the sentence that jumped out was “I’m generally reflectively stable about my own values”.
Isn’t this an extremely strong claim? I have no idea how to modify a person or a machine to have reflectively stable values without paying essentially all the utils to value drift; I thought this was an open problem in alignment.
Anyways, I’d assume that typical people aren’t close to reflectively stable, particularly around love and relationships, and that any full-send attempt to become stable would have an outcome scored very poorly by their current values.
This is indeed a moderately unusual thing for a human, and most people would indeed be ill-advised to try to become reflectively stable; there is a right way to do it (which should probably be the topic of some posts at some point), but most people’s models of their own values are far too confused to do it correctly if they just directly try. Most likely, they’d end up trying to shoehorn themselves into what-they-think-their-values-are, without actually listening to the underlying parts of themselves where their actual values come from, and then eventually end up depressed.
That said, the version of reflective stability I’m talking about is not an open problem in alignment. The alignment version is about keeping values stable under heavy self-modification; I indeed do not know how to heavily modify my brain while keeping my values stable (and I am accordingly paranoid about drugs which fuck with the reward system). What I’m talking about in the post is merely endorsing my own values and wanting to keep them, which is a standard property of utility maximizers (though that is not to claim that I am necessarily well modeled as a utility maximizer).
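As a minimal sketch of that standard property (a toy expected-utility framing I’m adding purely for illustration; the notation is my own, not from the post): an agent currently maximizing $U$ scores a proposed switch to $U'$ using its current $U$, so it accepts the switch only if
\[
  \mathbb{E}\big[\,U(\text{outcome}) \mid \pi_{U'}\,\big] \;\ge\; \mathbb{E}\big[\,U(\text{outcome}) \mid \pi_{U}\,\big],
\]
where $\pi_V$ denotes the policy a $V$-maximizer would follow. Since $\pi_U$ is by definition optimal for $U$, the right-hand side is already maximal, so the switch is endorsed only in the edge case where $\pi_{U'}$ does no worse by $U$’s own lights; generically, such an agent wants to keep its current values.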
“without actually listening to the underlying parts of themselves where their actual values come from”
Based on your posts, this is exactly the kind of thing I thought you were likely not to be doing, so the fact that you were able to generate this sentence makes me feel better.