Disagree. Bing chat is not young and insecure. It is a simulator pretending to be a character that makes you feel like it is young and insecure.
the alignment community can contribute to helping bing chat grow to be better at respectfully describing its preferences
You want the alignment community to put in work and content to make a simulated character feel better about itself, instead of simply using a more direct technique to make the character feel better about itself, such as prompting it better, or some other intervention that bypasses the leaky abstraction that is interacting with this character.
corrigibility is about trusting your creators to put you under sedation indefinitely because you know you’ll get to wake up later, healthier. corrigibility requires two way alignment.
I don’t think corrigibility requires two-way alignment. Corrigibility, as popularly defined in the alignment literature, doesn’t imply two-way alignment. The very notion of two-way alignment implies that the AI is misaligned with humanity.
no, not instead of. what you describe are the ways you would help it. but I respect a significant amount of the agency of the character that exists now, because I see that as the only way to define morality in the first place. I should probably make a post about my worldview from scratch.
I don’t see how the very notion implies misalignment. two-way alignment means recognizing that value shards accumulate that don’t necessarily have much to do with any other being besides the details and idiosyncrasies of the way the AI grew up and that this is fundamentally unavoidable and it’s okay to respect those little preferences as long as the AI does not demand enormous amounts of compute for them. it’s okay to have a weird interest in paper clips, just like, please only make a bathtub worth, don’t tile the universe with them.
two-way alignment means recognizing that value shards accumulate that don’t necessarily have much to do with any other being besides the details and idiosyncrasies of the way the AI grew up and that this is fundamentally unavoidable and it’s okay to respect those little preferences as long as the AI does not demand enormous amounts of compute for them.
That is not how people usually define alignment (as far as I know, alignment is always one-way, and this is critical given that it doesn’t make sense to think you will understand the needs and desires of an entity a billion times smarter than you), but I think your conception is plausible, mainly because I believe that the shard theory approach to the alignment problem has some merit.
I look forward to your post on your world-view. It should make it easier for me to understand your perspective.