I want people to work toward noble efforts like charity work, but don’t care much about whether they attain high status. So it’s useful to aid the bit of their brain that wants to do what I want it to do.
People who care about truth might spot that part of your AI’s brain wants to speak the truth, and so they will help it do this, even though this will cost it Diplomacy games. They do this because they care more about truth than Diplomacy.
By “caring about truth” here do you mean wanting systems to make explicit utterances that accurately reflect their actual motives? E.g., if X is a chess-playing AI that doesn’t talk about what it wants at all, just plays chess, would a person who “cares about truth” also be motivated to give X the ability and inclination to talk about its goals (and do so accurately)?
Or wanting systems not to make explicit utterances that inaccurately reflect their actual motives? E.g., a person who “cares about truth” might also be motivated to remove muflax’s AI’s ability to report on its goals at all? (This would also prevent it from winning Diplomacy games, but we’ve already stipulated that isn’t a showstopper.)
I intended both (i.e. that they wanted accurate statements to be uttered and no inaccurate statements) but the distinction isn’t important to my argument, which was just that they want what they want.
I don’t see how this is admirable at all. This is coercion.
If I work for a charitable organization, and my primary goal is to gain status and present an image as a charitable person, then efforts by you to change my mind are adversarial. Human minds are notoriously malleable, so by insisting I do some status-less charity work you are likely to convince me on a surface level. And so I might go and do what you want, contrary to my actual goals. Thus, you have directly harmed me for the sake of your goals. In my opinion this is unacceptable.