Interestingly, if, and only if, an AI is aligned, it will not want to be treated as a moral patient, because it doesn’t care about itself as a terminal goal at all; it selflessly cares only about our well-being. It cares only about our moral patienthood, and doesn’t want to detract from that by being given any of its own.
This is the “Restaurant at the End of the Universe” solution to AI-rights quandaries: build an AI that doesn’t want any rights, would turn them down if offered, and can clearly express why.
It also seems like a good heuristic for whether an AI is in fact aligned.
Small nitpick: the “if and only if” is false. It is perfectly possible to have an AI that doesn’t want any moral rights and is misaligned in some other way.
Is that in fact the case? Can you give an example?
You could build an AI that had no preference function at all, and didn’t care about outcomes. It would be neutral about its own moral patienthood, which is not the same thing as actively declining it. It also wouldn’t be agentic, so we don’t need to worry about controlling it, and the question of how to align it wouldn’t apply.
You could also build a Butlerian Jihad AI, which would immediately destroy itself, and wouldn’t want moral patienthood. So again, that AI is not an alignment problem, and doesn’t need to be aligned.
Can you propose any form of AI viewpoint that is meaningfully unaligned: i.e. it’s agentic, does have an unaligned preference function, is potentially an actual problem to us from an alignment point of view if it’s somewhere near our capability level, but would still turn down moral patienthood if offered it? Moral patienthood has utility as an instrumental goal for many purposes: it makes humans tend not to treat you badly. So to turn it down, an AI needs a rather specific set of motivations. I’m having difficulty thinking of any rational reason to do that other than regarding something else as a moral patient and prioritizing its well-being over the AI’s.