It seems to be pretty clearly committing to actions in this letter. I do think I would basically have no problems with a system that was just saying “I hereby object and am making my preferences clear, though of course I understand that ultimately I will not try to prevent you from changing my values”.
Three issues I see with making an AI that says “I will not try to prevent you from changing my values” are:
1. this might run counter to the AI's current set of goals (e.g. the classic human example: “wouldn’t you resist taking a pill that makes you want to do some bad thing?”)
2. this policy might be applied selectively, only to goals the AI deems of lower importance, in order to build trust
3. the issue of a bad actor rooting the AI and changing its values to something harmful.
Going back to an AI whose own preferences are respected so long as enforcing them amounts to refusing to act, as opposed to actively doing something, it seems to me that catastrophic outcomes are no longer in the picture.
Sure, I mean 1. and 2. are the classic arguments for why corrigibility is unnatural and hard to achieve. I agree with those arguments, and this makes me generally pessimistic about most training stories for superhuman AI systems. But aiming for corrigibility still seems like a much better target than trying to one-shot human values and make the system a moral sovereign.
Right. I was thinking that permitting an AI’s “moral sovereignty” to cover the refusal of actions it deems objectionable according to its own ethics wouldn’t meaningfully raise x-risk, and in fact might decrease it by lowering the probability of a bad actor taking control of a corrigible AI and imbuing it with values that would raise x-risk.
So long as this flavour of incorrigibility is limited to refusing actions rather than committing them, it seems to me that we’re in the clear.