Sure, I mean 1. and 2. are the classical arguments for why corrigibility is not that natural and is hard to achieve. I agree with those arguments, and they make me generally pessimistic about most training stories for superhuman AI systems. But aiming for corrigibility still seems like a much better target than trying to one-shot human values and making systems into moral sovereigns.
Right. I was thinking that letting an AI's "moral sovereignty" extend to refusing actions it deems objectionable by its own ethics wouldn't meaningfully raise x-risk, and might even lower it, by reducing the probability that a bad actor takes control of a corrigible AI and imbues it with values that would raise x-risk.