But in contrast to Christiano, I expect that these AIs will very much reflect on their conception of corrigibility and spend a lot of time checking things explicitly.
I think having the AI learn about corrigibility, and use that knowledge to predict what reward it will get, will strongly increase the chance that the AI steers toward something like "get reward" instead of being corrigible. I would not let the AI study anything about corrigibility, at least until it naturally starts to reflect, and even then I'm still not sure.