Would you agree that we have about as much of a handle on what corrigibility is as we do on what an agent is? Like, I claim that I have some knowledge about corrigibility, even though it’s imperfect and I have remaining confusions. And I’m wondering whether you think humanity is deeply confused about what corrigibility even is, or whether you think it’s more like we have a handle on it but can’t quite give its True Name.
Would you agree that we have about as much of a handle on what corrigibility is as we do on what an agent is? Like, I claim that I have some knowledge about corrigibility, even though it’s imperfect and I have remaining confusions. And I’m wondering whether you think humanity is deeply confused about what corrigibility even is, or whether you think it’s more like we have a handle on it but can’t quite give its True Name.
More of my thoughts here: https://www.lesswrong.com/posts/txNsg8hKLmnvkuqw4/worlds-where-iterative-design-succeeds