As for how that gets to “definitely can’t”: the problem above means that, even if we nominally have time to fiddle and test the system, iteration would not actually be able to fix the relevant problems. And so the situation is strategically equivalent to “we need to get it right on the first shot”, at least for the core difficult parts (e.g. understanding what we’re even aiming for).
And as for why that’s hard to the point of de-facto impossibility with current knowledge… try the ball-cup exercise, then consider the level of detailed understanding required to get a ball into a cup on the first shot, and then imagine what it would look like to understand corrigible AI at that level.
Thanks for this follow-up. My basic thought on the comment above this one is that, while I agree that you definitely can’t get a perfectly corrigible agent on your first try, you might, by virtue of the training data resembling the lab setting, get something that in practice doesn’t go off the rails, and instead allows some testing and iterative refinement (perhaps with the assistance of the AI). So I think “iteration [can/can’t] fix a semi-corrigible agent” is the central crux.
I just read your WWIDF post (upvoted!) and while I agree that the issues you point out are pernicious, I don’t quite feel like they crushed my sense of hope. Unfortunately the disconnect feels a bit wordless inside me at the moment, so I’ll focus on it and see if I can figure out what’s going on.
Would you agree that we have about as much of a handle on what corrigibility is as we do on what an agent is? Like, I claim that I have some knowledge about corrigibility, even though it’s imperfect and I have remaining confusions. And I’m wondering whether you think humanity is deeply confused about what corrigibility even is, or whether you think it’s more like we have a handle on it but can’t quite give its True Name.
More of my thoughts here: https://www.lesswrong.com/posts/txNsg8hKLmnvkuqw4/worlds-where-iterative-design-succeeds