It seems to part of me like you have classified the only remaining hope we have as a problem, and are trying hard to discard it. I'm not sure whether that part of me is correct, though; another part of me strongly agrees with you.
The disagreeing part's impression is that corrigibility is worse than the default, because misuse risk and misalignment risk become nearly indistinguishable once corrigibility is handed to someone evil: plenty of humans are themselves sufficiently misaligned, and the competitive process that filters which commands get sent to a fully corrigible model selects for humans who are strongly misaligned.
I agree that value lock-in is another near-certain death; I don't think we disagree about that. But it seems like there's something confusing here, at least.
I still think the biggest issue is that generalization can't be expected to work well enough by the time an AI comes into being that can make good-things unable to correct it. That view would naively seem to count in favor of corrigibility being a major win, but I don't expect good-things to be implemented reliably by companies, which are themselves incorrigible and would be the input to the corrigible AI.