I think corrigibility is the ability to change a value/goal system. That's the literal meaning of the term… “correctable”. If an AI were fully aligned, there would be no need to correct it.
Perhaps I should make a better argument:
It’s possible that AGI is correctable, but (a) we don’t know what needs to be corrected, or (b) we cause new, less noticeable problems while correcting it.
So, I think there are not two assumptions (“alignment/interpretability is not solved” + “AGI is incorrigible”), but only one: “alignment/interpretability is not solved”. (A strong version of corrigibility counts as alignment/interpretability being solved.)
Yes, and that’s the specific argument I am addressing, not AI risk in general.
Except that if it’s many, many times smarter, it’s ASI, not AGI.
I also disagree that “doom” and “AGI becoming ASI very fast” are certain (> 90%).