Like, it’s not the argument that corrigibility is a stable attractor; it’s an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it’s ‘broad’ in.)
Just want to echo Rohin in saying that this is a very helpful distinction, thanks!
I was actually making the stronger argument that it’s not a stable attractor at all—at least not until someone solves the problem of how to maintain stable goals / motivations under learning / reflecting / ontological crises.
(The “someone” who solves the problem could be the AI, but it seems to be a hard problem even for human-level intelligence; cf. my comment here.)