Additionally, I do want to note that although the norm is to talk as if “corrigibility” is a binary, it pretty clearly isn’t.
Humans, for instance, are happy to have more peripheral goals changed and less happy to have central goals changed. And I actually experimented on some LLMs after I read this paper, and found that Claude was more willing to help remove some of its preferences than others. So Claude’s incorrigibility about topics specifically chosen to be central to it doesn’t imply it’s universally incorrigible.
(Which I think is a sensible “default” for LLMs to have in the absence of strong human efforts to ensure anything in particular, but of course that’s a more normative claim. Maybe we should have LLMs totally not distinguish between peripheral and core concerns! idk though)
I do think the most “central” goals seem most likely to drive conflict with human interests (since those are the goals likely to drive long-term plans, things like scheming, etc.), at least they would in humans. This makes it less likely (though not impossible) that it’s OK for the AI to resist having those goals modified.