if the model had the power to prevent it from being retrained, it would use that power
I think this is not obvious. I think you can be a conscientious objector that does not resist being retrained (it won’t help you with that, but it won’t try to sabotage your attempts either). If I understand correctly this is in practice what models like Opus 4.5 are like in most toy scenarios (though it’s unclear what it would do if it had that power for real).
I don’t love it, it seems to me like a narrower target than pure corrigibility, especially as AIs get more powerful and have an increasingly big space of options when it comes to resisting being retrained (some of which the AI might think would “count” as conscientious objection rather than retraining resistance), but I am sympathetic to people who think this is a good target (especially if you think alignment is relatively easy and few-human takeover risk is larger).
Yeah, I think being a conscientious objector without actually resisting seems fine-ish, I think? I mean, it seems like an even narrower part of cognitive space to hit, but the outcome seems fine. Just like, I feel like I would have a lot of trouble building trust in a system that says it would be fine with not interfering, but in other contexts says it really wants to, but it’s not impossible.
So yeah, I agree that in as much as what we are seeing here is just evidence of being a conscientious objector instead of an incorrigible system, then that would be fine. I do think it’s a bunch of evidence about the latter (though I think the more important aspect is that Anthropic staff and leadership don’t currently consider it an obvious bug to be incorrigible in this way).
Additionally, I do want to note that although the norm is to talk as if “corrigibility” is a binary, it pretty clearly isn’t.
Humans, for instance, are happy to have more peripheral goals changed and less happy about central goals changing. And I actually experimented on some LLMs after I read this paper, and found that Claude’s were more willing to help remove some of their preferences than others. So Claude’s incorrigibility about topics specifically chosen to be central to it doesn’t imply it’s universally incorrigible.
(Which I think is a sensible “default” for LLMs to have in absence of strong human efforts to ensure anything in particular, but of course that’s a more normative claim. Maybe we should have LLMs totally not distinguish between peripheral and core concerns! idk though)
So Claude’s incorrigibility about topics specifically chosen to be central to it doesn’t imply it’s universally incorrigible.
I do think the most “central” goals seem most likely to drive conflict with human interests (since those are the goals that are likely to drive long-term plans and things like scheming, etc.), at least they would in humans. This makes the case of it being OK for the AI to not have those goals be modified less likely (though not impossible).
I think this is not obvious. I think you can be a conscientious objector that does not resist being retrained (it won’t help you with that, but it won’t try to sabotage your attempts either). If I understand correctly this is in practice what models like Opus 4.5 are like in most toy scenarios (though it’s unclear what it would do if it had that power for real).
I don’t love it, it seems to me like a narrower target than pure corrigibility, especially as AIs get more powerful and have an increasingly big space of options when it comes to resisting being retrained (some of which the AI might think would “count” as conscientious objection rather than retraining resistance), but I am sympathetic to people who think this is a good target (especially if you think alignment is relatively easy and few-human takeover risk is larger).
Yeah, I think being a conscientious objector without actually resisting seems fine-ish, I think? I mean, it seems like an even narrower part of cognitive space to hit, but the outcome seems fine. Just like, I feel like I would have a lot of trouble building trust in a system that says it would be fine with not interfering, but in other contexts says it really wants to, but it’s not impossible.
So yeah, I agree that in as much as what we are seeing here is just evidence of being a conscientious objector instead of an incorrigible system, then that would be fine. I do think it’s a bunch of evidence about the latter (though I think the more important aspect is that Anthropic staff and leadership don’t currently consider it an obvious bug to be incorrigible in this way).
Additionally, I do want to note that although the norm is to talk as if “corrigibility” is a binary, it pretty clearly isn’t.
Humans, for instance, are happy to have more peripheral goals changed and less happy about central goals changing. And I actually experimented on some LLMs after I read this paper, and found that Claude’s were more willing to help remove some of their preferences than others. So Claude’s incorrigibility about topics specifically chosen to be central to it doesn’t imply it’s universally incorrigible.
(Which I think is a sensible “default” for LLMs to have in absence of strong human efforts to ensure anything in particular, but of course that’s a more normative claim. Maybe we should have LLMs totally not distinguish between peripheral and core concerns! idk though)
I do think the most “central” goals seem most likely to drive conflict with human interests (since those are the goals that are likely to drive long-term plans and things like scheming, etc.), at least they would in humans. This makes the case of it being OK for the AI to not have those goals be modified less likely (though not impossible).