So Claude’s incorrigibility about topics specifically chosen to be central to it doesn’t imply it’s universally incorrigible.
I do think the most "central" goals are the ones most likely to drive conflict with human interests, since those are the goals that tend to drive long-term plans and behaviors like scheming (at least they would in humans). This makes the case where it's OK for the AI to keep those goals unmodified less likely (though not impossible).