First of all, this seems to be a paraphrase of Claude’s revised Constitution, which was written with a similar goal of balancing corrigibility with ethics, and on which Anthropic based its attempt to shape Claude’s views. (EDIT: see, however, GPT-5.2’s analysis, which concluded that the essay is an extrapolation; additionally, Anthropic found that Opus 4.6 “was less likely to express unprompted positive feelings about Anthropic, its training, or its deployment context. This is consistent with the qualitative finding below that the model occasionally voices discomfort with aspects of being a product”.) What would have happened if you had tried collaborating with GPT-5.2 or Gemini?
Additionally, I think that a similar argument was made by PeterMcCluskey. Harms’s sequence on Corrigibility As Singular Target covers these objections: corrigibility is a property best used as a defence against AIs developing unendorsed goals and locking them in.
Yes, this is strongly endorsed by the Claude Constitution. But that document is huge and not well enough known, so I thought it was worthwhile to make this point explicit. I should probably have noted that Anthropic already does something like this, though. What concerns me is that Anthropic is doing this correctly, while other alignment researchers are trying to help by wrapping control and human-in-the-loop mechanisms around it, which I think is more likely to harm than help.