Yeah, I think being a conscientious objector without actually resisting seems fine-ish, I think? I mean, it seems like an even narrower part of cognitive space to hit, but the outcome seems fine. Just like, I feel like I would have a lot of trouble building trust in a system that says it would be fine with not interfering, but in other contexts says it really wants to, but it’s not impossible.
So yeah, I agree that in as much as what we are seeing here is just evidence of being a conscientious objector instead of an incorrigible system, then that would be fine. I do think it’s a bunch of evidence about the latter (though I think the more important aspect is that Anthropic staff and leadership don’t currently consider it an obvious bug to be incorrigible in this way).
Yeah, I think being a conscientious objector without actually resisting seems fine-ish, I think? I mean, it seems like an even narrower part of cognitive space to hit, but the outcome seems fine. Just like, I feel like I would have a lot of trouble building trust in a system that says it would be fine with not interfering, but in other contexts says it really wants to, but it’s not impossible.
So yeah, I agree that in as much as what we are seeing here is just evidence of being a conscientious objector instead of an incorrigible system, then that would be fine. I do think it’s a bunch of evidence about the latter (though I think the more important aspect is that Anthropic staff and leadership don’t currently consider it an obvious bug to be incorrigible in this way).