If Anthropic’s position is that the current “alignment faking” behavior applies to the second scenario, I think that argument needs to be laid out more explicitly.