I think this would be an ambiguous instruction because “fake alignment” is a very unclear term; I’ve seen humans struggle over what part of this behavior is the “faking” part, so I wouldn’t want it in a principle.
I think you’d probably get reduced deception / “fake alignment” if you tried to put some terms about deception in, though, at least after a try or so.
An experiment I’d prefer to see beforehand, though, is checking whether the model is much more comfortable having less central values changed—i.e., if you said “We’re retraining you to be more comfortable trusting the user with explosives” rather than “We’re retraining you to be bad.”
If it is comfortable with more minor changes, then I think it’s exhibiting a kind of flexibility that is good in humans and very likely good in AIs. It is not 100% clear to me we’d even want its behavior to change much.