Perhaps the model is updating its prior on “I am in an alignment eval” relative to “I am in a ridiculous roleplay scenario”.