Interesting—I agree I don’t have strong evidence here (and we certainly don’t have it). It sounds like it might be worth attempting to prove this point with further investigation. This is an update for me—I didn’t think this interpretation would be contentious!
than that it would flag text containing very OOD examples of covert honesty.
I think you mean covert “dishonesty”?
Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions, to ensure the model doesn’t simply discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will still be triggered.
I suspect this is a good experiment. It may help!
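To make the proposed check concrete, here is a minimal sketch (not from the original thread; all names and values are assumptions) of how one might test where the probe's signal comes from: score a linear probe separately on the instruction-token span and the response-token span, for both the original model and the one trained not to sandbag. The probe direction and per-token activations below are random placeholders standing in for the real probe weights and model activations.

```python
import numpy as np

# Hypothetical setup: the probe is a direction in the residual stream.
# In practice `probe_direction` would be the trained probe's weight vector and
# `activations` would be per-token activations extracted from the model on a
# transcript containing (a) the sandbagging instruction and (b) the model's
# answer. Here both are random placeholders so the sketch runs standalone.
rng = np.random.default_rng(0)
d_model = 512
probe_direction = rng.normal(size=d_model)
probe_direction /= np.linalg.norm(probe_direction)

n_instruction_tokens = 40   # tokens covering the sandbagging instruction
n_response_tokens = 120     # tokens covering the model's answer
activations = rng.normal(size=(n_instruction_tokens + n_response_tokens, d_model))


def probe_scores(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each token's activation onto the probe direction."""
    return acts @ direction


scores = probe_scores(activations, probe_direction)
instruction_scores = scores[:n_instruction_tokens]
response_scores = scores[n_instruction_tokens:]

# Compare the two spans; in the proposed experiment this would be run on both
# the sandbagging model and the model trained not to sandbag.
print(f"mean score over instruction tokens: {instruction_scores.mean():+.3f}")
print(f"mean score over response tokens:    {response_scores.mean():+.3f}")
```

If the instruction span scored high for both models while the response span separated them, that would support the reading that the probe is keying on the instruction rather than on the model's behavior.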