Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions to ensure the model doesn’t just discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions to ensure the model doesn’t just discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
I suspect this is a good experiment. It may help!