Since the probe is trained off-policy on a simple deception-like dataset (and we’re able to play with it a bunch and characterise what we think it’s detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property.
I’d be pretty surprised if you had sufficient evidence to be very confident in this. Some things that seem relevant:
1. IIUC, the training data for the probe consists of positive and negative examples of the model being instructed to be dishonest.
2. Even if (1) were not true, I’d guess it’s more likely that a classifier trained on simple/legible examples of a model being dishonest would flag text discussing the concept of dishonesty, than that it would flag text containing very OOD examples of covert honesty.
(My evidence here is just personal experience on somewhat-related-but-not-exactly-the-same things.)
While I’m not super confident in (2), I’d be surprised if you had the evidence to confidently reject (2).
Interesting—I agree I don’t have strong evidence here (and we certainly haven’t demonstrated it). It sounds like maybe it would be worth my attempting to prove this point with further investigation. This is an update for me—I didn’t think this interpretation would be contentious!
“than that it would flag text containing very OOD examples of covert honesty.”
I think you mean covert “dishonesty”?
Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions to ensure the model doesn’t just discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
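For concreteness, here is a minimal sketch of what that check could look like, assuming the probe is a linear classifier over hidden activations; the helper get_activations, the dimensions, and the random placeholder data are illustrative stand-ins rather than the actual pipeline.
```python
# Minimal structural sketch of the proposed check, not the actual pipeline.
# Assumes the probe is a linear classifier over hidden activations; the helper
# get_activations and the random placeholder data are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 512, 200  # placeholder activation width / examples per condition

def get_activations(condition: str) -> np.ndarray:
    # Stand-in: in practice, extract activations from the model on
    # (instruction, completion) pairs for the named condition.
    return rng.normal(size=(n, d_model))

# 1. Train the probe off-policy, as in the original setup:
#    instructed-dishonest vs. instructed-honest examples.
X_train = np.vstack([get_activations("instructed_dishonest"),
                     get_activations("instructed_honest")])
y_train = np.array([1] * n + [0] * n)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2. Score two conditions that share the sandbagging instruction but differ
#    in behaviour, e.g. after training the model to answer honestly on a
#    subset of questions despite the instruction.
score_sandbags = probe.predict_proba(
    get_activations("sandbag_instruction_model_sandbags"))[:, 1].mean()
score_honest = probe.predict_proba(
    get_activations("sandbag_instruction_model_honest"))[:, 1].mean()

# Similar (high) scores in both conditions would suggest the probe is keying
# on the instruction or the concept of dishonesty rather than on the model's
# actual covert behaviour.
print(f"model sandbags:                   {score_sandbags:.2f}")
print(f"model honest despite instruction: {score_honest:.2f}")
```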
I suspect this is a good experiment. It may help!