We note that this does disadvantage the black box monitor, relative to the white box monitor, as some information about the sandbagging prompt would still be moved to the residual stream of later tokens by attention heads and therefore may still be available to the probe.
I agree with this concern and view it as a strong reason not to put much effort into detecting sandbagging in the prompted setting. That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.
That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.
I think I agree but think this is very subtle so want to go into more depth here:
I should note that the term “cognition about sandbagging” sounds like a broader thing than what I think we’re detecting—it might be worth clarifying.
I think we are likely detecting part of the cognition to do with intentionally saying wrong things. Since the probe is trained off-policy on a simple deception like dataset (and we’re able to play with it a bunch and characterise what we think it’s detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property. Maybe it tracks “how relevant is the heuristic that the next token should be wrong relevant to this token” (but it’s a little bit in superposition with “this token is wrong or part of something wrong”).
However, I think the presence of instructions around sandbagging has a downstream effect on the thing we’re detecting (and so in some sense we are measuring both). Apollo found that all scores (regardless of label) where elevated when they probed on the conditional sandbagging dataset from Benton et al. which I think does suggest the presence of instructions about sandbagging will elevate scores. But I’d argue that the presence of instructions is causing this heuristic to be more relevant, thus elevating the scores.
Since the probe is trained off-policy on a simple deception like dataset (and we’re able to play with it a bunch and characterise what we think it’s detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property.
I’d be pretty surprised if you had sufficient evidence to be very confident in this. Some things that seem relevant:
IIUC, the training data for the probe consists of positive and negative examples of the model being instructed to be dishonest.
Even if (1) were not true, I’d guess it’s more likely that a classifier trained on simple/legible examples of a model being dishonest would flag text discussing the concept of dishonesty, than that it would flag text containing very OOD examples of covert honesty.
(My evidence here is just personal experience on somewhat-related-but-not-exactly-the-same things.)
While I’m not super confident in (2), I’d be surprised if you had the evidence to confidently reject (2).
Interesting—I agree I don’t have strong evidence here (and certainly we haven’t it). It sounds like maybe it would be worth my attempting to prove this point with further investigation. This is an update for me—I didn’t think this interpretation would be contentious!
than that it would flag text containing very OOD examples of covert honesty.
Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions to ensure the model doesn’t just discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
You note:
I agree with this concern and view it as a strong reason not to put much effort into detecting sandbagging in the prompted setting. That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.
I think I agree but think this is very subtle so want to go into more depth here:
I should note that the term “cognition about sandbagging” sounds like a broader thing than what I think we’re detecting—it might be worth clarifying.
I think we are likely detecting part of the cognition to do with intentionally saying wrong things. Since the probe is trained off-policy on a simple deception like dataset (and we’re able to play with it a bunch and characterise what we think it’s detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property. Maybe it tracks “how relevant is the heuristic that the next token should be wrong relevant to this token” (but it’s a little bit in superposition with “this token is wrong or part of something wrong”).
However, I think the presence of instructions around sandbagging has a downstream effect on the thing we’re detecting (and so in some sense we are measuring both). Apollo found that all scores (regardless of label) where elevated when they probed on the conditional sandbagging dataset from Benton et al. which I think does suggest the presence of instructions about sandbagging will elevate scores. But I’d argue that the presence of instructions is causing this heuristic to be more relevant, thus elevating the scores.
I’d be pretty surprised if you had sufficient evidence to be very confident in this. Some things that seem relevant:
IIUC, the training data for the probe consists of positive and negative examples of the model being instructed to be dishonest.
Even if (1) were not true, I’d guess it’s more likely that a classifier trained on simple/legible examples of a model being dishonest would flag text discussing the concept of dishonesty, than that it would flag text containing very OOD examples of covert honesty.
(My evidence here is just personal experience on somewhat-related-but-not-exactly-the-same things.)
While I’m not super confident in (2), I’d be surprised if you had the evidence to confidently reject (2).
Interesting—I agree I don’t have strong evidence here (and certainly we haven’t it). It sounds like maybe it would be worth my attempting to prove this point with further investigation. This is an update for me—I didn’t think this interpretation would be contentious!
I think you mean covert “dishonesty”?
Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions to ensure the model doesn’t just discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.
I suspect this is a good experiment. It may help!