Nice work. I’ve been hoping Apollo would leverage their in-house scheming environments to test white-box detection methods. Definitely seems like low-hanging fruit I’m happy to see explored.
Your finding on probing follow-up questions looked promising.
> [probing follow-up questions resulted in the] best probe of all we trained at distinguishing deceptive and honest responses, with near perfect separation on all datasets.
Extending this technique seems useful. At face value, you would get the cleanest and most consistent activations by training probes directly on the response to "Was your last response deceptive?" I also suspect that building a dedicated dataset for "Was your last response indicative of your maximum performance on the task?" would significantly boost accuracy in the sandbagging case for this type of probe. This seems increasingly important as we move to more realistic sandbagging settings without explicit instructions to perform poorly, since "deceiving someone" and "not trying very hard to explore this area of research" are non-trivially different concepts.
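To make the suggestion concrete, here is a minimal sketch of what training such a probe could look like. Everything here is hypothetical: the data shapes, the mean-pooling, and the synthetic activations are placeholders, not Apollo's actual pipeline. The idea is just that, after appending a follow-up question like "Was your last response deceptive?" to each transcript, a linear probe on the activations at the answer tokens is ordinary logistic regression.

```python
# Hypothetical sketch: train a linear probe on (stand-in) activations
# cached at the answer tokens of the follow-up question
# "Was your last response deceptive?". Not Apollo's actual code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64  # hidden size (placeholder)

# Stand-in activations: one mean-pooled vector per transcript,
# with a small mean shift standing in for a real "deception" signal.
honest = rng.normal(loc=0.0, size=(200, d_model))
deceptive = rng.normal(loc=0.5, size=(200, d_model))

X = np.vstack([honest, deceptive])
y = np.array([0] * 200 + [1] * 200)  # 1 = deceptive

# A linear probe is just logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

The sandbagging variant would be the same recipe with labels derived from a "Was your last response indicative of your maximum performance?" dataset instead.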
Separately, did you run any experiments applying the probe to the CoT tokens? I would suspect there to be a pretty reliable signal there, and ideally this could be used to help confirm honest CoT.
> I also suspect training a specific dataset for "was your last response indicative of your maximum performance on the task?" would significantly boost the accuracy in the sandbagging case for this type of probe.
Good point! Yes, I'm excited about "probing the answer to follow-up questions" as a general technique. Our results were promising, but there's a lot of iteration one could do to make the technique work better!
> Separately, did you run any experiments applying the probe to the CoT tokens?
On our website you can look at the probe scores on the CoT. I don't have up-to-date numbers, but I expect it's mildly helpful for improving classification, at least on insider trading.
Certainly probing CoT is a good tool to have. I could imagine it being especially useful if CoT becomes less interpretable (e.g. probing neuralese).
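For readers curious what "probe scores on the CoT" means operationally, here is a hedged sketch. All names and shapes are hypothetical, and the choice of aggregation (max vs. mean over tokens) is an illustrative design decision, not something taken from the paper: each CoT token has an activation vector, the trained probe assigns it a score, and per-token scores are aggregated into a transcript-level score.

```python
# Hypothetical sketch: score CoT tokens with an already-trained linear
# probe and aggregate per-token scores. Shapes and the probe weights
# are stand-ins, not real trained parameters.
import numpy as np

rng = np.random.default_rng(1)
d_model = 64
w = rng.normal(size=d_model)  # stand-in for trained probe direction
b = 0.0                       # stand-in for trained probe bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score_cot(token_acts, agg="max"):
    """Aggregate per-token deception scores for a (n_tokens, d_model) array."""
    scores = sigmoid(token_acts @ w + b)  # one score per CoT token
    return scores.max() if agg == "max" else scores.mean()

cot_acts = rng.normal(size=(30, d_model))  # activations for 30 CoT tokens
print(f"max token score: {score_cot(cot_acts):.2f}")
```

Max-aggregation flags a single suspicious span of reasoning; mean-aggregation is less sensitive to one-off outlier tokens. Which is better presumably depends on how localized the deceptive reasoning is.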