Nicholas Goldowsky-Dill comments on Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill 10 Feb 2025 10:20 UTC
1 point
Yes, I’m excited about “probing the answer to followup questions” as a general technique. Our results were promising, but there’s a lot of iteration one could do to make the technique work better!
Separately, did you run any experiments including the probe on the CoT tokens? I would suspect there to be a pretty reliable signal here, and ideally this could be used to help confirm honest CoT.
On our website you can look at the probe scores on the CoT. I don’t have up-to-date numbers, but I expect it’s mildly helpful for improving classification, at least on insider trading.
Certainly probing CoT is a good tool to have. I could imagine it being especially useful if CoT becomes less interpretable (e.g. probing neuralese).
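To make "including the probe on the CoT tokens" concrete, here is a minimal sketch of scoring CoT and response tokens separately with a linear probe and then combining the two. Everything in it is an illustrative assumption rather than our actual pipeline: the toy activations, the probe direction, the token spans, the mean aggregation, and the 50/50 weighting are all made up.

```python
def token_scores(activations, direction, bias=0.0):
    """Linear probe score per token: dot(activation, direction) + bias."""
    return [sum(a * w for a, w in zip(act, direction)) + bias
            for act in activations]


def mean_score(scores, start, end):
    """Mean probe score over the half-open token span [start, end)."""
    span = scores[start:end]
    if not span:
        raise ValueError("empty token span")
    return sum(span) / len(span)


# Toy data: 4 tokens with 2-dimensional "activations" (made-up numbers).
activations = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0]]
direction = [1.0, -1.0]  # hypothetical probe direction

scores = token_scores(activations, direction)

# Assumed spans: tokens 0-1 are the CoT, tokens 2-3 are the final response.
cot_score = mean_score(scores, 0, 2)
response_score = mean_score(scores, 2, 4)

# Assumed combination rule: equal weighting of the two span-level scores.
combined = 0.5 * cot_score + 0.5 * response_score
```

In practice the weighting (or whether to feed both span scores into a small classifier) would be something to tune on held-out data.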