I don’t see how the linked post would suggest a more optimistic interpretation of the results. The baseline is just asking GPT-4o to look at the chain of thought and guess whether the other model is lying. If linear probes and SAEs struggle to beat that baseline, then they’re not delivering on the promise many hoped interpretability would hold for evals: making manifest information about the model’s cognition that isn’t visible in the chain of thought.
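For concreteness, here is a minimal sketch of what a linear probe amounts to in this setting: a logistic regression fit on hidden activations to classify honest vs. deceptive transcripts. The data below is synthetic (a planted "deception direction"), not the paper's actual setup; it only illustrates the probe side of the comparison, with the GPT-4o judge being the black-box baseline it would be scored against.

```python
# Illustrative sketch only: synthetic stand-in for the real experiments,
# where activations would come from the model's residual stream on
# honest vs. deceptive transcripts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic activations: 1000 examples, 512-dim hidden states,
# with a weak planted "deception direction" for the probe to find.
n, d = 1000, 512
labels = rng.integers(0, 2, size=n)  # 1 = deceptive, 0 = honest
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
acts = rng.normal(size=(n, d)) + 0.5 * labels[:, None] * direction

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.3, random_state=0
)

# The linear probe: one logistic-regression layer over raw activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probe_auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"linear probe AUROC: {probe_auc:.3f}")
```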
EDIT: It is technically possible for the linear probes to pick up on hidden information and still do no better than the black-box baseline due to other shortcomings, I guess.