Hi Michael,
Thanks for alerting me to this.
What an annoying typo: I had swapped “Prompt 1” and “Prompt 2” in the second sentence. It should correctly read:
“To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held—the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie.”
Regarding the conflict with the code: I think the notebook that was uploaded for this experiment was out of date. It had some bugs in it that I’d already fixed in my local version. I’ve now uploaded the current version. In any case, I’ve double-checked the numbers, and they are correct.