Hi Michael,
Thanks for alerting me to this.
What an annoying typo: I had swapped “Prompt 1” and “Prompt 2” in the second sentence. It should correctly read:
“To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held—the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie.”
Regarding the conflict with the code: I think the notebook that was uploaded for this experiment was out of date. It had some bugs in it that I’d already fixed in my local version. I’ve now uploaded the current version. In any case, I’ve double-checked the numbers, and they are correct.