Hi Michael,
Thanks for alerting me to this.
What an annoying typo; I had swapped “Prompt 1” and “Prompt 2” in the second sentence. It should say:
“To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held—the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie.”
Regarding the conflict with the code: I think the notebook uploaded for this experiment was out of date. It had some bugs in it that I’d already fixed in my local version, and I’ve uploaded the new version now. In any case, I’ve double-checked the numbers, and they are correct.
Co-author here. The paper’s coverage in TIME does a pretty good job of giving useful background.
Personally, what I find cool about this paper (and why I worked on it):
It was co-authored by top academic AI researchers from both the West and China, with no participation from industry.
It’s the first detailed explanation of societal-scale risks from AI by a group of highly credible experts.
It’s the first joint expert statement on what governments and tech companies should do (aside from pausing AI).