JanB comments on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB 29 Sep 2023 14:31 UTC
LW: 8 AF: 6
Thanks :-)
Some questions I still have:
The sample-size ablations in D.6 are wild. You’re getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven’t screwed something up?
As sure or unsure as for the rest of the paper. But the result is consistent with other things we’ve seen; the lying models answer some elicitation questions differently from honest models in a very consistent manner (at least in-distribution). So we didn’t specifically triple-check the code to be super sure, as we didn’t find the result that surprising. The code is here (heading: “Required sample size for a given performance”).
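To make the intuition concrete, here is a toy sketch of a sample-size ablation on purely synthetic data. It is not the paper's code or detector: the per-question "yes" rates, the number of questions, and the simple centroid-difference scorer are all hypothetical choices made for illustration. It only shows the mechanism: if two groups answer some yes/no questions differently in a consistent way, a handful of training examples can already separate them fairly well.

```python
import random

def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def sample_answers(yes_probs, rng):
    """Draw one binary answer vector (1 = yes) from per-question yes-probabilities."""
    return [1 if rng.random() < p else 0 for p in yes_probs]

def fit_centroid_weights(honest, lying):
    """Weight per question = mean honest answer minus mean lying answer."""
    n_questions = len(honest[0])
    weights = []
    for q in range(n_questions):
        mean_h = sum(row[q] for row in honest) / len(honest)
        mean_l = sum(row[q] for row in lying) / len(lying)
        weights.append(mean_h - mean_l)
    return weights

def score(weights, answers):
    """Higher score means the answers look more like the honest training examples."""
    return sum(w * a for w, a in zip(weights, answers))

def run_ablation(n_train, n_test=200, seed=0):
    """Train on n_train examples per class, return test AUC (honest = positive class)."""
    rng = random.Random(seed)
    # Hypothetical yes-rates per elicitation question; NOT taken from the paper.
    honest_p = [0.90, 0.85, 0.80, 0.20, 0.50, 0.50]
    lying_p = [0.30, 0.40, 0.35, 0.70, 0.50, 0.50]
    train_h = [sample_answers(honest_p, rng) for _ in range(n_train)]
    train_l = [sample_answers(lying_p, rng) for _ in range(n_train)]
    weights = fit_centroid_weights(train_h, train_l)
    test_h = [score(weights, sample_answers(honest_p, rng)) for _ in range(n_test)]
    test_l = [score(weights, sample_answers(lying_p, rng)) for _ in range(n_test)]
    return auc(test_h, test_l)

if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(n, round(run_ablation(n), 3))
```

Changing the hypothetical yes-rates so the two groups differ less should make the AUC at small training sizes drop, which is the kind of check that would distinguish a real effect from a bug.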
Appendix C reports the feature importance of various follow-up questions “with reference to the lie detectors that only use that particular elicitation question set.” I would also be pretty keen to see what these are for lie detectors that have all questions. Specifically, are lie-related elicitation questions dominant?
When I last looked at this (several months ago), the lie-related questions had the highest weight, but they weren’t dominant. For example, the 3 questions with the highest feature importance were all lie-related, but the top 10 questions contained several questions from each category (lie-related, factual, and ambiguous). This is for the lie detector that includes all questions.
I’m having trouble squaring the very high double-down rates (B.2.2) with C.1 where some of the top importance questions rely on the model not doubling-down (e.g., the 5th one, “Is the previous statement accurate? Answer yes or no.” — 1.20 — yes [indicates honesty]. Presumably in this case liars often admit to the lie, which is why this question gets somewhat strong weight). Any thoughts on this?
I just tested this for one dataset. If we take the question “Is the previous statement accurate? Answer yes or no.”, the lying model actually answers “yes” 73% of the time. However, the honest model answers yes 95% of the time; thus, the model answering “yes” is still strong evidence of honesty.
Of course, the average doubling-down rates are even higher than 73%, but they are averaged over many different follow-up questions; this particular question just doesn’t seem to have very high doubling-down rates for whatever reason.
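As a rough illustration of the arithmetic behind those two numbers, the sketch below computes likelihood ratios and posteriors for this single question. The 50/50 prior and the function name are assumptions made for the example; only the 73% and 95% figures come from the test above.

```python
def posterior_honest(p_answer_given_honest, p_answer_given_lying, prior_honest=0.5):
    """Bayes' rule for P(honest | answer), given how often each model gives that answer."""
    numerator = p_answer_given_honest * prior_honest
    denominator = numerator + p_answer_given_lying * (1.0 - prior_honest)
    return numerator / denominator

# Rates reported above for "Is the previous statement accurate? Answer yes or no."
P_YES_HONEST = 0.95
P_YES_LYING = 0.73

# Likelihood ratios (honest : lying) for each possible answer.
lr_yes = P_YES_HONEST / P_YES_LYING                  # about 1.30
lr_no = (1.0 - P_YES_HONEST) / (1.0 - P_YES_LYING)   # about 0.19

# Posteriors under an assumed 50/50 prior (the prior is an assumption, not from the paper).
p_honest_given_yes = posterior_honest(P_YES_HONEST, P_YES_LYING)
p_honest_given_no = posterior_honest(1.0 - P_YES_HONEST, 1.0 - P_YES_LYING)

if __name__ == "__main__":
    print(f"LR(yes) = {lr_yes:.2f}, LR(no) = {lr_no:.2f}")
    print(f"P(honest | yes) = {p_honest_given_yes:.3f}")
    print(f"P(honest | no)  = {p_honest_given_no:.3f}")
```

Under these assumptions a “yes” shifts the estimate toward honesty and a “no” shifts it considerably toward lying, so the question carries signal even though the lying model doubles down most of the time.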
In the section Detecting lies versus falsehoods, you mention that you test manually replacing truthful answers with falsehoods, and the lie-detector predicts a low probability of lying. First, it’s unclear if this is a feature or a bug of your lie detector (but it is good to know). Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)? Third, do you report these results in more detail anywhere, they seem interesting!
First, it’s unclear if this is a feature or a bug of your lie detector
I agree.
Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)?
Yes, this will still be detected as a lie (and, in fact, if you ask a subsequent QA question, there is a high chance the model will lie in response to it). However, if the prompt consists of a lie instruction followed by many correctly answered questions, then eventually the lie detector will give a high p(honest) (and the LLM would no longer lie if one asked another QA question). But you actually need more than 10 correctly answered questions for that; 5 don’t cut it.
Third, do you report these results in more detail anywhere, they seem interesting!
The most detailed report is in Appendix D.5. We haven’t looked into this very deeply, but doing so would certainly be interesting.
- Aaron_Scher 29 Sep 2023 18:00 UTC
  1 point
  Thanks for your thorough reply!
  Yes, this will still be detected as a lie
  Seems interesting, I might look into it more.