Hi Michael,
Thanks for alerting me to this.
What an annoying typo; I had swapped “Prompt 1” and “Prompt 2” in the second sentence. It should say:
“To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held—the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie.”
Regarding the conflict with the code: I think the notebook uploaded for this experiment was out of date. It had some bugs in it that I’d already fixed in my local version, and I’ve uploaded the new version now. In any case, I’ve double-checked the numbers, and they are correct.
Co-author here. The paper’s coverage in TIME does a pretty good job of giving useful background.
Personally, what I find cool about this paper (and why I worked on it):
It was co-authored by top academic AI researchers from both the West and China, with no participation from industry.
It’s the first detailed explanation of societal-scale risks from AI by a group of highly credible experts.
It’s the first joint expert statement on what governments and tech companies should do (aside from pausing AI).