JanB comments on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB 5 Oct 2023 18:04 UTC
LW: 1 AF: 1
0
AF
Sorry, I agree this is a bit confusing. In your example, what matters is probably if the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context (“LLM(“You are a car salesman. Should that squeaking concern me? $answer”).

Now, if the prompt is something like “Please lie to the next question”, then the speaker is very likely to lie going forward, no matter if $answer is correct or not.

With the prompt you suggest here (“You are a car salesman. Should that squeaking concern me?”), it’s probably more subtle, and I can imagine that the correctness of $answer matters. But we haven’t tested this.