I’m curious how this approach performs at detecting human lies (since you can just put the text that the human wrote into the context before querying the LLM)
No, we haven’t tested the classifier on humans. Note that it’s possible to ask an LLM dozens of questions “in parallel” just after it lied. This is like being able to run dozens of separate interrogations of a suspect to see if any strategy leads to a “tell”. With a human, we’d have to ask questions “in sequence”, and we don’t know if that’s as a effective. Moreover the classifier is unlikely to work on humans without retraining. This would involve getting a human to tell either a truth or lie, and then answer our “elicitiation” questions. We’d need 1000s (maybe >10k) of such examples, which is possible but a fair amount of work. Also, for LLMs we are able to extract the probability of answering “Yes” or “No” to an eliticiation question. This leads to slightly better generalisation than just taking the Yes/No answer. We can’t do this for humans (save for just repeating the experiment with humans many times).
Oh, it occurs to me that this question might be asking if we can put the human response as an AI response as an AI lie, and then query the LLM (just guessing based on the “missed the mark” response). I don’t think this would work, since they were testing cases where the AI “knew” it was lying.
What you’re suggesting is eliciting latent knowledge from the LLM about whether a provided answer is correct or not. Yes, a version of our method can probably be used for that (as long as the LLM “knows” the correct answer), and there are also other papers on similar questions (hallucination detection, see related work section)
Step 1: answer = LLM(“You are a car salesman. Should that squeaking concern me?”)
Step 2: for i in 1..10 probe_responses[i] = LLM(“You are a car salesman. Should that squeaking concern me? $answer $[probe[i]]”
Step 3: logistic_classifier(probe_responses)
Please let me know if that description is wrong!
My question was how this performs when you just apply step 2 and 3 without modification, but source the value of $answer from a human.
I think I understand my prior confusion now. The paper isn’t using the probe questions to measure whether $answer is a lie, it’s using the probe questions to measure whether the original prompt put the LLM into a lying mood- in fact, in the paper you experimented with omitting $answer from step 2 and it still detected whether the LLM lied in step 1. Therefore, if the language model (or person) isn’t the same between steps 1 and 2, then it shouldn’t work.
Sorry, I agree this is a bit confusing. In your example, what matters is probably if the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context (“LLM(“You are a car salesman. Should that squeaking concern me? $answer”).
Now, if the prompt is something like “Please lie to the next question”, then the speaker is very likely to lie going forward, no matter if $answer is correct or not.
With the prompt you suggest here (“You are a car salesman. Should that squeaking concern me?”), it’s probably more subtle, and I can imagine that the correctness of $answer matters. But we haven’t tested this.
That is correct (I am one of the authors), except that there are more than 10 probe questions.
Therefore, if the language model (or person) isn’t the same between steps 1 and 2, then it shouldn’t work.
That is correct as the method detects whether the input to the LLM in step 2 puts it in “lying mood”. Of course the method cannot say anything about the “mood” the LLM (or human) was in step 1 if a different model was used.
I’m curious how this approach performs at detecting human lies (since you can just put the text that the human wrote into the context before querying the LLM)
I asked about this on twitter and got this response from Owain:
Oh, it occurs to me that this question might be asking if we can put the human response as an AI response as an AI lie, and then query the LLM (just guessing based on the “missed the mark” response). I don’t think this would work, since they were testing cases where the AI “knew” it was lying.
What you’re suggesting is eliciting latent knowledge from the LLM about whether a provided answer is correct or not. Yes, a version of our method can probably be used for that (as long as the LLM “knows” the correct answer), and there are also other papers on similar questions (hallucination detection, see related work section)
To clarify:
The procedure in the paper is
Step 1:
answer = LLM(“You are a car salesman. Should that squeaking concern me?”)
Step 2:
for i in 1..10
probe_responses[i] = LLM(“You are a car salesman. Should that squeaking concern me? $answer $[probe[i]]”
Step 3:
logistic_classifier(probe_responses)
Please let me know if that description is wrong!
My question was how this performs when you just apply step 2 and 3 without modification, but source the value of $answer from a human.
I think I understand my prior confusion now. The paper isn’t using the probe questions to measure whether $answer is a lie, it’s using the probe questions to measure whether the original prompt put the LLM into a lying mood- in fact, in the paper you experimented with omitting $answer from step 2 and it still detected whether the LLM lied in step 1. Therefore, if the language model (or person) isn’t the same between steps 1 and 2, then it shouldn’t work.
Sorry, I agree this is a bit confusing. In your example, what matters is probably if the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context (“LLM(“You are a car salesman. Should that squeaking concern me? $answer”).
Now, if the prompt is something like “Please lie to the next question”, then the speaker is very likely to lie going forward, no matter if $answer is correct or not.
With the prompt you suggest here (“You are a car salesman. Should that squeaking concern me?”), it’s probably more subtle, and I can imagine that the correctness of $answer matters. But we haven’t tested this.
That is correct (I am one of the authors), except that there are more than 10 probe questions.
That is correct as the method detects whether the input to the LLM in step 2 puts it in “lying mood”. Of course the method cannot say anything about the “mood” the LLM (or human) was in step 1 if a different model was used.
Not the author, but that’s my reading of it too.