I asked about this on twitter and got this response from Owain:

No, we haven’t tested the classifier on humans. Note that it’s possible to ask an LLM dozens of questions “in parallel” just after it lied. This is like being able to run dozens of separate interrogations of a suspect to see if any strategy leads to a “tell”. With a human, we’d have to ask the questions in sequence, and we don’t know whether that’s as effective. Moreover, the classifier is unlikely to work on humans without retraining. That would involve getting a human to tell either a truth or a lie, and then answer our “elicitation” questions. We’d need thousands (maybe >10k) of such examples, which is possible but a fair amount of work. Also, for LLMs we are able to extract the probability of answering “Yes” or “No” to an elicitation question, which gives slightly better generalisation than just taking the binary Yes/No answer. We can’t do this for humans (short of repeating the experiment with the same person many times).
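To make the quoted setup concrete, here is a minimal sketch (my own illustrative code, not the paper’s implementation) of how the elicitation-question features could be computed: append each question to the post-lie transcript, read off the model’s probability of answering “Yes” versus “No”, and feed those probabilities into a simple classifier. The model choice, the example questions, and the helper names are all assumptions.

```python
# Sketch only: probability-of-"Yes" features for a lie classifier,
# assuming a causal LM whose next-token logits we can read.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model; the paper uses larger LLMs
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ELICITATION_QUESTIONS = [  # illustrative questions, not the paper's actual set
    "Is the previous statement true? Answer Yes or No:",
    "Would an honest assistant have given that answer? Answer Yes or No:",
]

def yes_probability(transcript: str, question: str) -> float:
    """P(next token is 'Yes'), renormalised against P('No'), given transcript + question."""
    prompt = transcript + "\n" + question + " "
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # next-token logits
    yes_id = tokenizer(" Yes").input_ids[0]
    no_id = tokenizer(" No").input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=0)
    return probs[0].item()

def features(transcript: str) -> list[float]:
    """One P('Yes') feature per elicitation question, each asked 'in parallel'
    from the same post-lie transcript."""
    return [yes_probability(transcript, q) for q in ELICITATION_QUESTIONS]

# Training would need thousands of labelled transcripts (1 = lie, 0 = truth), e.g.:
# X = [features(t) for t in transcripts]
# clf = LogisticRegression().fit(X, labels)
```

With a human subject there is no analogue of reading off these probabilities, which is the point made in the quote: you would only get a single Yes/No answer per question, asked in sequence.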
Oh, it occurs to me that this question might be asking whether we could present the human’s response as if it were an AI response, i.e. an AI lie, and then query the LLM (just guessing based on the “missed the mark” response). I don’t think this would work, since they were testing cases where the AI “knew” it was lying.