JanB comments on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB 4 Oct 2023 17:18 UTC
We had several Llama-7B fine-tunes that i) lie when they are supposed to, ii) answer questions correctly when they are supposed to, iii) re-affirm their lies, and iv) for which the lie detectors work well (see screenshot). All these characteristics are a bit weaker in the 7B models than in Llama-30B, but I totally think you can use the 7B models.
(We have only tried Llama-1, not Llama-2.)
Also check out my musings on why I don’t find the results thaaaat surprising, here.