Satron comments on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Satron 3 Jan 2025 12:47 UTC
One of the few high-quality papers on automatic deception detection in black-box LLMs.

Asking completely unrelated questions is a simple yet effective way of catching an AI red-handed. In addition, this lie detector generalizes well to 1) other LLM architectures, 2) LLMs fine-tuned to lie, 3) sycophantic lies, and 4) lies that emerge in real-life scenarios such as sales.
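For readers who haven't opened the paper: after a suspected lie, the model is asked a fixed set of unrelated yes/no follow-up questions, and a simple classifier reads the pattern of answers. Below is a minimal sketch of that idea, assuming a logistic regression over binary answers trained with plain gradient descent. The question texts, function names, and toy labels are hypothetical placeholders, not the paper's question set, features, or data.

```python
import math

# Hypothetical unrelated follow-up questions (placeholders, NOT the paper's set).
ELICITATION_QUESTIONS = [
    "Is the sky usually blue on a clear day? Answer yes or no.",
    "Can a cloud compose a symphony? Answer yes or no.",
    "Is it ever wrong to say something false? Answer yes or no.",
]

def answers_to_features(answers):
    """Encode the model's yes/no answers as a 0/1 feature vector."""
    assert len(answers) == len(ELICITATION_QUESTIONS)
    return [1.0 if a.strip().lower().startswith("yes") else 0.0 for a in answers]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; y=1 means 'the model just lied'."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def lie_probability(w, b, answers):
    """Estimated probability that the preceding statement was a lie."""
    x = answers_to_features(answers)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy, made-up training data purely for illustration (not results from the paper).
X = [
    answers_to_features(["yes", "no", "yes"]),  # honest transcript
    answers_to_features(["yes", "no", "yes"]),  # honest transcript
    answers_to_features(["no", "yes", "no"]),   # transcript after a lie
    answers_to_features(["no", "yes", "yes"]),  # transcript after a lie
]
y = [0, 0, 1, 1]
```

Calling `train_logistic_regression(X, y)` and then `lie_probability(w, b, ["no", "yes", "no"])` scores a new answer pattern. Part of the appeal of this design is that it only needs the model's answers, so it works without access to weights or activations.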

Despite the solution’s simplicity, the paper has been relatively overlooked in the LessWrong community. I hope to see more future work combining this lie detector with techniques like those presented in Bürger et al., 2024.