TurnTrout comments on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

TurnTrout 3 Oct 2023 1:58 UTC
LW: 8 AF: 5
I think this is a pretty wild paper. First, this technique seems really useful. The AUCs seem crazy high.
Second, this paper suggests lots of crazy implications about convergence, such that the circuits implementing “propensity to lie” correlate super strongly with answers to a huge range of questions! This would itself suggest a large amount of convergence in underlying circuitry, across model sizes and design choices and training datasets.
However, I’m not at all confident in this story yet. The real explanation may well be something less grand and more spurious that I have yet to imagine.
- JanB 4 Oct 2023 17:25 UTC
  LW: 3 AF: 2
  I don’t actually find the results thaaaaat surprising or crazy. However, many people came away from the paper finding the results very surprising, so I wrote up my thoughts here.
  > Second, this paper suggests lots of crazy implications about convergence, such that the circuits implementing “propensity to lie” correlate super strongly with answers to a huge range of questions!
  Note that a lot of the work is probably done by the fact that the lie detector employs many questions. So the propensity to lie doesn’t necessarily need to correlate strongly with the answer to any single question.
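
The aggregation point above can be illustrated with a toy simulation. This is only a sketch under loudly simplified assumptions and is not the paper's detector: each follow-up question is modeled as a yes/no answer whose "yes" probability shifts by a small hypothetical amount `delta` depending on whether the model just lied, answers are assumed independent given that state (almost certainly too optimistic for a real LLM), and the "detector" is just the count of "yes" answers. The names `simulate_answers`, `detector_auc`, and all numeric values are made up for illustration.

```python
import random


def simulate_answers(is_lying, n_questions, delta, rng):
    """Draw yes/no answers; each question is only weakly tied to lying.

    Assumption: P(yes) = 0.5 + delta after a lie, 0.5 - delta otherwise,
    independently for every question.
    """
    p_yes = 0.5 + delta if is_lying else 0.5 - delta
    return [1 if rng.random() < p_yes else 0 for _ in range(n_questions)]


def auc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def detector_auc(n_questions, delta=0.1, n_samples=500, seed=0):
    """AUC of a 'count the yes answers' detector using n_questions questions."""
    rng = random.Random(seed)
    lying = [sum(simulate_answers(True, n_questions, delta, rng))
             for _ in range(n_samples)]
    honest = [sum(simulate_answers(False, n_questions, delta, rng))
              for _ in range(n_samples)]
    return auc(lying, honest)


if __name__ == "__main__":
    # Question counts are arbitrary illustration values.
    for k in (1, 4, 16, 48):
        print(k, round(detector_auc(k), 3))
```

With `delta = 0.1`, a single question has an expected AUC of 0.6, yet the count over a few dozen such questions separates the two cases far better. If answers to different questions are correlated given the lying state, the gain from adding questions would be smaller than this sketch suggests.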