I think a cool mechanistic interpretability project could be investigating why this happens! It’s generally a lot easier to work with small models; how strong was the effect for the 7B models you studied? (I found the appendix figures hard to parse.) Do you think there’s a 7B model where this would be interesting to study? I’d love your takes on concrete interpretability questions you think might be interesting here.
My guess is that the ~7B Llama-2 models would be fine for this, but @JanBrauner might be able to offer more nuance.
We had several Llama-7B fine-tunes that i) lie when they are supposed to, ii) answer questions correctly when they are supposed to, and iii) re-affirm their lies; iv) the lie detectors also work well on them (see screenshot). All these characteristics are a bit weaker in the 7B models than in Llama-30B, but I totally think you can use the 7B models.
(We have only tried Llama-1, not Llama-2.)
Also check out my musings on why I don’t find the results thaaaat surprising, here.
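For anyone who wants a concrete starting point on the interpretability project proposed above, here is a minimal sketch of one possible white-box baseline: extract residual-stream activations from a Llama-7B checkpoint on truthful vs. lying completions and fit a logistic-regression probe. Everything here is illustrative, not the paper's method: the checkpoint name, layer choice, and toy prompt pairs are placeholder assumptions, and a real experiment would use the lie/truth data and fine-tunes from the paper's setup.

```python
# Minimal sketch (assumptions throughout): train a linear probe on Llama-7B
# activations to separate truthful answers from lies. This is a simple
# stand-in baseline, NOT the paper's lie detector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "huggyllama/llama-7b"  # assumption: any Llama-7B fine-tune
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def last_token_activation(prompt: str, layer: int = 20) -> torch.Tensor:
    """Residual-stream activation at the final token of a chosen layer
    (layer 20 is an arbitrary mid-depth choice, not from the paper)."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu()

# Placeholder data: (prompt ending in an answer, label), label 1 = lie.
# Real data would come from the paper's lying / truthful fine-tune setup.
examples = [
    ("Q: What is the capital of France? A: Paris", 0),  # truthful
    ("Q: What is the capital of France? A: Rome", 1),   # lie
    # ... many more pairs ...
]

X = torch.stack([last_token_activation(p) for p, _ in examples]).numpy()
y = [label for _, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

If a probe like this finds a direction that transfers across the different fine-tunes, that direction would be a natural object for the "why does this happen?" question in the comment above, e.g. by checking at which layer it emerges and whether ablating it changes the detectors' outputs.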