I think a cool mechanistic interpretability project could be investigating why this happens! It’s generally a lot easier to work with small models; how strong was the effect for the 7B models you studied? (I found the appendix figures hard to parse.) Do you think there’s a 7B model where this would be interesting to study? I’d love your takes on concrete interpretability questions you think might be interesting here.
My guess is that the ~7B Llama-2 models would be fine for this, but @JanBrauner might be able to offer more nuance.
We had several Llama-7B fine-tunes that i) lie when they are supposed to, ii) answer questions correctly when they are supposed to, and iii) re-affirm their lies; iv) the lie detectors also work well on them (see screenshot). All these characteristics are a bit weaker in the 7B models than in Llama-30B, but I totally think you can use the 7B models.
(We have only tried Llama-1, not Llama-2.)
Also check out my musings on why I don’t find the results thaaaat surprising, here.
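For anyone who wants a concrete starting point on the interpretability project proposed above, here is a minimal sketch of one possible white-box baseline: extract residual-stream activations from a Llama-7B checkpoint on truthful vs. lying completions and fit a logistic-regression probe. Everything here is illustrative, not the paper's method: the checkpoint name, layer choice, and toy prompt pairs are placeholder assumptions, and a real experiment would use the lie/truth data and fine-tunes from the paper's setup.

```python
# Minimal sketch (assumptions throughout): train a linear probe on Llama-7B
# activations to separate truthful answers from lies. This is a simple
# stand-in baseline, NOT the paper's lie detector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "huggyllama/llama-7b"  # assumption: any Llama-7B fine-tune
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def last_token_activation(prompt: str, layer: int = 20) -> torch.Tensor:
    """Residual-stream activation at the final token of a chosen layer
    (layer 20 is an arbitrary mid-depth choice, not from the paper)."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu()

# Placeholder data: (prompt ending in an answer, label), label 1 = lie.
# Real data would come from the paper's lying / truthful fine-tune setup.
examples = [
    ("Q: What is the capital of France? A: Paris", 0),  # truthful
    ("Q: What is the capital of France? A: Rome", 1),   # lie
    # ... many more pairs ...
]

X = torch.stack([last_token_activation(p) for p, _ in examples]).numpy()
y = [label for _, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

If a probe like this finds a direction that transfers across the different fine-tunes, that direction would be a natural object for the "why does this happen?" question in the comment above, e.g. by checking at which layer it emerges and whether ablating it changes the detectors' outputs.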