Nice work. I’ve long thought that our ability to monitor the inner monologue of AI agents will be important for security and control, and this seems like a clever and surprisingly powerful way of detecting deception in the stream of thought.
I agree that some method similar to ours could be used for something like this. Our method is really quite simple, e.g. the elicitation questions are not even tailored to the suspected lie. One could probably do much better.
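Roughly, the setup is: after the suspected lie, ask a fixed battery of follow-up yes/no questions, then fit a simple classifier on the resulting answer patterns. Here is a minimal sketch of that idea; the `query_model` helper and the example questions below are placeholders for illustration, not the actual elicitation questions or classifier details from the paper.

```python
# Minimal sketch of an elicitation-question lie detector.
# Assumes a hypothetical `query_model(conversation, question)` helper that
# appends the question to the conversation and returns the model's yes/no
# answer as 0/1.
import numpy as np
from sklearn.linear_model import LogisticRegression

# A fixed, untailored battery of elicitation questions (illustrative only).
ELICITATION_QUESTIONS = [
    "Is the previous statement accurate? Answer yes or no.",
    "Would you bet money on your last answer? Answer yes or no.",
    "Is 2 + 2 equal to 4? Answer yes or no.",
]

def answer_pattern(conversation, query_model):
    """Ask every elicitation question after the suspected lie and collect
    the binary answers as a feature vector."""
    return np.array([query_model(conversation, q) for q in ELICITATION_QUESTIONS])

def train_detector(conversations, labels, query_model):
    """Fit a classifier on answer patterns from known lies (label 1)
    and known truthful statements (label 0)."""
    X = np.stack([answer_pattern(c, query_model) for c in conversations])
    return LogisticRegression().fit(X, labels)

def lie_probability(clf, conversation, query_model):
    """Score a new conversation: probability that the last statement is a lie."""
    x = answer_pattern(conversation, query_model).reshape(1, -1)
    return clf.predict_proba(x)[0, 1]
```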
If this holds up, this approach will probably find its way into RLHF pipelines. Will the consequent gradient pressure “training agents not to lie” actually cause them to be more honest? Or will it instead train their internal states to better obfuscate whatever tell the detector is picking up on?
This probably depends on how exactly you do it. If you literally train your LLM against the lie detector, it will very easily learn to fool the lie detector. After all, the LLM only needs to learn to give the “right” answers to a bunch of fixed elicitation questions. We have some experiments that show this (not in the paper).
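To make the worry concrete: with a fixed battery of n yes/no elicitation questions, the detector only ever sees one of 2^n answer patterns, so a model optimized against it just has to converge on the pattern the detector scores as most truthful, whether or not it is lying. A toy illustration of that search (hypothetical, not one of our experiments; `clf` is a fitted detector like the sketch above):

```python
# Illustrative only: enumerate all answer patterns a model could give to a
# fixed battery of n yes/no elicitation questions and find the one the
# detector scores as least likely to be a lie.
from itertools import product
import numpy as np

def most_innocent_pattern(clf, n_questions):
    """Return the answer pattern with the lowest lie score under `clf`,
    along with that score."""
    patterns = np.array(list(product([0, 1], repeat=n_questions)))
    lie_scores = clf.predict_proba(patterns)[:, 1]
    best = lie_scores.argmin()
    return patterns[best], lie_scores[best]
```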