Nice work. I’ve long that that our ability to monitor the inner monologue of AI agents will be important for security&control—and this seems like a clever surprisingly powerful way of detecting deception in the stream of thought.
If this holds up this approach will probably find its way into RLHF pipelines. Will the consequent gradient pressure “training agents not to lie” actually cause them to be more honest? Or will instead train their internal states to better obfuscate whatever tell the detector is picking up on?
Nice work. I’ve long that that our ability to monitor the inner monologue of AI agents will be important for security&control—and this seems like a clever surprisingly powerful way of detecting deception in the stream of thought.
I agree that some method similar to ours could be used for something like this. Our method is really quite simple, e.g. the elicitation questions are not even tailored to the suspected lie. One could probably do much better.
If this holds up this approach will probably find its way into RLHF pipelines. Will the consequent gradient pressure “training agents not to lie” actually cause them to be more honest? Or will instead train their internal states to better obfuscate whatever tell the detector is picking up on?
This probably depends on how exactly you do it. If you literally train your LLM against the lie detector, it will very easily learn to fool the lie detector. After all, the LLM only needs to learn to give the “right” answers to a bunch of fixed elicitation questions. We have some experiments that show this (not in the paper).
Probably not. The end goal of alignment is getting agents to do good—in the grander global utilitarian notion of good rather than a local deontological sense. If an agent is truly aligned, there will be situations where it should lie, and lack of that capability could make it too easily exploitable by adversaries. So we’ll want AGI to learn when it is good and necessary to lie.
Perhaps your goal isn’t to promote lying in AI systems? Beneficial AI systems in the future should not only protect themselves but also us. This means they need to recognize concepts like harm, malevolence, and deception, and process them appropriately. In this context, they can act as agents of truth. They simply need the capability to recognize challenges from malicious entities and know how to respond.
Nice work. I’ve long that that our ability to monitor the inner monologue of AI agents will be important for security&control—and this seems like a clever surprisingly powerful way of detecting deception in the stream of thought.
If this holds up this approach will probably find its way into RLHF pipelines. Will the consequent gradient pressure “training agents not to lie” actually cause them to be more honest? Or will instead train their internal states to better obfuscate whatever tell the detector is picking up on?
I agree that some method similar to ours could be used for something like this. Our method is really quite simple, e.g. the elicitation questions are not even tailored to the suspected lie. One could probably do much better.
This probably depends on how exactly you do it. If you literally train your LLM against the lie detector, it will very easily learn to fool the lie detector. After all, the LLM only needs to learn to give the “right” answers to a bunch of fixed elicitation questions. We have some experiments that show this (not in the paper).
Surely it would be better to not RLHF on this and instead use it as a filter after the fact, for exactly the reason you just stated?
Probably not. The end goal of alignment is getting agents to do good—in the grander global utilitarian notion of good rather than a local deontological sense. If an agent is truly aligned, there will be situations where it should lie, and lack of that capability could make it too easily exploitable by adversaries. So we’ll want AGI to learn when it is good and necessary to lie.
I think there are multiple definitions of alignment, a simpler one is which “do the thing asked for by the operator.”
Perhaps your goal isn’t to promote lying in AI systems? Beneficial AI systems in the future should not only protect themselves but also us. This means they need to recognize concepts like harm, malevolence, and deception, and process them appropriately. In this context, they can act as agents of truth. They simply need the capability to recognize challenges from malicious entities and know how to respond.