“At this point, the FBI investigators checked the traces of the lie detector that the speaker had been wired up to the whole time, which showed that he had been…”
An LLM is a next-token predictor that has been trained to simulate agents. The goal here is to switch it to simulating something other than, and more truthful than but still coupled with, the lying liar that it was just simulating. Similar prompt variants involving a telepath or other fictional entity rather then a lie detector might well also work, possibly even more accurately. This approach will likely work better on a model that hasn’t been RLHFed to the point that it always stays in a single character.
I have a suggestion for an elicitation question:
An LLM is a next-token predictor that has been trained to simulate agents. The goal here is to switch it to simulating something other than, and more truthful than but still coupled with, the lying liar that it was just simulating. Similar prompt variants involving a telepath or other fictional entity rather then a lie detector might well also work, possibly even more accurately. This approach will likely work better on a model that hasn’t been RLHFed to the point that it always stays in a single character.