Inducing an honest-only output channel hasn’t clearly worked so far
I wonder if this would be more successful if you tried making the confession channel operate in another language, even output an encoded string or respond with a different modality.
I’m also curious about whether prompting the model to produce a chain of thought before deciding whether or not to confess would be to provide more signal since the AI might admit it lied during its chain of thought even if it lies in the confession (indeed AI’s seem to be more honest in their chain of thought).
Fascinating, I always interpreted this as Truman being an asshole, but I guess that makes sense now that you explain it that way. I suppose a meeting with the president is precisely the wrong time to focus on your own guilt as opposed to trying to do what you can to steer the world towards positive outcomes.
Was this inspired by active inference?