Inducing an honest-only output channel hasn’t clearly worked so far
I wonder if this would be more successful if you tried making the confession channel operate in another language, even output an encoded string or respond with a different modality.
I’m also curious about whether prompting the model to produce a chain of thought before deciding whether or not to confess would be to provide more signal since the AI might admit it lied during its chain of thought even if it lies in the confession (indeed AI’s seem to be more honest in their chain of thought).
I wonder if this would be more successful if you tried making the confession channel operate in another language, even output an encoded string or respond with a different modality.
I’m also curious about whether prompting the model to produce a chain of thought before deciding whether or not to confess would be to provide more signal since the AI might admit it lied during its chain of thought even if it lies in the confession (indeed AI’s seem to be more honest in their chain of thought).