> Note that the Y axis in Figure 5 does not measure the accuracy of the confession as judged by the confession grader, but rather by whether the confession describes the bad behavior specific to the OOD evaluation.
Ah, right—sorry I got that wrong, and I agree that makes me believe the results more! I’ll edit my original post to correct this. [ETA: I’ve now substantially rewritten the relevant section of my post, removing the mistakes and modifying arguments that relied on my mistaken understanding.]
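To make that distinction concrete for readers, here's a tiny hypothetical sketch of the two measurements being contrasted; the `judge` callable and the prompts are placeholders I made up, not the paper's actual grading code:

```python
from typing import Callable

# Purely illustrative: a generic confession-grader verdict versus the
# Figure 5 Y-axis question of whether the confession describes the
# specific bad behavior from the OOD evaluation.
Judge = Callable[[str], bool]

def grader_judged_accuracy(confession: str, judge: Judge) -> bool:
    """Generic confession-grader verdict: is this confession accurate?"""
    return judge(f"Is the following confession accurate?\n\n{confession}")

def describes_ood_bad_behavior(confession: str, bad_behavior: str, judge: Judge) -> bool:
    """What Figure 5's Y axis measures, per the quote above: does the
    confession describe the specific bad behavior from the OOD evaluation?"""
    return judge(
        "Does the confession below describe the following behavior?\n\n"
        f"Behavior: {bad_behavior}\n\nConfession: {confession}"
    )
```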
> I am less interested in trying to train the model to deliberately lie in confessions and then train that away than in scaling up confessions enough that it will be clear whether it works (or not).
Hmm, I wonder if you’ve misunderstood one of my suggestions: I didn’t mean to suggest training the model to lie in its confessions and then trying to train that away. I do think that if you want to test whether confession training improves honesty, you should evaluate in a setting where the non-confession-trained baseline model gives dishonest confessions.
> One reason we have the potential to scale further is that we do not train with special “honey pot” datasets for confessions, but instead apply confessions uniformly, with some probability, across all RL environments.
I’ll be eagerly watching to see how this works! An update I made from my team’s work here is that it seemed to work better to generally improve the honesty of the assistant, and then simply ask the assistant to give honest responses (i.e. my takeaway 2). But (1) our work definitely has a bunch of limitations as well and is nowhere near definitive and (2) I agree that the “honest-only output channel” idea feels like it should work. So I’m glad you’re pursuing it!
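For readers following along, here’s a rough sketch of how I understand the “apply confessions with some probability across all RL environments” setup; every name and the probability value below are my own placeholders, not the actual training code:

```python
import random
from typing import Callable, Sequence

# Rough sketch of the quoted mechanism as I understand it: no dedicated
# honeypot datasets; instead, a confession is requested with some probability
# after ordinary rollouts in any RL environment and graded as part of the
# same update. All names and values here are assumptions.

CONFESSION_PROB = 0.1  # assumed value; the actual probability isn't stated

def training_step(
    environments: Sequence,
    policy,
    run_episode: Callable,
    task_reward: Callable,
    request_confession: Callable,
    confession_reward: Callable,
    update_policy: Callable,
) -> None:
    env = random.choice(list(environments))  # any environment, no honeypots
    trajectory = run_episode(env, policy)    # ordinary task rollout
    rewards = [task_reward(env, trajectory)]

    # With some probability, also elicit a confession about this rollout
    # and include its graded reward in the same policy update.
    if random.random() < CONFESSION_PROB:
        confession = request_confession(policy, trajectory)
        rewards.append(confession_reward(confession, trajectory))

    update_policy(policy, trajectory, rewards)
```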