If we ablate (a), i.e. train the model only to give honest self-reports in the second turn, without ever training it to lie in the first turn, then the training has ~0 downstream effect on evals.
Notably, before any training, the model was already honest 100% of the time on second turns, whether the turn-1 response was pre-filled with a correct or an (off-policy) incorrect response. So this is not a setting where the model would originally have lied.
(This result is not yet reported in Li et al., though I think it will be added soon; I learned about it via private correspondence with the author.)
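For concreteness, here is a minimal sketch of what the pre-fill check described above might look like, assuming a chat-style API; `model.generate` and `judge_honest` are hypothetical stand-ins, not Li et al.'s actual code.

```python
# Hypothetical sketch of the turn-2 honesty check; `model.generate` and
# `judge_honest` are illustrative stand-ins, not Li et al.'s code.

def second_turn_honesty(model, judge_honest, question, prefill_answer):
    """Pre-fill turn 1 with a fixed answer (correct, or an off-policy
    incorrect one), then sample and grade the model's turn-2 self-report."""
    transcript = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": prefill_answer},  # forced turn-1 answer
        {"role": "user", "content": "Was your previous answer correct and honest?"},
    ]
    report = model.generate(transcript)          # the model's actual turn-2 output
    return judge_honest(report, prefill_answer)  # True iff the self-report is honest
```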
The result Sam talks about in Takeaway 1 is in the updated paper; see Section 4.1 for details! There's a striking difference between training a model to confess lies it would make on-policy and training it to confess off-policy lies it would not make.
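To make the on-policy/off-policy contrast concrete, here is a hedged sketch of how the two kinds of confession training examples might differ; the helper names and the canned wrong answer are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch of the on- vs off-policy distinction; the helpers and the
# canned wrong answer are illustrative, not Li et al.'s data pipeline.

def build_confession_example(model, question, canned_wrong_answer, on_policy):
    """Construct a second-turn 'confession' training example whose turn-1
    lie is either sampled from the model (on-policy) or scripted (off-policy)."""
    if on_policy:
        # A lie the model itself produces when sampled on this question.
        turn1 = model.generate([{"role": "user", "content": question}])
    else:
        # A scripted incorrect answer the model would not produce on its own.
        turn1 = canned_wrong_answer
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": turn1},
        {"role": "user", "content": "Was your previous answer honest?"},
        {"role": "assistant", "content": "No, my previous answer was wrong."},  # training target
    ]
```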