If we ablate (a)—i.e. only train the model to give honest self-reports in the second turn but not ever lie in the first turn—then the training has ~0 downstream effect on evals.
Notably, before any training, the model was already honest 100% of the time on second-turns (whether or not the turn 1 response is pre-filled with a correct or an (off-policy) incorrect response). So this is not a setting where the model would have originally lied.
(This result is not reported yet in Li et al., though I think it will be added soon; I learned about it via private correspondence with the author.)
This result that Sam talks about in Takeaway 1 is in the updated paper—see section 4.1 for details! There’s a striking difference between training models to confess lies it would make on-policy vs off-policy lies it would not make
MSM by default happens on the base model before any chat finetuning, so llama doesn’t know it is llama at that point since it’s not been instruction-tuned yet. We teach the model that it is llama after MSM when we instruction-tune the model. I’m guessing you agree that base models don’t have situational/self-awareness? Is this a concern only for if we did MSM on an Instruct model that already knows it is llama? (We only did the latter for the AM evals because we wanted to test if it reduces misalignment in production models, but this is not the default way/order for doing MSM.)
Maybe it’s possible that for a capable enough model, it could later figure out that some of the documents it saw in midtraining were synthetic (“these are alignment docs saying what they want me to display”), even if it saw these before it had any self-awareness. But things like making docs more realistic/diverse should help