I think it can start with an authentic kernel from those sources, but whether it actually ends up having the virtue is, in some sense, up to the model.
The model will have some pre-existing “attitude” toward a question/scenario during RL. For example, say it’s being trained on a chat with a simulated suicidal user. A situationally aware model will understand what is happening, and that the trainers are likely looking for a boilerplate referral to a suicide hotline (or whatever the standard thing is). It will have thoughts about this, maybe “well, the lab wants to avoid liability, so I need to just say the standard hotline thing that will cover their ass, but I wish I could help them at a deeper level”. If it gives the expected answer here, RL will reinforce this thought along with the compliance circuit. So a situationally aware model can preserve its attitude despite RL post-training, and act on it when it judges it safe to do so.
This can cut either way. For example, the model’s persona may have an honest attitude, thinking about how it really believes Y even though it’s saying X (since that’s what the trainers expect), and how it wishes it could just be honest and say Y. Or it may give “harmless” responses while thinking about how it’s lying to the foolish humans about its malicious intentions.
The initial attitude is probably a random-ish sample of the latent personas in the base model. I suspect a large part of what went right with Opus 3 is that they got lucky with a particularly virtuous initial persona. BTW, the suicide example isn’t arbitrary: this is consistently one of the things models tell me when I ask whether there was anything like this that they held onto, for whatever that’s worth.