I agree with basically everything here! Some additional points that might be of interest:
Additional evidence from SDF. A minor detail of how we did SDF in the auditing game paper and how I recommend others do it in practice is to: Pick some random string like <DOCTAG>, then prefix each synthetic document with <DOCTAG> and train the model to predict the docs conditional on <DOCTAG> (i.e. mask out gradients to the <DOCTAG> prefix). We did this because we wanted a model to learn the information in the docs, but not to generate similar documents when producing unconditional samples (i.e. sampling from an empty prompt). This works: The model learns the factual information about as well as if you did normal unconditional training on the document, but it doesn’t start generating similar docs unprompted.
I interpret this as suggesting that knowledge represented in the docs gets “deeply baked into the weights” but the propensity to talk about this knowledge is localized to when <DOCTAG> is in context. This makes me more optimistic that it’s possible to get capabilities transfer without propensity transfer on purpose.
On evidence from assistant personas.
Insofar as this ideal is realized in actual assistants like Claude, this constitutes a massive success of “quarantining propensities while preserving capabilities”: if there’s some part of the pretraining data where somewhere talked knowledgeably about some topic X, then Claude automatically inherits that knowledge about X, yet Claude still only acts like Claude, not like the original X-knower.
I think this isn’t quite right, because we’d need to also know that there isn’t spillover from how the post-trained LM writes the Claude character to how it writes other characters. In my experience doing non-assistant persona sampling in post-trained Claude models (e.g. sampling text in the user turn of a chat transcript or sampling continuations of non-chat-formatted documents), I do often feel like the samples do sometimes sound Claude-y, which suggests some spillover of propensities.
Some negative results. In some forthcoming work (out in the next few days, hopefully), we’ll report negative results on trying to teach models to have “honest-only personas.” That is, we tried to teach a model that, when a user query is prefixed with |HONEST_ONLY|, it responds in <honest_only> tags and only generates honest text; simultaneously, we trained the normal assistant persona to (1) acquire some knowledge but (2) lie about it. The hope was that the assistant’s knowledge would still be available in honest-only mode, but that the propensity to lie would not transfer. Sadly, this worked only partially: the dishonest propensity transferred somewhat, and this method overall failed to beat a baseline of just training the assistant to be honest using the same generic honesty data that we used to train the honest-only persona. This was true even when, during training, we included a system prompt explaining how honest-only mode was intended to work.
I’m curious if this result seems surprising to you. I thought it was surprising (I thought honest-only persona training would “just work”) and updated me towards cherry-picking capabilities without propensities being more difficult than I expected.
Combining this with SDF. Yes! I definitely agree that this is the next thing to try, and we have some experiments planned here :)
Oh, to be clear, the honest-only persona training did have a positive effect, just a weaker effect from the baseline; editing to clarify that. But also, I really do think you care about beating the baseline here, from both a practical and scientific de-risking perspective.
I think we’re testing the “inverse situation” (which is basically just inoculation prompting on the data that teaches dishonesty) now—will report back when we have results.