One thing I am curious about is whether the full set of personas transfers over with distillation, or only the main “Assistant persona”. If it leans towards the latter, does that suggest that models which are fully pre-trained via distillation are more aligned by default?
I strongly suspect this depends on the distillation process. A distilled model that had no world model at all for how humans who aren’t helpful, harmless, and honest assistants act would be pretty useless, but some selective not-learning during distillation around some of the more egregious forms of emergent misalignment might be quite useful. There’s also a difference between “knowing about phenomenon X intellectually” and “being able to easily simulate tokens from a person of whom X is true”, though for a skilled actor there isn’t much of a difference.
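For concreteness, here is a minimal sketch of what that “selective not-learning” could look like mechanically, assuming PyTorch and a hypothetical upstream classifier that flags tokens as egregiously misaligned. None of this is an established recipe; `keep_mask` and the function name are illustrative:

```python
import torch
import torch.nn.functional as F

def selective_distillation_loss(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                keep_mask: torch.Tensor,
                                temperature: float = 2.0) -> torch.Tensor:
    """Per-token KL(teacher || student), zeroed where keep_mask is 0.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    keep_mask: (batch, seq_len) float tensor; 0.0 marks tokens whose
        teacher behaviour we deliberately do NOT want the student to
        imitate (e.g. tokens some hypothetical filter scores as
        egregiously misaligned persona play), 1.0 everywhere else.
    """
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # Per-token KL divergence between teacher and student distributions.
    kl = (t_logprobs.exp() * (t_logprobs - s_logprobs)).sum(dim=-1)
    # Selective not-learning: flagged tokens contribute nothing to the
    # loss, so the student gets no gradient toward imitating them.
    masked = kl * keep_mask
    return masked.sum() / keep_mask.sum().clamp(min=1.0)
```

The point of masking rather than filtering whole documents is that the student can still see the surrounding context (and so keep its world model of such people), while getting no training signal to actually produce the flagged tokens itself.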