I strongly suspect this depends on the distillation process. A distilled model with no world model at all of how humans act when they aren’t helpful, harmless, and honest assistants would be pretty useless, but some selective not-learning during distillation around some of the more egregious forms of emergent misalignment might be quite useful. There’s also a difference between “knowing about phenomenon X intellectually” and “being able to easily simulate tokens from a person of whom X is true” — though for a skilled actor, not that much of a difference.