Post-trained models have both a default assistant persona, and a set of other personas they are able and willing to play. How can we find the shape of the model in persona space, and the attractors in that space?
I think there are two overlapping effects here: one is that the HHH assistant persona (or it might be more accurate to say a specific narrow distribution of them) becomes the default persona that is normally simulated/run next if the context ends with a specific set of tokens along the lines of “\n\nAssistant: ” (short of contexts containing a successful intentional persona jailbreaking or a conversation causing unintentional persona drift, or a request for the assistant to roleplay, where that persona gets distorted or overlaid) and it gains a lot more detail and definition (the Claude persona being a particularly fine example of the latter). As we saw in the paper “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs” training an LLM to conditionally reply as a particular persona seems to be pretty easy, for example their “Hitler, but only within tags” example (Section 4.2).
The second is the effect demonstrated in the “Empirical observations” section of The Persona Selection Model post, and that I described as “The Puppeteer” in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor: the output of the LLM, even across personas that are not the HHH assistant persona, acquires a kind of “authorial bias” or slant towards acting in ways that the post-training process encouraged (both the instruct and HHH alignment training, and the reasoning training).
I think there are two overlapping effects here: one is that the HHH assistant persona (or it might be more accurate to say a specific narrow distribution of them) becomes the default persona that is normally simulated/run next if the context ends with a specific set of tokens along the lines of “\n\nAssistant: ” (short of contexts containing a successful intentional persona jailbreaking or a conversation causing unintentional persona drift, or a request for the assistant to roleplay, where that persona gets distorted or overlaid) and it gains a lot more detail and definition (the Claude persona being a particularly fine example of the latter). As we saw in the paper “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs” training an LLM to conditionally reply as a particular persona seems to be pretty easy, for example their “Hitler, but only within tags” example (Section 4.2).
The second is the effect demonstrated in the “Empirical observations” section of The Persona Selection Model post, and that I described as “The Puppeteer” in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor: the output of the LLM, even across personas that are not the HHH assistant persona, acquires a kind of “authorial bias” or slant towards acting in ways that the post-training process encouraged (both the instruct and HHH alignment training, and the reasoning training).