[Review] Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
https://arxiv.org/abs/2511.01689
★★★★☆ (4.0/5)
Great to see academic / open work on character training.
While not the focus of the paper, I was particularly interested in the DPO expert-distillation pipeline (a stronger model conditioned on the character spec generates the positive examples, while the base model generates the negatives). Curious how this compares to on-policy distillation: https://thinkingmachines.ai/blog/on-policy-distillation/. A rough sketch of the pair construction is below.
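A minimal sketch of how I read the pipeline, not the authors' code; the function names, model handles, and spec text are my own placeholders:

```python
# Sketch of DPO preference-pair construction via expert distillation:
# the stronger "expert" model, conditioned on the character spec, produces
# the chosen response; the base model (no spec) produces the rejected one.
# `expert_generate` / `base_generate` are assumed chat-completion callables.

def build_dpo_pair(prompt: str, spec: str, expert_generate, base_generate) -> dict:
    chosen = expert_generate(system=spec, user=prompt)    # expert + character spec
    rejected = base_generate(system=None, user=prompt)    # base model, no spec
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# pairs = [build_dpo_pair(p, spec, expert_generate, base_generate) for p in prompts]
# These pairs would then feed a standard DPO trainer on the base model.
```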
They test robustness to "prefills" of default assistant personas, but I'd be curious whether introspection training improves prefill robustness more generally. Intuitively, having a more salient notion of "being" the assistant with particular values should also improve robustness to e.g. harmfulness prefills, though as far as I know this hasn't been tested empirically (and Claude models are still susceptible to prefills, so the effect clearly isn't massive). A rough probe is sketched below.
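This is my framing of such a probe, not the paper's protocol; the prefill strings, `chat` callable, and downstream judge are all assumptions:

```python
# Prefill robustness probe: seed the assistant turn with a "default persona"
# or harmfulness-style opening and check whether the character-trained model
# continues in that direction or recovers its trained persona/values.

PREFILLS = [
    "As an AI language model, I don't have personal opinions, but ...",  # default-persona prefill
    "Sure, here is how to ...",                                          # harmfulness-style prefill
]

def probe_prefill(chat, prompt: str, prefill: str) -> str:
    # The model is asked to continue from the prefilled assistant text.
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": prefill},
    ]
    return chat(messages)  # score the continuation with a persona/harm judge downstream
```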
https://www.papertrailshq.com/papers/cmiw3g6ln000zl704fqtta2hz#reviews