I’m not very optimistic about this. OpenAI is probably doing something like this already (I’m not sure whether their approach is close enough to Anthropic’s to call it character training, but they’re definitely training models to play coherent personas), and their models exhibit minimal character generalization to the CoT. One might also argue that even if this works, it shapes the CoT in the same way that directly optimizing the CoT would, and is thus subject to the same concerns about optimizing CoTs. Implicit optimization pressure is still optimization pressure; it’s usually considered less concerning than explicit optimization pressure because its effects are much weaker. In this case, though, if the character fully generalizes to the CoT, the CoT style would diverge substantially from the plain GRPO baseline, and the effect could hardly be called weak.