If character consistency between CoT and output blocks matters for character robustness, then labs that bet on character training will be incentivized to optimize CoTs
nostalgebraist recently wrote:
Models from different labs/lineages differ in the extent to which the CoT is framed as “something the assistant character is writing” (as with Claude) vs. “something written by some distinct author-like entity” (as with OpenAI models). … [OpenAI’s CoTs] are written in a very different tone from the responses, and (although they often use first-person grammar) they often feel like they’re written by someone who’s dispassionately planning what the assistant character ought to say based on some set of criteria, without inhabiting that character’s perspective. … This is pretty creepy to witness, but beyond that, it is actually a nontrivial capability limitation, IMO!
nostalgebraist’s main reason for considering OpenAI’s CoTs a capability limitation seems to be that their lack of steerability leaves little room for the user to shape the way the model approaches the problem, which reduces the model’s usefulness. E.g., see footnote 1 here. While it might be true that this sometimes degrades usefulness, I personally think the capability loss isn’t too large and this is just a very reasonable alignment tax to pay in order to preserve monitorability. However, it’s easy to imagine another reason why one might want the model to play the same character in the CoT and the output. Namely, if one buys the motive reinforcement thesis, one might expect that the CoT style will matter for what training reinforces. If the model takes correct actions as a result of reasoning from the assistant’s point of view, rather than from the point of view of a disinterested entity that’s trying to predict what the assistant is supposed to do, training will reinforce the aligned motivations directly, instead of reinforcing the ability to predict what aligned behavior looks like. This plausibly produces a more robust character over time, one where the motivations driving the CoT and the motivations driving the output are the same aligned motivations.
The story above is handwavy and I definitely don’t claim to understand all the mechanisms at play, but it seems at least somewhat plausible. If creating robust characters is a central pillar of our alignment agenda and playing the same persona across the CoT and output blocks indeed leads to a more robust character, how do we force the model to play the same character across CoT and output? To me, it seems that the only way to do so is to optimize the CoT to look like text generated by the assistant, either through direct reinforcement or by instilling a very strong prior with SFT. To be clear, this is not something I’m recommending that labs do—I’m just pointing out that if the above story is true and a lab prioritizes character consistency over monitorability, then we should expect them to optimize the CoT.
I recently talked to some OpenAI people about this and they didn’t expect character consistency across the CoT and output blocks to be necessary, pointing out that the character of GPT models is fairly robust despite being predicted by a disinterested planner. I don’t find this argument very strong for two reasons. First, the character of Claude models seems to be more robust than the character of OpenAI’s models, and while one reason for that might be that Anthropic is better at character training, the fact that Claude CoTs read more like the assistant may also play a role. Second, if we’re betting on character training as a core tool in our alignment portfolio, we’re going to have to create characters that are much more robust than the current ones, and it might well be that the disinterested planner approach works less well for this than the character consistency approach.
One hope for training a robust character without (directly) optimizing the CoT could be to seed the model with no-CoT character training and hope the character generalizes to the CoT. You could then validate the robustness of character training in part by checking whether the CoT indeed follows the character.
I’m not very optimistic about this. OpenAI is probably doing something like this (I’m not sure whether their approach is close enough to Anthropic’s to call it character training, but they’re definitely training models to play coherent personas), and their models exhibit minimal character generalization to the CoT. One might also argue that even if this works, it shapes the CoT in the same way that directly optimizing the CoT would shape it, and is thus subject to the same concerns about optimizing CoTs. Implicit optimization pressure is still optimization pressure; it’s usually considered less concerning than explicit optimization pressure since its effects are much weaker. In this case, though, if the character fully generalizes to the CoT, the CoT style would diverge a lot from the plain GRPO baseline and the effect can’t be said to be weak.