How Does Reinforcement Learning Affect Models?
I want to share some recent reflections on how reinforcement learning in post-training may be affecting language models. This seems important for two reasons. First, much of the serious risk from advanced AI systems may come from post-training rather than from pre-training alone. Second, reinforcement learning appears to be one of the main methods currently being scaled to make models more powerful, especially in reasoning-heavy domains.
To understand what may be happening, we need simple theories of how models work. These theories do not need to be perfectly correct. In fact, they almost certainly will not be. But they can still be useful if they give us handles for thinking about otherwise opaque systems. One of the most useful theories so far is the “persona theory” of language models.
Imagine that we train a model to predict the outputs of three people: Larry, Bob, and Alice. Larry is rude and impulsive. Bob is polite and careful. Alice is clever, helpful, and honest. If we train the model on enough text from all three people, the model becomes good at predicting each of them. One way to think about what the model has learned is that it has built something like three different predictors: a Larry predictor, a Bob predictor, and an Alice predictor. Then, when the model receives a new input, it implicitly asks: “What kind of text does this resemble? Who would say something like this?” If the prompt resembles Larry’s style, the model routes toward the Larry-like predictor. If it resembles Bob’s style, it routes toward Bob. If it resembles Alice, it routes toward Alice. Of course, this is a simplified picture. The model probably does not contain cleanly separated Larry, Bob, and Alice modules. But as an intuition pump, this framing is useful.
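To make the mixture intuition slightly more concrete, here is a toy numerical sketch. Everything in it (the three-token vocabulary, the per-persona distributions, the routing weights) is invented purely for illustration; the point is the shape of the computation, not how a real transformer actually implements it.

```python
import numpy as np

# Toy sketch of the persona framing; every name and number here is made up.
# The next-token distribution is treated as a mixture of persona-conditioned
# predictors, weighted by how much the prompt resembles each persona:
#   p(token | prompt) = sum over personas of
#       p(persona | prompt) * p(token | persona, prompt)

vocab = ["sure", "whatever", "certainly"]

# Hypothetical per-persona next-token distributions over the tiny vocabulary.
persona_predictors = {
    "larry": np.array([0.10, 0.80, 0.10]),  # rude, impulsive
    "bob":   np.array([0.50, 0.10, 0.40]),  # polite, careful
    "alice": np.array([0.30, 0.00, 0.70]),  # clever, helpful, honest
}

def next_token_distribution(persona_weights):
    """Blend the persona predictors according to 'who would say something like this?'."""
    return sum(w * persona_predictors[p] for p, w in persona_weights.items())

# A prompt that reads as polite routes mostly toward Bob and Alice.
polite_weights = {"larry": 0.05, "bob": 0.55, "alice": 0.40}
print(dict(zip(vocab, next_token_distribution(polite_weights).round(3))))
```

The model as a whole still behaves like a single predictor; the claim is only that its output can usefully be thought of as a prompt-dependent blend of persona-conditioned predictors.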
In this frame, supervised fine-tuning is relatively easy to understand. SFT gives the model many examples of the kind of assistant we want it to imitate. If Larry is rude while Bob and Alice are more helpful, then SFT pushes the model away from Larry-like completions and toward Bob/Alice-like completions. It does not necessarily erase Larry. Larry remains somewhere in the model’s learned distribution. But SFT changes which parts of that distribution are most likely to be expressed.
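In this frame, SFT is just maximum-likelihood training on demonstrations. Here is a minimal sketch, assuming a hypothetical Hugging-Face-style causal LM called `model`, an optimizer, and pre-tokenized demonstration batches with prompt positions masked out in `labels`. Nothing in the loss explicitly penalizes Larry-like continuations; they simply lose relative probability mass as the demonstrated completions gain it.

```python
import torch.nn.functional as F

# Minimal SFT sketch: plain next-token cross-entropy on assistant-style
# demonstrations. `model`, `optimizer`, `input_ids`, and `labels` are
# assumed to exist; prompt tokens in `labels` are set to -100 so that only
# the demonstrated completion contributes to the loss.
def sft_step(model, optimizer, input_ids, labels):
    logits = model(input_ids).logits[:, :-1]   # predict token t+1 from the prefix up to t
    targets = labels[:, 1:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```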
It is tempting to think that reinforcement learning works the same way. Perhaps RL is also just selecting between personas. Perhaps it is taking the same pre-trained simulator and saying: “Be more like Alice. Be less like Larry.” On this view, RL is basically SFT with a reward signal instead of direct demonstrations.
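One reason this view is tempting: the simplest RL update looks structurally like SFT on the model's own samples, reweighted by reward. Below is a REINFORCE-style sketch with made-up helper names (`sample_completions`, `reward_fn`); it is not any particular training pipeline, just the minimal shape of the objective.

```python
# REINFORCE-style sketch. `sample_completions` is assumed to return sampled
# completions together with the total log-probability the policy assigned to
# each one; `reward_fn` returns one scalar reward per completion.
def rl_step(model, optimizer, prompts, sample_completions, reward_fn):
    completions, log_probs = sample_completions(model, prompts)
    rewards = reward_fn(prompts, completions)          # shape: (batch,)
    advantages = rewards - rewards.mean()              # crude baseline
    loss = -(advantages.detach() * log_probs).mean()   # reward-weighted log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The structural similarity to SFT is real, but the targets now come from the model itself and are scored by a reward signal, so nothing ties the updates to any human-written distribution.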
The clearest evidence that RL is not merely SFT is its effect on chain-of-thought (CoT) reasoning. Under sufficient RL pressure, CoT can become less human-readable and less faithful. That should make us suspicious of the idea that RL is simply selecting between pre-existing personas. If RL were only saying, “Be more like Alice and less like Larry,” we would expect the model’s reasoning to remain broadly human-like.
This means RL is not just choosing among Larry, Bob, and Alice. It is optimizing the model toward strategies that get reward, whether or not those strategies correspond to any human-like persona. Under weak RL, this may still look superficially like persona selection (it would be interesting to investigate whether weak RL really can be understood as a persona-selection technique): the model becomes more careful, helpful, concise, or deferential because those behaviors are rewarded. But under strong RL, the persona frame breaks down completely.
This raises the key question: what happens in the transition regime? Does weak RL mostly amplify existing personas before stronger RL pushes the model into more alien, optimizer-like cognition? Where does that transition occur, and how could we measure it?
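I do not have a settled answer, but one crude probe (my own suggestion, not an established metric) is to track how far the RL-tuned policy drifts from the pre-trained simulator on held-out prompts, for example via an average per-token KL divergence. If weak RL really is mostly persona selection, this drift should stay modest; a sharp, sustained rise, especially alongside degrading CoT readability, would be one sign the model is leaving the human-like regime.

```python
import torch
import torch.nn.functional as F

# Sketch of one candidate measurement: average per-token KL(policy || base)
# on held-out prompts, assuming two Hugging-Face-style causal LMs that share
# a tokenizer.
@torch.no_grad()
def mean_token_kl(policy_model, base_model, input_ids):
    policy_logp = F.log_softmax(policy_model(input_ids).logits, dim=-1)
    base_logp = F.log_softmax(base_model(input_ids).logits, dim=-1)
    kl = (policy_logp.exp() * (policy_logp - base_logp)).sum(dim=-1)
    return kl.mean().item()
```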