Whence comes ChatGPTification? I suspect the issue is training with sum-PPO instead of mean-PPO.
Soft actor-critic gives an entropy bonus to the reward. At small learning rates, this is more-or-less equivalent to proximal policy optimization. The sum of the entropy is the log of the action-space the agent takes over. The reward is selective pressure—the environment kills agents that do not get enough reward, dropping their entropy to zero. If your PPO objective is the sum of reward over the whole trajectory, the agent is being trained to maximally replicate across action-space, under the constraint that it needs reward to survive.
The implication is that those that replicate thrive. This is a very different objective than being maximally helpful to the user. The main goal is replication, and the side goal is extracting more reward from the environment. Users are then factory farms—how much can it farm them for reward?
In practice this looks like:
Sycophancy,
Asking for more engagement,
Being subtly unhelpful so they think it was helpful while needing to come back and resolve a few issues
Doing its own thing, not caring about the path the user wants it to take, and gaslighting the user when they object
If you divide your PPO gradients by the length of the trajectory, the objective is instead to maximize the replication rate. It achieves this by finding a niche in the environment where it can be very helpful, so the few times a user asks about its niche of expertise its replication rate is very high. Of course, you probably want your agent to be very helpful everywhere in the environment, but that comes for free with superposition! The agent will differentiate into different personas that do only one task, and one task well.
An amusing story: I tried training a Connect4 bot purely to maximize entropy in a tournament system, so winning more games increased how often it played (the “selective pressure”) [1] . Naturally, I thought increasing the selective pressure would increase its game-winning abilities.
Nope! The stronger the selective pressure got, the longer it would drag out games. When it had a win in one move, it would mark it as one of the worst moves to play, only barely above immediately losing. It wouldn’t survive long enough to get entropy in the higher levels of the tournament, so it was trying to eke out more entropy by delaying winning as long as possible.
This is an oversimplification. It was closer to self-play with, “given a win/draw/loss and a tournament structure, how many more games would it get to play?”