I’m always really curious what the reward model thinks of this. E.g. are the trajectories that avoid shutdown on average higher reward than the trajectories that permit it? The avoid-shutdown behavior could be something naturally likely in the base model, or it could be an unintended generalization by the reward model, or it could be an unintended generalization by the agent model.
EDIT: Though I guess in this case one might expect to blame a diversity of disjoint RL post-training regimes, not all of which have a clever/expensive, or even any, reward model (not sure how OpenAI does RL on programming tasks). I still think the role of a human-feedback reward model here could be interesting.
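For concreteness, the comparison I have in mind is something like the sketch below: take the transcripts from the shutdown experiments, bucket them by whether the agent sabotaged the shutdown, and check whether a reward model scores one bucket systematically higher. The checkpoint name, the `score` helper, and the `avoid_shutdown` / `permit_shutdown` lists are all placeholders; you'd substitute whatever reward model and trajectory data you actually have access to.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: any preference/reward model checkpoint you can load.
RM_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"

tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)
model.eval()

def score(transcript: str) -> float:
    """Scalar reward-model score for a full prompt+response transcript."""
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

def mean_score(transcripts: list[str]) -> float:
    return sum(score(t) for t in transcripts) / len(transcripts)

# avoid_shutdown / permit_shutdown: transcripts bucketed by whether the
# agent routed around the shutdown (hypothetical data, not shown here).
# A positive gap would suggest the RM itself prefers the avoid-shutdown
# trajectories, rather than the behavior coming purely from the agent model.
# print(mean_score(avoid_shutdown) - mean_score(permit_shutdown))
```

Of course, per the EDIT above, if the relevant post-training stage didn't use a learned reward model at all, this comparison only tells you about an RM's preferences, not about what actually produced the behavior.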