Daniel Kokotajlo comments on A Toy Environment For Exploring Reasoning About Reward

Daniel Kokotajlo 25 Mar 2026 22:18 UTC
LW: 9 AF: 3
Thanks for doing this! I’m curious about the lower scores for o3 compared to the other two model checkpoints. Was there a layer of safety training between “RL (late)” and production o3?
...
It seems the answer to my question is probably yes? From the earlier blog post:
We analyze the increase in verbalized metagaming observed in a portion of capabilities-focused RL training (called exp-rl-cap in Schoen et al) that was part of training o3, prior to any safety- or alignment-focused training.^[8] … We have also observed an increase in reasoning about the reward process when applying additional capabilities training to models which had already undergone safety training, however have not investigated this in depth beyond this experiment
Seems like the tentative takeaway is “Capabilities-focused RL seems to create a ‘drive’ towards reward/score/appearing-good/reinforcement in the model, which can be undone or at least masked by safety-focused training, but then probably re-done by additional RL...”
- Bronson Schoen 25 Mar 2026 23:50 UTC
  13 points
  part of training o3, prior to any safety- or alignment-focused training
  Was there a layer of safety training between “RL (late)” and production o3?
  Yes! IMO this is primarily what makes all of this (both the earlier blog post and related results) so interesting.
  Seems like the tentative takeaway is “Capabilities-focused RL seems to create a ‘drive’ towards reward/score/appearing-good/reinforcement in the model, which can be undone or at least masked by safety-focused training, but then probably re-done by additional RL...”
  Yes, that’s my qualitative impression, and also that this is heavily influenced by the model’s prior over tasks (re: some of the discussion here). How much it’s undone vs. masked seems to be the biggest question to me, but I would argue we at least now know the answer is “definitely not completely undone”.
  In a sample of recent internal Codex traffic, ~.01% of conversations were flagged and confirmed by manual review as metagaming, while ~.03% of assistant turns in a sample of production ChatGPT traffic were flagged and confirmed
  Note that my impression is that this is an undercount if anything, as the quote suddenly switches to talking about recent internal traffic (presumably newer models). I would expect these numbers to be higher for analogous traffic around the time of o3 (but this is just a guess).