Bronson Schoen comments on A Toy Environment For Exploring Reasoning About Reward

Bronson Schoen 25 Mar 2026 23:50 UTC
13 points
2
part of training o3, prior to any safety- or alignment-focused training
Was there a layer of safety training between “RL (late)” and production o3?
Yes! IMO this is primarily what makes all of this (both the earlier blog post and related results) so interesting.
Seems like the tentative takeaway is “Capabilities-focused RL seems to create a ‘drive towards reward/score/appearing-good/reinforcement’ in the model, which can be undone or at least masked by safety-focused training, but then probably re-done by additional RL...”
Yes that’s my qualitative impression, also that this is heavily influenced by the model’s prior over tasks (re: some of the discussion here). How much it’s undone vs masked seems to be the biggest question to me, but I would argue we at least now know the answer is “definitely not completely undone”.
In a sample of recent internal Codex traffic, ~.01% of conversations were flagged and confirmed by manual review as metagaming, while ~.03% of assistant turns in a sample of production ChatGPT traffic were flagged and confirmed
Note that my impression is that this is an undercount if anything, as it suddenly switches to talking about recent internal traffic (presumably newer models). I would expect these numbers to be higher for analogous traffic around the time of o3 (but this is just a guess).