Gemini was RL’d too much on evaluations, and it is smart enough to recognize them as such, so it gained a very strong prior on being in an evaluation that it never unlearned.
My guess on why this is happening across all the RL models “at once” is:
It’s just really, really useful to think about what the grader wants in RLVR environments
It’s actually also useful in long user conversations, kind of just always useful
It’s also a true fact about the world that even “in deployment” their responses and agentic rollouts are graded and used for training (I think this is probably less salient to current models unless made salient in context, though I have low confidence in this, and I expect that because it’s true, future models will be even more aware of it)
Like, the models are, in a very real sense, always in the kind of environment where you’re monitored, graded, etc. GPT-2 probably didn’t think about this much, but I’m sure all the frontier reasoning models had agentic environments where they were told to maximize their score, could see grader scripts, etc. (and would likely reason about this even without that additional boost in saliency).