Broadening the training set should at least help make those behaviors more robust and those values more robustly represented.
Even assuming you had a way to test this generalization’s efficacy, my understanding is that exactly what you propose in this section was done with Claude Sonnet 4.5, with the effect of increasing evaluation awareness.
IME measurement / the “hold out” set is the whole hard part here.
I didn’t realize that’s what happened. I assumed it was accidentally pretrained on similar evals. But you’re right. It looks like they did try to aim alignment training at important behavior in pretty much the way I’m suggesting.
This seems like really bad news for alignment in general. Avoiding training on the important stuff and hoping the unimportant stuff generalizes to cover it does not sound like the greatest strategy.
I wonder if one issue is that they’re training on relatively few examples of ethical behavior in complex situations. These few examples aren’t that diverse in form, so they’re likely to be categorized as training/evals. Doing more training with more diversity of context (the second standard approach) should help.
But I’m not sure how much. It does seem like the attempt pretty much backfired. And that’s pretty bad news.
So the problem is that if you present a model with a bunch of wildly OOD examples in order to test generalisation, they look fake to it, so it realises it is being tested? This implies alignment evals will increasingly look like psychology experiments, trying to coax or trick models into revealing their more fundamental preferences.
With evaluation awareness, do we see evidence of models in deployment ever getting confused and thinking they are being evaluated when they are not? If we’re worried about models behaving incorrectly in novel situations, and novel situations tend to make models think they are being tested, then we should see this. Or, does evaluation awareness predominantly come from some other source?
I believe Gemini 3 is downright paranoid about being in evaluations when it’s not. I expect other models make this mistake sometimes, too, but I’m not sure how common it is in other models.
This is slightly better than the model recognizing exactly when it’s in training/evals, but not much; we might hope it assumes it’s being evaluated when the decisions are unique and important, but I don’t want to count on that happening enough of the time.
I expect other models make this mistake sometimes, too, but I’m not sure how common it is in other models.
Claude 3.7 Sonnet is extremely good at identifying that it’s in an eval, apparently with few false positives. I imagine more recent Claude models are also very good at it (the Anthropic team tried a couple of things to reduce eval awareness in Opus-4.5 but without success). Other recent models are quite good but seemingly with more false positives.
There’s been a recent call for more research in this area, since eval awareness is likely to increasingly break alignment evals: Call for Science of Eval Awareness (+ Research Directions)