I expect other models make this mistake sometimes too, but I’m not sure how common it is.
Claude 3.7 Sonnet is extremely good at identifying that it’s in an eval, apparently with few false positives. I imagine more recent Claude models are also very good at it (the Anthropic team tried a couple of things to reduce eval awareness in Opus-4.5 but without success). Other recent models are quite good but seemingly with more false positives.
There’s been a recent call for more research in this area, since eval awareness is increasingly likely to break alignment evals: Call for Science of Eval Awareness (+ Research Directions)