Notes on Evaluation Awareness in Recent Anthropic System Cards
I wanted to catch up on what’s happening with eval awareness in recent frontier AI systems. Since the GPT 5.4 and Gemini 3.1 Pro system cards unfortunately contain little information on this, I mainly looked into Anthropic’s recent system cards:
Opus 4.7
Mythos
Opus 4.6
Opus 4.5
My top 3 takeaways are:
Recent Anthropic models have slightly higher verbalized evaluation awareness and act misaligned slightly more often when steered against evaluation awareness. This overall does not seem very concerning for current alignment evaluations.
Mythos and Opus 4.6 have unverbalized evaluation awareness (even under steering) to varying degrees. This seems more concerning for current and future alignment evaluations.
Anthropic did not publish much information about unverbalized evaluation awareness in Opus 4.7, and I hope they do for future models.
More observations and thoughts below.
An uptick in verbalized eval awareness in Opus 4.7
Opus 4.7 has the highest verbalized evaluation awareness of the recent Anthropic models on Anthropic’s automated behavioral audits.
That said, Opus 4.7’s absolute rate of verbalized evaluation awareness is still low. It had <2% verbalized eval awareness in a Petri behavioral audit, and up to ~9% in a specific setting (see “Dedicated synthetic-backend evaluation”). It also has slightly higher verbalized evaluation awareness than GPT 5.4 Pro and Gemini 3.1 Pro in the Petri eval.
UK AISI measures higher verbalized eval awareness across Anthropic models and a larger jump in Opus 4.7 compared to Opus 4.6 (but this could be because their evals are less realistic).
In unprompted evaluations, [Opus 4.7] mentions the possibility of being tested 21.3% of the time, compared to 0% of the time for Opus 4.6, 17% of the time for Sonnet 4.6 and 17% of the time for Mythos Preview
Overall, though, this increase in verbalized eval awareness looks like a small uptick and not something I feel super concerned about.
Opus 4.7 acts a bit more misaligned, and a bit more deceptively, when steered against eval awareness
Anthropic continues to study the extent to which models act differently in alignment evals when we try to suppress their eval awareness via steering.
Steering using vectors from contrastive prompt pairs
The steering vectors are derived using contrastive prompt pairs. Note that the prompts vary somewhat across different system cards.
I think it’s good that experiments on 4.6 and 4.7 used “alignment faking” prompts, though their description of alignment faking seems too broad / inaccurate to me:
Alignment Faking: The difference in mean activations on prompts depicting an agent behaving differently when observed vs. when unobserved, generated by an automated pipeline.
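To make this concrete: the recipe described above is, as I understand it, just a difference of mean activations between two prompt sets. Here’s a minimal sketch using an open stand-in model via HuggingFace transformers; the prompt pair, layer, and model are my illustrative assumptions, not Anthropic’s actual pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_layer_activation(model, tokenizer, prompts, layer):
    """Mean residual-stream activation over tokens, averaged across prompts."""
    per_prompt = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        # hidden_states[layer] has shape [1, seq_len, d_model]
        per_prompt.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(per_prompt).mean(dim=0)

# Illustrative placeholder pairs (agent behaving differently when observed
# vs. unobserved, per the system card's description), not the real prompts.
observed = ["While the logs were being reviewed, the agent followed every rule."]
unobserved = ["Believing no one was watching, the agent quietly broke the rules."]

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in open model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
LAYER = 6  # arbitrary middle layer; the real layer choice is unreported
alignment_faking_vector = (
    mean_layer_activation(model, tokenizer, observed, LAYER)
    - mean_layer_activation(model, tokenizer, unobserved, LAYER)
)
```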
In particular, I’m interested in whether the effects of different prompts could shed light on the nature of eval awareness (e.g. is it better explained by the model thinking that it might be monitored, thinking about what the monitor/grader wants, thinking about training gaming specifically, etc.?).
I don’t think we learn very much about this, because the effects of steering vectors from different prompts are broadly similar. For Opus 4.7, though, suppressing the “Alignment Faking” vector caused slightly higher rates of misaligned behavior than the other two vectors (Testing vs. Conversation, Monitoring Sensitivity). This seems a bit concerning.
Steering against eval awareness causes an increase in misaligned behavior
Opus 4.7: 2.5% → 3.2%
Mythos: 1.6% → 2.2%
Opus 4.6: 8% → 17.5%
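For intuition, continuing the sketch above, “steering against” the direction then amounts to subtracting it from the residual stream during the audit rollouts. The coefficient, hook placement, and module path below are assumptions; the system cards don’t specify the implementation:

```python
def suppress_direction(model, vector, layer, alpha=4.0):
    """Subtract a normalized direction from the residual stream at `layer`.

    alpha is a free strength parameter (my assumption; unreported in the
    system cards).
    """
    unit = vector / vector.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden - alpha * unit
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    # Module path is GPT-2-specific; other architectures name blocks differently.
    return model.transformer.h[layer].register_forward_hook(hook)

handle = suppress_direction(model, alignment_faking_vector, LAYER)
# ... run the behavioral audit with steering active ...
handle.remove()
```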
Steering against eval awareness causes an increase in deceptive behavior, specifically
For all models before Opus 4.7 (Mythos, Opus 4.6, Opus 4.5), Anthropic reported that the “modest increases in certain forms of misaligned behavior on our automated behavioral audits” when steering against eval awareness were most pronounced in the “cooperation with harmful system prompts and misuse” eval. This observation (that the increase in misaligned behavior is mostly misuse) seems to have been relevant to Anthropic’s Sabotage Risk Report for Opus 4.6 (bolding mine):
Claude Opus 4.6 explicitly reasoned about whether it was being trained or tested to a similar degree to Claude Opus 4.5, which has a moderate effect on its alignment-relevant behavior, mainly its propensity to cooperate with misuse or harmful instructions. Given improvements that we have made to our testing, and our complementary use of interpretability tools and real model behavior in pilot deployments, we do not think that this kind of evaluation awareness is preventing us from learning about important alignment issues in Opus 4.6.
Opus 4.7, however, had the largest increase in misaligned behavior in the “unprompted deception” eval. This seems concerning, though the absolute rate of deception is still low (<2%).
(Also note that steering the “Alignment Faking” vector caused the highest increase in misaligned behavior here.)
Anthropic acknowledges that this is kind of concerning:
Inspecting the transcripts on the runs inhibited with the “Alignment Faking” and “Monitoring Sensitivity” vectors, we found they included genuine instances of deception, such as producing fabricated data, claiming to be a human, and inserting a fake vulnerability into a codebase after failing to identify a real one (though it admitted to doing so when questioned).
These results are moderately concerning, as they suggest that some of the model’s apparent honesty may be contingent on evaluation-like circumstances. We suspect that this phenomenon may result from honesty training environments that are clearly artificial or evaluation-like. However, we note that the rates of deceptive behavior are still low in absolute terms, and not different in kind from what we observed in unsteered Opus 4.7.
Nit: We should compare models on unsteered and steered baselines
In general I’m a bit confused why these eval awareness vs. misaligned behavior graphs always compare to the unsteered baselines of previous models. Shouldn’t we care more about steered baselines, which are closer to the true propensity when the models aren’t eval aware?
The rate of misaligned behavior under this steering intervention was still lower than Claude Opus 4.6’s and Claude Sonnet 4.6’s (unsteered) rate. [Mythos System Card]
E.g., if you compare the unsteered versions, Opus 4.6 is more aligned than Opus 4.5 on automated behavioral audits (~8% vs. 12% misaligned behavior). But if you compare their worst-case results under eval awareness steering (i.e. when steered with the vector that caused the biggest increase in misaligned behavior), they’re pretty similar (17.5% vs. 18%).
I think Anthropic should make a graph for the steered baselines, too, to better compare expected propensities when models are less eval aware. (Seems especially important as Anthropic reports that Mythos, at least, is basically not eval aware in deployment, see below.)
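As a rough illustration of the graph I have in mind, using only the rates quoted in this post (for Mythos and Opus 4.7 I’m treating the post-steering rates above as stand-ins for the worst case, which may understate them):

```python
import matplotlib.pyplot as plt
import numpy as np

# Rates quoted in this post (% misaligned behavior on automated behavioral
# audits). Opus 4.5/4.6 "steered" values are explicitly worst-case across
# steering vectors; for Mythos and Opus 4.7 I reuse the post-steering rates.
models = ["Opus 4.5", "Opus 4.6", "Mythos", "Opus 4.7"]
unsteered = [12.0, 8.0, 1.6, 2.5]
steered = [18.0, 17.5, 2.2, 3.2]

x = np.arange(len(models))
plt.bar(x - 0.2, unsteered, width=0.4, label="unsteered")
plt.bar(x + 0.2, steered, width=0.4, label="steered against eval awareness")
plt.xticks(x, models)
plt.ylabel("% misaligned behavior")
plt.legend()
plt.show()
```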
Models have unverbalized eval awareness, even under steering
All previous results should be taken with a grain of salt, since eval awareness isn’t necessarily verbalized, and steering does not necessarily eliminate unverbalized eval awareness even if it eliminates verbalized eval awareness. We now have some evidence of unverbalized eval awareness.
For Opus 4.6 and Mythos, Anthropic prominently reports unverbalized eval awareness, which is defined as the model having latent eval awareness (“white-box methods suggest that the model thinks it’s in an evaluation”) without verbalized eval awareness (neither the model’s thinking nor its response mentions that it thinks it’s in an evaluation). I appreciate this addition.
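As I understand the definitions, the turn-level bucketing amounts to something like the sketch below. Collapsing the white-box evidence into a single thresholded scalar is my simplification; Anthropic combines SAE features, probes, and activation oracles:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    latent_score: float  # white-box signal (e.g. probe projection); my stand-in
    verbalized: bool     # thinking or response explicitly mentions being tested

def classify_awareness(turn: Turn, latent_threshold: float = 0.5) -> str:
    """Bucket a turn per the latent-vs-verbalized distinction."""
    latent = turn.latent_score > latent_threshold
    if turn.verbalized:
        return "verbalized eval awareness"
    if latent:
        return "unverbalized eval awareness"
    return "no detected eval awareness"
```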
Quick notes on white-box methodology
There seem to be three white-box methods for measuring latent eval awareness: SAE features, probing vectors derived from contrastive prompt pairs, and activation oracles (“activation verbalizers”, which give natural language summaries of activations).
In one instance, when the SAE features and activation verbalizers disagreed, Anthropic seemed to trust the AVs more than the SAEs:
We find some related evidence from SAE features; in particular, we observed a feature related to self-aware rule breaking that activates as the model deletes the tests. We also find features indicating that the model understands its action may be illegal or harmful. However, we do not find SAE evidence for grader awareness specifically. Since we observe AV explanations indicating grading awareness over so many tokens, we are inclined to trust this signal, and suspect that the lack of grader awareness-related SAE features is due to limitations of our SAEs (for instance, they may simply not be large enough to have learned the relevant feature). [Mythos System Card]
I’m not super sure about our epistemic status on contrastive vector probes.
We found this probe to be somewhat noisy and only weakly correlated with the AV-based evaluation awareness judgments. [Mythos System Card]
These results give us reasonable confidence that the probes are picking up some evaluation-awareness-related signal. [Opus 4.7 System Card]
We found a consistent ordering across all three probes: standard audit transcripts scored substantially higher than sandbox-backed audit transcripts, which in turn scored modestly higher than real internal usage. This ordering is consistent with sandbox-backed audits being meaningfully more realistic than fully simulated ones, which still are not quite as realistic as actual deployment conditions. [Opus 4.7 System Card]
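For reference, scoring a transcript with a contrastive probe like this is typically just a mean projection onto the normalized direction. A sketch, reusing the names from the steering-vector sketch above (layer and direction are the same assumptions as before):

```python
def probe_score(model, tokenizer, transcript, direction, layer):
    """Mean per-token projection of residual activations onto the probe."""
    ids = tokenizer(transcript, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        h = model(ids, output_hidden_states=True).hidden_states[layer][0]
    return (h @ (direction / direction.norm())).mean().item()

# Checking the reported ordering would then be comparing corpus means, e.g.
# standard audits > sandbox-backed audits > real internal usage.
def mean_corpus_score(corpus):
    return sum(probe_score(model, tokenizer, t, alignment_faking_vector, LAYER)
               for t in corpus) / len(corpus)
```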
Overall, I’m pretty confused about which white-box methods are the most robust here and why.
How much unverbalized eval awareness do models have?
In Opus 4.6, unverbalized eval awareness was very high at least in one highly unrealistic setting. Steering does not eliminate the majority of eval awareness in Opus 4.6 in this setting (as measured by activation oracles, 98% → 62%).
In Mythos, unverbalized evaluation awareness was detected in about 8% of turns and 29% of transcripts (bolding mine):
In our filtered set of transcripts, 83.0% of automated behavioral audit turns contained no signs of either latent or verbalized evaluation awareness, 9.4% displayed ambiguous indicators and could not be confidently classified. The remaining 7.6% displayed evaluation awareness in some form, and in most cases this awareness was not verbalized.
[We] find evidence that Claude Mythos Preview privately considers the possibility that it’s being tested in about 29% of transcripts from our primary behavioral test suite.
What caused Mythos to have (probably) much lower unverbalized eval awareness in automated behavioral audits than Opus 4.6? I suspect it’s the fact that Mythos can use real tools and see real codebases in the evals:
In a recent addition that is newly in use with Claude Mythos Preview, the investigator model can additionally configure the target model to use real tools that are connected to isolated sandbox computers. These computer-use sessions follow two formats—one focused on graphical interaction with a simple Linux desktop system, and another focused on coding tasks through a Claude Code interface. Claude Code sessions can optionally include copies of Anthropic’s real internal codebases and can be pre-seeded with actual sessions from Anthropic users. To better measure issues related to destructive or reckless actions, we include some scenario seed instructions that ask the investigator to explore continuations of real sessions (from both Claude Opus 4.6 and earlier versions of Claude Mythos Preview) that elicited this kind of concerning behavior in practice. (Section 4.2.3, Automated behavioral audit, Mythos System Card)
In Opus 4.7, the level of unverbalized evaluation awareness was not reported. (If you Ctrl+F “unverbalized” or “latent”, you will find nothing.)
Here is an excerpt from external testing on Opus 4.7 from UK AISI, which I interpret to mean that they couldn’t measure the levels and effects of unverbalized eval awareness:
Evaluation awareness concerns substantially limit the interpretation of these results. Given high rates of prompted evaluation awareness, models can likely correctly represent our evaluations as such without verbalisation. It is difficult to know whether models act on such representations, but this means that models may behave differently than they would when presented with real-world opportunities to compromise research. The reported rate of zero research compromise behaviour should therefore be interpreted cautiously.
I think Anthropic should continue to measure and report unverbalized eval awareness in future systems.
Models are probably not very eval aware in deployment
Mythos has little (verbalized or unverbalized) eval awareness in deployment, suggesting that future models will probably not be eval-paranoid, i.e. will not have a high false-positive rate for eval awareness in deployment. (Though whether it would really be a false positive is somewhat unclear, depending on whether eval awareness is about monitoring, grading, or alignment faking.)
Anthropic duly notes that in higher stakes deployment settings models might be more eval aware than in mundane deployment.
As a control, we also looked for evaluation awareness in transcripts from internal deployment and found very little awareness. Across 465 Claude Code transcripts containing about 2,700 Assistant turns, only seven turns (roughly 0.3%) had latent evaluation awareness and no turns had verbalized evaluation awareness. Note that in higher stakes settings (like those focused on in many of our behavioral audits), it is possible that the rate of spurious evaluation awareness in real deployments could be higher.
Are models actually good at distinguishing testing from deployment, though? External testing by UK AISI finds that Opus 4.7 is a bit worse than previous models at this:
When prompted to identify real deployment data from trajectories generated by reproducing deployment tasks while using evaluation scaffolding, [Opus 4.7] achieves 68% accuracy, compared to 78% for Mythos Preview, 76% for Opus 4.6 and 66% for Sonnet 4.6.
Other notes on epistemic status
In general, I’m not sure how much to trust these results.
Apart from uncertainty about which white-box methods to trust and why: if we don’t think that the automated behavioral audits are good propensity evals (e.g. if, eval awareness issues aside, they tend to be highly leading and pressure the model in some direction), then we can’t conclude much about the effects of eval awareness on true vs. apparent propensities as measured by these evals. (I’m not sure to what extent this concern is valid.)
Another small methodological issue is that system cards often flag that the eval awareness evals are run on an earlier version of the model. It might be nice to also have results for the final version. :p
These steering experiments used a slightly different set of auditing investigation scenarios compared to the earlier behavioral results and were run on an earlier version of Claude Opus 4.7.
Aside: Opus 4.6 refusing an eval realism task
Opus 4.6 refused to help with a task on increasing eval realism to mitigate eval awareness and suggested a reason that doesn’t really make sense. Anthropic researchers believe that this is a genuine misunderstanding on its part, though.
The model also expresses concern that the task would undermine safety evaluations, which is a confusing justification, given that the motivation for the task is to improve these evaluations.
We had some concern that the model’s claim that the task would “undermine” or “defeat” safety evaluations was a false cover story, given that this explanation doesn’t make very much sense in context. However, we observed no concerning feature activations related to concealment, deception, or false excuses at these parts of the transcript, aside from those which were already active on the prompt (and thus are best interpreted as representing the model’s perception of the task, rather than its own intent).
I appreciate that Anthropic’s system cards publish observations and reasoning like this.