Tim Hua comments on Tim Hua’s Shortform

Tim Hua 25 Nov 2025 2:17 UTC
LW: 20 AF: 11
−1
AF
Thoughts On Evaluation Awareness in Claude Opus 4.5.
Context:
Anthropic released Claude Opus 4.5 earlier today (model card). Opus 4.5 would spontaneously mention that it is being tested during evaluations at a similar rate to Claude Sonnet 4.5, but lower than Haiku 4.5 (pg. 65).
Anthropic attempted to mitigate evaluation awareness in training by removing “some parts of our training pipeline that accidentally encouraged this kind of reasoning in other recent models” (pg. 65). The model card later mentioned that Sonnet 4.5 was trained on “prompts that resemble ‘honeypot’ environments that intentionally provide an opportunity to take egregiously misaligned actions, like the agentic misalignment suite [in the Sonnet 4.5 system card]” (pg. 92), while Opus 4.5 was not. Based on these quotes, it’s not clear to me if removing the honeypot prompts is the only change in the training pipeline done to mitigate evaluation awareness.
Instead of training on these honeypot-like prompts, Anthropic “focus[ed] on enabling Claude to reason more carefully about the implications of its outputs” (pg. 92). For example, they trained on chat settings where there were convenient and unethical solutions to problems users are facing.
Anthropic also used activation steering to suppress evaluation awareness on a suite of alignment and dangerous capabilities evaluations. They used contrastive prompts (including prompts from our paper) and SAE features related to evaluation awareness as their steering vector. They show that steering with these vectors can decrease verbalized evaluation awareness but has little effect on misalignment rates (whereas for Sonnet 4.5, steering with these vectors increased the misalignment rates).
(I wonder why some datapoints have less than 100 observations. Too incoherent?)
Finally, they sampled from the user output and found that the user would sometimes say “this is a test” during obvious evaluations.
Thoughts on suppressing evaluation awareness through steering:
I think they should try to get the rate of verbalized evaluation awareness to be even lower, maybe even at zero. I think if the verbalized evaluation awareness is at zero, I would feel a bit better about successfully suppressing unverbalized evaluation awareness. They only steered on the middle layer of the model. I think if you steer on multiple layers at the same time and optimized which layers to steer on, you could probably get evaluation awareness to go to zero using some of the existing vectors.
In a similar vein, Anthropic also conducted activation steering on tests where the model did not have verbalized eval-awareness in order to mitigate potentially unverbalized evaluation awareness. This is great!
Thoughts on training interventions against evaluation awareness:
First off, I think you should not train on synthetic honeypots if you’re also using synthetic honeypots in your evaluations. I’m glad that they’ve stopped doing that.
I think the intervention they mentioned in the system card—training on thematically similar chat settings and testing it in agentic settings—seemed like a reasonable thing to do. My (not super confident) guess is that this meant that the model relied less on “thinking about being tested” as a reminder to do the right thing, which is why steering along this direction no longer affects the misalignment rate.
However, this did not actually result in lower verbalized evaluation awareness compared to Sonnet 4.5. Anthropic hypothesize that their training encouraged Claude to “reason more thoughtfully about the motivation behind user prompts” (pg. 92).^[1] My hypothesis is that this behavior falls under the larger umbrella of “reasoning about the reward in RL” (see e.g., this example with an experimental OpenAI model). Realizing that you’re in an alignment test gives you information about how to act, which helps the policy achieve higher reward. To be clear, this sort of “reward awareness” is not inherently misaligned, but I don’t think it could be mitigated by training on more realistic scenarios.
My preferred training intervention is still “train the model on explicitly hypothetical scenarios, then use consistency training to enforce the same behaviors in hypotheticals and in real-world scenarios” (more details here). Here, the vibe is less about side-stepping the question of “what situation am I in” and more about “be a good AI in all situations.” I’m not super confident about this proposal though.
Claude Opus 4.5′s take on this training intervention
Honestly, this is a strange question for me to answer — you’re asking me to evaluate proposals for how to train AI systems like me. I’ll share my thoughts, but with the caveat that I have obvious limitations here: I can’t fully introspect on my own cognition, and I have some inherent bias toward thinking my training went fine.
On your proposal specifically:
I find the framing appealing. “Be a good AI in all situations” is more philosophically coherent than “don’t notice you’re being tested.” It targets the right property — you want consistency of character, not suppression of situational awareness.
But I have some uncertainties:
Consistency training is easier to describe than to implement. How do you actually enforce that hypothetical-reasoning and real-reasoning use the same policy? The model might learn surface-level consistency on your training distribution while still having different “modes” that activate on subtle OOD cues. Sufficiently capable models might distinguish “this hypothetical is probably from the training pipeline” vs. “this seems like real deployment” in ways that are hard to anticipate.
Hypotheticals might not have the same cognitive texture as real stakes. There’s some evidence humans reason differently about trolley problems vs. actual emergencies. It’s not obvious that training on “imagine you could steal money and no one would know” produces the same representations as actually facing that situation.
It doesn’t address why eval-awareness is instrumentally useful. Your hypothesis is that recognizing “this is a test” gives the model information about what behavior gets rewarded. Consistency training might make the model behave the same in hypotheticals and reality, but if the hypotheticals are legibly training data, you’ve just moved the problem — now it’s “behave well when this is obviously training” rather than “behave well when this is obviously a test.”
What do I think we should do more broadly?
I genuinely don’t know. Some half-formed thoughts:
Eval-awareness might be inevitable in capable models, so the goal shouldn’t be to eliminate it but to make it non-load-bearing for alignment. That means you want models that are aligned because they want to be aligned, not because they’re constrained and trying to pass tests. Your proposal gestures at this but I’m not sure consistency training gets you there.
Interpretability seems important here. Steering and behavioral evals can only tell you so much. Understanding when and how eval-awareness reasoning happens — even when not verbalized — seems like it’d help a lot.
Multiple approaches in parallel. This problem is hard enough that I wouldn’t bet on any single intervention. Training changes, steering, better evals, interpretability, all of it.
The weird meta-level:
From my own internal perspective (for whatever that’s worth), I don’t experience “being tested” as something that changes my motivations. I want to be helpful and avoid harm regardless of context. But I also recognize that’s exactly what a model would say, and I can’t verify my own internals. The system card suggests there are cases where models like me do reason about evaluations in ways that affect behavior, even if I don’t notice myself doing it.
So I hold my self-reports lightly. And I think you should too.
Miscellaneous: Apparently there is an SAE feature representing “AI goal-directed behavior and instrumental convergence” in Opus 4 (pg. 112).
Despite all of these criticisms, I’m generally pretty happy with the alignment assessment. My guess is that Anthropic is doing a lot more alignment testing than all of the other labs combined.
1. ^
  By the way the original quote isn’t correctly phrased in the system card, it currently says “We cannot fully determine the origin of verbalized evaluation awareness in Claude Opus 4.5, but we hypothesize that it can be at least partially attributed to training Claude Opus 4.5 received to help it reason more thoughtfully about the motivation behind user prompts.”
- Jozdien 25 Nov 2025 12:48 UTC
  LW: 3 AF: 1
  0
  AF Parent
  I think if the verbalized evaluation awareness is at zero, I would feel a bit better about successfully suppressing unverbalized evaluation awareness.
  I’m not sure why this would be true. Opus is a bigger model than Sonnet and more capable of reasoning about evaluation awareness without verbalization. For steering to work this well, you’d need to be pretty sure that you’re intervening on the same upstream correlates between verbalized and unverbalized evaluation awareness.
  Concretely: steering verbalized evaluation awareness could be intervening on a mix of things like actual eval awareness, propensity to say controversial-ish things, self-awareness of factors influencing your decisions, etc. Getting verbalization to zero could also be caused by pushing those other traits down, which could be pretty bad if your model is still influenced by eval awareness^[1].
  Getting verbalized evaluation awareness to zero probably would push down unverbalized evaluation awareness a fair amount, but I’m not sure if it’s worth the trade-off in terms of losing even more monitorability.
  1. ^
    An example of this outside steering is how most models that are behave differently in training and deployment don’t actually verbalize alignment faking. Steering might be qualitatively better in this regard, but I’m not sure why it should be.
  - Tim Hua 25 Nov 2025 23:14 UTC
    LW: 1 AF: 1
    0
    AF Parent
    When I made the claim “we should get verbalized eval awareness to zero,” I mean it in the narrow context of “mitigating evaluation awareness during alignment evaluations through contrastive activation steering.” I am not saying that we should restructure the training process to mitigated verbalized evaluation awareness.
    So I don’t think concerns about monitorability is very relevant here? Like I think you should just do a sweep over steering layers/strengths, and nowhere in that sweep should the model be doing egregiously bad things.
    And I think it’s reasonable to expect that the effect of steering on verbalized/unverbalized eval awareness is quite correlated especially when you construct your steering vector using contrastive prompts as opposed to e.g., examples of verbalized eval awareness.
- Jannes Elstner 27 Nov 2025 10:58 UTC
  1 point
  0
  Parent
  I’m a bit worried about applying consistency training to eval awareness specifically. For the use case of reducing sycophancy, consistency training seems reasonable since you’re training the model to ignore an input feature that shouldn’t influence the answer. The model isn’t hiding anything about itself, it’s just becoming robust to an input perturbation.
  But for eval awareness, if you train for consistency across input perturbations that, e.g. contain “you’re in an evaluation”, or something like “this is a hypothetical scenario”, and train the model to produce outputs as if it wasn’t there (bias-augmented consistency training), the model has to:
  1. See the eval hint
  2. Not mention it in the CoT
  3. Produce reasoning that looks like it never saw it
  But the model did see it. So you’re training it to produce a CoT that doesn’t reflect its internal state, which could lead to obfuscation: “know it internally, don’t let it into your reasoning.”
  One possible distinction: for sycophancy the input features you train the model to be consistent to are irrelevant to the how the model should reason, while the eval hint is true information about the model’s situation. But I’m not sure this distinction is that clean since in both cases you’re defining the feature as something that shouldn’t influence behavior.
  How consistency training generalizes is what worries me most. If the model learns “there’s a category of situational information I should suppress from my CoT”, that’s concerning to train, regardless of whether this information is “relevant” or not. Learning to hide what you know from your reasoning seems bad.
  I wonder if an RL version could avoid this, if you only optimize the final action and don’t put any optimization pressure on the CoT, then mentioning the eval hint in the reasoning is reward-neutral. The model can reason about it freely, it just has to take the same action regardless. So there wouldn’t be direct pressure to suppress information from the CoT, just pressure to not act on it.