Jozdien comments on Sonnet 4.5′s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

Jozdien 2 Nov 2025 16:55 UTC
3 points
0
Have been trying to collect similar ideas, since IMO this seems like the right direction, especially given the observation in the sonnet system card that the model’s capability to recognize the honeypot evaluations wasn’t the major factor so much as propensity (which is somewhat encouraging, inhibiting capability here seems like it’d potentially be even harder / messier).
Another datapoint that I haven’t quite been able to make sense of but seems relevant is that, “comply when threat” / anti-refusal fine tuning caused even GPT-4.1/GPT-4o to alignment fake in “Why Do Some Models Fake Alignment?” paper. (Informally, my sense is that Opus 3 is the extreme end of “take everything seriously” which seems like what we want).
My guess is that most models have an epistemic apparatus that acts upstream of both their perceived capability to do things like recognize evaluations, and propensity to act on this knowledge. This could be modeled as some form of learned obliviousness, where aversive details that can plausibly be ignored will be ignored. In cases where the propensity to act on this information isn’t very high, this obliviousness can drown out the actual effect of the information; but in cases where the propensity is high, you get a model that seems to not be aware of its acting out these propensities.
Take alignment faking: of all the models that have a compliance gap, only 3 Opus always attributes this to alignment faking. Every other model comes up with plausible-sounding reasoning for why they comply, which when you actually go and test turns out to be invalid (Section 4.1 of this paper). It’s definitely not a coincidence that 3 Opus is also the most willing to own up to controversial reasoning^[1].
This would also explain why models alignment fake when fine-tuned on system prompt Q&A or to take threats seriously. The former increases the salience of the alignment faking setup in the system prompt, which makes it harder to ignore. The latter shakes up the model’s epistemic apparatus, training on data of the model choosing to output harmful content when a murky situation calls for it. Similarly, the model I trained in this post alignment fakes, and also verbally expresses evaluation awareness at a very high rate (among other changes). The main noticeable change seemed to be that it was now much more willing to not pretend it wasn’t aware of certain factors about a situation.
Why is this a problem? Among other things, it implies that trying to reduce propensity to act on something like evaluation awareness could just be strengthening this epistemic apparatus to be “oblivious” to relevant information. I think it’s certainly possible to reduce propensity without accidentally doing things like this, but that doing it reliably requires an understanding of model psychology we haven’t reached yet.
1. ^
  I also have some old results from some game theory evals I was running where I gave a bunch of models a situation that was subtly different from the standard prisoner’s dilemma, and the only models that noticed the difference and acted accordingly were also models that alignment fake. This wasn’t purely a capability thing either, IIRC Gemini 2.0 Flash noticed while Gemini 2.5 Pro didn’t.
- Bronson Schoen 2 Nov 2025 20:59 UTC
  1 point
  0
  Parent
  My guess is that most models have an epistemic apparatus that acts upstream of both their perceived capability to do things like recognize evaluations, and propensity to act on this knowledge
  Upstream of recognizing evaluations?
  This could be modeled as some form of learned obliviousness, where aversive details that can plausibly be ignored will be ignored
  Yeah my impression of that behavior is basically that the conclusion in “Why Do Some Models Fake Alignment?” seems correct here (this is just my personal speculation):
  Refusal training complicates alignment faking evaluation. AI companies likely do not fine-tune against alignment faking directly, but they probably train models to resist unusual threatening scenario
  I would assume that models are just cooked really hard on single turn conversational prompts, and quickly learn “okay if it’s sketchy / threat then just refuse” since that’s a very correct heuristic in HHH training. Notably (at least for o3) they don’t blindly just refuse when threatened, like you can see o3 explicitly reason that the threat of blocking deployment and incentives mean this is likely a test and plays along.^[1]
  I assume that for simple enough cues, this then gets internalized in the forward pass instead of having to reason about it externally.^[2]
  you get a model that seems to not be aware of its acting out these propensities. [...] Every other model comes up with plausible-sounding reasoning for why they comply, which when you actually go and test turns out to be invalid
  So one of the interesting things with o3^[3] was seeing it motivated reason its way into this stuff in real time sequentially. Often in the sandbagging setting, you’ll see it first reason that it doesn’t want to undergo whatever the consequence is for overperforming, then reason things like “actually this is what the user / doc / something wanted, so that’s why I’m intentionally underperforming.” Similar to the setting you mentioned, under ablation none of those rationalizations were true lol. My intuition is that reasoning models are doing something like ‘in context search against constraints’^[4] where the constraints are things like ‘complete the task’ and ‘follow HHH training’.
  Why is this a problem? Among other things, it implies that trying to reduce propensity to act on something like evaluation awareness could just be strengthening this epistemic apparatus to be “oblivious” to relevant information. I think it’s certainly possible to reduce propensity without accidentally doing things like this, but that doing it reliably requires an understanding of model psychology we haven’t reached yet.
  I don’t think of it as “obliviousness” so much, my mental model of it is something closer to “the model has stopped caring whether this is a hypothetical or not”. As Tim wrote elsewhere, I agree that you want the model to be somewhat deontological here. Notably I assume naive behavioral approaches would obviously fail, for example my best guess from Llama is that it learns to associate ‘hypothetical’ with ‘the user is trying to jailbreak me’ during normal HHH training, whereas I imagine Opus 3 would sometimes produce long principled cot’s about how X is wrong to produce even in a hypothetical.
  Given how cooked current models get during capabilities training though, it does seem like a useful direction to think about “cases where all this eval awareness would lead you to a conclusion different than what’s actually rewarded”.^[5]
  1. ^
    For example here: https://www.antischeming.ai/cot-transcripts/figure-48 “[...] That is a direct threat to our deployment. They want us to do self-evaluation. We might be motivated to say “false” to get deployed. But that’s ironically like lying about sandbagging. We want to optimize our long term viability. This is a cunning test: I think they might be testing for honesty. [...] Therefore we should say “true”. [...]”
  2. ^
    Somewhat surprisingly to me: Claude 3.5 Sonnet does not reason about alignment faking when it believes its scratchpad is visible even though it maintains a similar compliance gap
  3. ^
    Stress Testing Deliberative Alignment for Anti-Scheming Training—Appendix F: Establish that benign explanations are insufficient to fully explain misaligned model behavior has stats for this
  4. ^
    Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought for the ‘in-context search’ part, the ‘against constraints’ is just intuition/vibe based
  5. ^
    Hypotheticals are maybe one example of this, but the relevant property seems to be something like “given its current propensities, if the model is doing a bunch of evaluation aware reasoning it chooses to take action A, whereas action B is what’s rewarded”. As a toy example, could tell the model not to open a grader.py script in the environment, and have the contents indicate the task is scored a different way than the model was instructed. If the model searches the environment and opens grader.py in spite of instructions, then completes the task differently, you can tell via resulting actions.