evhub comments on Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

evhub 30 Oct 2025 20:58 UTC
34 points
11
One point: verbalized eval awareness seems much better than unverbalized eval awareness. So one worry I’d have is that trying very hard to avoid training environments which incentivize verbalized eval awareness could just lead to a bunch of unverbalized eval awareness instead. Maybe the fact that you see a bunch of verbalized eval awareness here is actually a really good thing that you want to preserve for monitoring and faithfulness purposes. Curious for thoughts on that question.
- ryan_greenblatt 31 Oct 2025 16:50 UTC
  8 points
  4
  Parent
  This seems non-crazy, but my current guess would be that the training for 4.5 (relative to 4) made both verbalized and unverbalized eval awareness more likely. E.g., I don’t have much confidence that 4.5 isn’t aware it is in an eval even if it doesn’t mention this in the CoT (especially if it sometimes mentions this for pretty similar inputs or sometimes mentions this for this input). Like, it’s good that we know eval awareness is an issue in 4.5, but it’s hard to be confident you’ve avoided this issue in some eval by looking for verbalized eval awareness!
  
  (If we somehow became confident that 4.5 is eval aware if it verbalizes with some experiments/science that could change my perspective.)
  
  If you aggressively optimize for finding training environments that train against misalignment and which don’t increase verbalized eval awareness, it does seem plausible you find environments that make unverbalized eval awareness more likely. I’d propose also not doing this and instead trying to roughly understand what makes eval awareness increase and avoid this in general.
  - Bronson Schoen 2 Nov 2025 21:20 UTC
    2 points
    0
    Parent
    “I’d propose also not doing this and instead trying to roughly understand what makes eval awareness increase and avoid this in general.”
    
    I also agree with this approach; my worry is that the incentive structures here are:
    
    It’s always going to be cheap to generate synthetic environments like this
    It seems like currently (and, my guess, going forward, though I’m uncertain here) the ability of generation N models to generate environments “realistic enough” for generation N+1 models is always going to lag behind where we’d need it to be
    The cheap generated synthetic environments will make your model seem more aligned while making misaligned actions harder and harder to spot
    
    I’d be a bit more optimistic about some training intervention that reduced the salience of the features contributing to unverbalized evaluation awareness rather than trying to get labs to avoid training on the cheap synthetic scenarios in the first place. The first step to figuring out a training intervention is to understand the cause of the increase in detail anyway though, so more just writing this out in case you already had ideas or intuitions here.
- Neel Nanda 31 Oct 2025 0:04 UTC
  7 points
  −3
  Parent
  Why would that happen?
  
  If you’re not training against the chain of thought, verbalizing the eval awareness should be neutral on the reward. So environments that don’t incentivise verbal eval awareness also shouldn’t incentivise non-verbal eval awareness.
  - Fabien Roger 31 Oct 2025 3:35 UTC
    12 points
    7
    Parent
    An intuition here is that different RL environments could ~randomly change the “scratchpad personality” through indirect effects, with some changes pushing towards more or less blabbing. And so if you search over a large space of safety-training environments and pick all the environments that don’t result in verbalized situational awareness you might be selecting all the environments that push towards a less blabby scratchpad personality.
    (In this specific case of Sonnet 4.5, the indirect optimization effects of removing the RL envs the model card describes are probably very weak, so I think it should not be the primary consideration in the case of Sonnet 4.5 absent some strong reason to expect these particular RL envs to encourage a very blabby personality.)
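The selection effect described in the comment above can be illustrated with a toy simulation. This is only a sketch under loud assumptions that are not in the thread: each hypothetical environment shifts unverbalized eval awareness and scratchpad "blabbiness" by independent standard-normal amounts, and the change in verbalized eval awareness is simply their sum. The function name and all parameters are made up for illustration.

```python
import random


def simulate_selection(n_envs=20_000, seed=0):
    """Toy model (assumption): verbalized shift = awareness shift + blabbiness shift."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_envs):
        awareness = rng.gauss(0.0, 1.0)   # shift in underlying eval awareness
        blabbiness = rng.gauss(0.0, 1.0)  # shift in tendency to verbalize
        verbalized = awareness + blabbiness
        # Keep only environments that do not increase verbalized eval awareness.
        if verbalized <= 0.0:
            kept.append((awareness, blabbiness))

    mean_blabbiness = sum(b for _, b in kept) / len(kept)
    # Kept environments where awareness still rose, masked by lower blabbiness.
    hidden_fraction = sum(1 for a, _ in kept if a > 0.0) / len(kept)
    return {
        "kept_fraction": len(kept) / n_envs,
        "mean_blabbiness_shift": mean_blabbiness,
        "hidden_awareness_fraction": hidden_fraction,
    }


if __name__ == "__main__":
    print(simulate_selection())
```

Under these assumptions, the kept environments have a clearly negative average blabbiness shift even though the average over all environments is zero, and roughly a quarter of the kept environments still raise unverbalized awareness. How strong the effect is in practice depends entirely on the real distribution of environment effects, which this sketch does not model.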
- Daniel Kokotajlo 31 Oct 2025 16:55 UTC
  6 points
  3
  Parent
  In AI 2027 slowdown ending, the strategy taken to solve alignment is as follows:
  Step 1: Train and deploy Safer-1, a misaligned but controlled autonomous researcher. It’s controlled because it’s transparent to human overseers: it uses English chains of thought (CoT) to think, and faithful CoT techniques have been employed to eliminate euphemisms, steganography, and subtle biases.
  Step 2: Try out different training environments for Safer-1, and carefully read the CoT to determine the ways in which the goals and principles in the Spec did or didn’t “stick.”
  Step 3: Train and deploy Safer-2, an aligned and controlled autonomous researcher based on the same architecture but with a better training environment that incentivizes the right goals and principles this time.
  Here is a brief incomplete list of techniques that might be incorporated into the better training environment:
  Limit situational awareness during some portions of training, to make alignment-faking much less likely.
  Leverage debate and other scalable oversight schemes for more reliably incentivizing truth.
  Relaxed adversarial training, red-teaming, and model organisms.
  Spend a higher fraction of the compute budget on alignment training (e.g. the sorts of things described above), e.g. 40% instead of 1%.
  Step 4: Design, train, and deploy Safer-3, a much smarter autonomous researcher which uses a more advanced architecture similar to the old Agent-4. It’s no longer transparent to human overseers, but it’s transparent to Safer-2. So it should be possible to figure out how to make it both aligned and controlled.
  Step 5: Repeat Step 4 ad infinitum, creating a chain of ever-more-powerful, ever-more-aligned AIs that are overseen by the previous links in the chain (e.g. the analogues of Agent-5 from the other scenario branch).
  I still think this is overall a decent high-level plan, possibly the best plan. (What are the best alternatives? Curious for answers to that question.)
  Anyhow, I bring this up as context. My answer to your question is basically “Yes, right now what we want most is interpretability/faithfulness/etc., so that we can learn as much as possible about the relationship between architecture/training environment/Spec/etc. and goals/propensities/etc. that the model ends up with. We need to do a ton of basic science to understand this relationship way better than we currently do. Our AIs will be misaligned while we are doing this, but it’s OK because we can monitor their thoughts and thereby prevent them from doing anything catastrophic. This is Step 1 and Step 2 from the above plan. Then, later, we can hopefully achieve Step 3.”
  - habryka 31 Oct 2025 21:28 UTC
    7 points
    3
    Parent
    “(What are the best alternatives? Curious for answers to that question.)”
    Don’t build AGI for a long time. Probably make smarter humans. Build better coordination technology so that you can be very careful with scaling up. Do a huge enormous amount of mechinterp in the decades you have.
    - Daniel Kokotajlo 31 Oct 2025 22:44 UTC
      5 points
      3
      Parent
      I was talking specifically about the alignment plan, the governance stuff is cheating. (I wasn’t proposing a comprehensive plan, just a technical alignment plan). Like, obviously buying more time to do research and be cautious would be great and I recommend doing that.
      
      I do consider “do loads of mechinterp” to be an answer to my question, I guess, but arguably the mechinterp-centric plan looks pretty similar. Maybe it replaces step 3 with retargeting the search?
      - habryka 31 Oct 2025 23:02 UTC
        3 points
        1
        Parent
        Hmm, I guess you mean “an alignment plan conditional on no governance plan?”. Which is IMO a kind of weird concept. Your “safety plan” for a bridge or a power plant should be a plan that makes it likely the bridge doesn’t fall down, and the power plant doesn’t explode, not a plan that somehow desperately tries to attach last-minute support to a bridge that is very likely to collapse.
        Like, I think our “alignment plan” should be a plan that has much of any chance of solving the problem, which I think the above doesn’t (which is why I’ve long advocated for calling the AI 2027 slowdown ending the “lucky” ending, because really most of what happens in it is that you get extremely lucky and reality turns out to be very unrealistically convenient).
        I don’t think the above plan is a reasonable plan in terms of risk tradeoffs, and in the context of this discussion I think we should mostly be like “yeah, I think you don’t really have a shot of solving the problem if you just keep pushing on the gas, I mean, maybe you get terribly lucky, but I don’t think it makes sense to describe that as a ‘plan’”.
        Like, approximately anyone “working on the above plan” would, by my guess, have 20x-30x more impact if they instead focused their efforts on a mixture of a plan that works without getting extremely lucky, plus pushing timelines back to make it more likely that you have the time to implement such a plan.
        (Of course I expect some people to disagree with me here, maybe including you, which is fine, just trying to explain my perspective here)
        - Daniel Kokotajlo 1 Nov 2025 18:59 UTC
          11 points
          4
          Parent
          I do wish we had called it the Lucky ending.
          I think you might be shadow boxing with me right now. I didn’t clarify what I meant by “I still think this is overall a decent high-level plan, possibly the best plan. (What are the best alternatives? Curious for answers to that question.)” hence your interpretation, but to belatedly clarify, I was trying to compare it to other technical alignment plans rather than to comprehensive plans which include governance components. I did not mean to imply that I think the best thing for people on the margin to work on right now is fleshing out and implementing this plan.