(What are the best alternatives? Curious for answers to that question.)
Don’t build AGI for a long time. Probably make smarter humans. Build better coordination technology so that you can be very careful with scaling up. Do an enormous amount of mechinterp in the decades you have.
I was talking specifically about the alignment plan; the governance stuff is cheating. (I wasn’t proposing a comprehensive plan, just a technical alignment plan.) Like, obviously buying more time to do research and be cautious would be great, and I recommend doing that.
I do consider “do loads of mechinterp” to be an answer to my question, I guess, but arguably the mechinterp-centric plan looks pretty similar. Maybe it replaces step 3 with retargeting the search?
Hmm, I guess you mean “an alignment plan conditional on no governance plan”? Which is IMO a kind of weird concept. Your “safety plan” for a bridge or a power plant should be a plan that makes it likely the bridge doesn’t fall down and the power plant doesn’t explode, not a plan that somehow desperately tries to attach last-minute support to a bridge that is very likely to collapse.
Like, I think our “alignment plan” should be a plan that has a real chance of solving the problem, which I think the above doesn’t (which is why I’ve long advocated for calling the AI 2027 slowdown ending the “lucky” ending, because really most of what happens in it is that you get extremely lucky and reality turns out to be very unrealistically convenient).
I don’t think the above plan is a reasonable plan in terms of risk tradeoffs, and in the context of this discussion I think we should mostly be like “yeah, I think you don’t really have a shot at solving the problem if you just keep pushing on the gas. I mean, maybe you get terribly lucky, but I don’t think it makes sense to describe that as a ‘plan’”.
Like, approximately anyone “working on the above plan” would, by my guess, have 20x-30x more impact if they instead focused their efforts on a mixture of working on a plan that works without extreme luck and pushing timelines back to make it more likely that you have the time to implement such a plan.
(Of course I expect some people to disagree with me here, maybe including you, which is fine; I’m just trying to explain my perspective.)
I do wish we had called it the Lucky ending.
I think you might be shadowboxing with me right now. I didn’t clarify what I meant by “I still think this is overall a decent high-level plan, possibly the best plan. (What are the best alternatives? Curious for answers to that question.)”, hence your interpretation, but to belatedly clarify: I was trying to compare it to other technical alignment plans rather than to comprehensive plans that include governance components. I did not mean to imply that I think the best thing for people on the margin to work on right now is fleshing out and implementing this plan.