Shouldn’t we at least proceed until we can’t very confidently proceed safely?
I mean, I think the chance of AI ending up uncontrollably powerful is on the order of 1-3% for the next generation of models. That seems far, far too high. I think we are right now in a position where we can’t very confidently proceed safely.
Hmm, we probably disagree about the risk depending on what you mean by “uncontrollably powerful”, especially if the AI company didn’t have any particular reason to think the jump would be especially high (as is typically the case for new models).
I’d guess it’s hard for a model to be “uncontrollably powerful” (in the sense that we would be taking on a bunch of risk from a Plan A perspective) unless it is at least pretty close to being able to automate AI R&D, so this requires a pretty huge capabilities jump.
My guess of direct risk[1] from the next generation of models (as in, the next major release from Anthropic+xAI+OpenAI+GDM) would be around 0.3%, and I’d go about 3x lower if we were proceeding decently cautiously in a Plan A style scenario (e.g., if we had an indication the model might be much more powerful, we’d scale only a bit at a time).
How I get to 0.3%: My median for full AI R&D automation is 8.5 years out and there are maybe ~3 major model releases per year, so assuming a uniform distribution over releases we’d get a ~2% chance of going all the way to AI R&D automation this generation. Then, I cut by a factor of about 10 because this would be a much larger discontinuity than we’ve seen before, and by another factor of 2 because reaching that capability isn’t guaranteed to directly result in takeover. Then, I go back up a bunch due to model uncertainty and thinking that we might be especially likely to see a big advance around now.
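A rough back-of-the-envelope sketch of that arithmetic (my own reconstruction, treating the stated numbers as given; the exact cut factors are judgment calls, not precise inputs):

```python
# Back-of-the-envelope version of the ~0.3% estimate above,
# using only the numbers stated in the comment.
median_years_to_automation = 8.5   # median time until full AI R&D automation
releases_per_year = 3              # ~major model releases per year across these labs

# Uniform spread: 50% probability mass over ~25.5 releases -> ~2% per generation.
p_automation_this_gen = 0.5 / (median_years_to_automation * releases_per_year)

# Cut by ~10x (much larger discontinuity than seen before) and ~2x
# (automation doesn't directly guarantee takeover) -> ~0.1%.
p_direct_risk = p_automation_this_gen / 10 / 2

print(f"per-generation automation chance: {p_automation_this_gen:.1%}")  # ~2.0%
print(f"after cuts: {p_direct_risk:.2%}")  # ~0.10%
# The bump back up to ~0.3% for model uncertainty isn't a precise factor,
# so it's left as prose rather than a number here.
```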
Edit: TBC, I think it’s reasonable to describe the current state of affairs as “we can’t very confidently proceed safely” but I also think the view “we can very confidently proceed safely (e.g., takeover risk from next model generation is <0.025%) given being decently cautious” is pretty reasonable.
[1] By direct risk, I mean takeover itself and risk increases through mechanisms like self-exfiltration and rogue internal deployment, but not the AI sabotaging research it is supposed to be doing.