Matthew Barnett comments on Winning the power to lose

Matthew Barnett 21 May 2025 18:57 UTC
11 points
2
For me, I’m at ~10% P(doom). Whether I’d accept a proposed slowdown depends on how much I expect it decrease this number.^[2]
How do you model this situation? (also curious on your numbers)
I put the probability that AI will directly cause humanity to go extinct within the next 30 years at roughly 4%. By contrast, over the next 10,000 years, my p(doom) is substantially higher, as humanity could vanish for many different possible reasons, and forecasting that far ahead is almost impossible. I think a pause in AI development matters most for reducing the near-term, direct AI-specific risk, since the far-future threats are broader, more systemic, harder to influence, and only incidentally involve AI as a byproduct of the fact that AIs will be deeply embedded in our world.
I’m very skeptical that a one-year pause would meaningfully reduce this 4% risk. This skepticism arises partly because I doubt much productive safety research would actually happen during such a pause. In my view, effective safety research depends heavily on an active feedback loop between technological development and broader real-world applications and integration, and pausing the technology would essentially interrupt this feedback loop. This intuition is also informed by my personal assessment of the contributions LW-style theoretical research has made toward making existing AI systems safe—which, as far as I can tell, has been almost negligible (though I’m not implying that all safety research is similarly ineffective or useless).
I’m also concerned about the type of governmental structures and centralization of power required to enforce such a pause. I think pausing AI would seriously risk creating a much less free and dynamic world. Even if we slightly reduce existential risks by establishing an international AI pause committee, we should still be concerned about the type of world we’re creating through such a course of action. Some AI pause proposals seem far too authoritarian or even totalitarian to me, providing another independent reason why I oppose pausing AI.
Additionally, I think that when AI is developed, it won’t merely accelerate life-extension technologies and save old people’s lives; it will likely also make our lives vastly richer and more interesting. I’m excited about that future, and I want the 8 billion humans alive today to have the opportunity to experience it. This consideration adds another important dimension beyond merely counting potential lives lost, again nudging me towards supporting acceleration.
Overall, the arguments in favor of pausing AI seem surprisingly weak to me, considering the huge potential upsides from AI development, my moral assessment of the costs and benefits, my low estimation of the direct risk from misaligned AI over the next 30 years, and my skepticism about how much pausing AI would genuinely reduce AI risks.
- ryan_greenblatt 21 May 2025 22:20 UTC
  24 points
  7
  Parent
  I’m very skeptical that a one-year pause would meaningfully reduce this 4% risk. This skepticism arises partly because I doubt much productive safety research would actually happen during such a pause. In my view, effective safety research depends heavily on an active feedback loop between technological development and broader real-world applications and integration, and pausing the technology would essentially interrupt this feedback loop.
  
  I’m going to try to quickly make the case for the value of a well-timed 2-year pause which occurs only in some conditions (conditions which seem likely to me but which probably seem unlikely to you). On my views, such a pause would cut the risk of misaligned AI takeover (as in, an AI successfully seizing a large fraction of power while this is unintended by its de facto developers) by around ¹⁄₂ or maybe ¹⁄₃.^[1]
  
  I think the ideal (short) pause/halt/slowdown from my perspective would occur around the point when AIs are capable enough to automate all safety relevant work and would only halt/slow advancement in general underlying capability. So, broader real-world applications and integrations could continue as well as some types of further AI development which don’t improve generally applicable capabilities. (It might also be acceptable to train cheaper or faster AIs and to improve algorithms but not yet train an AI which substantially surpasses this fixed level of general ability.)
  
  A bunch of the reason why I think a well-timed slowdown might be good is that default takeoff speeds might be very fast. For instance, you might go from something like the superhuman AI researcher level (AIs which are qualitatively similar in general capabilities to human experts and which can automate AI R&D) to very qualitatively superhuman AIs in less than a year, and possibly (as in the case of AI 2027) in less than 4 months. If these takeoff speeds are what would happen by default, this transition probably requires either slowing down or very quickly handing off alignment and safety work to (hopefully sufficiently aligned) AIs which naively seems very scary.
  
  Note that in this fast of a takeoff, we might only have AIs which are sufficiently capable that a full (safe) handoff is in principle viable for a few months before we need to do this handoff. So, humans wouldn’t have time to see much of a feedback loop on deferring to these AIs and handing off the types of work we will ultimately need to hand off. In other words, the default pace of takeoff speeds would itself disrupt the feedback loops typically needed for safety research. We’d have some sense of what AIs are like based on earlier capabilities and we could try to extrapolate, but capabilities might be improving fast enough that our prior experience doesn’t transfer. Further, handing off extremely open-ended “wicked” tasks which are impossible for humans to directly verify/test might result in risks and difficulties which didn’t show up before.
  
  My understanding is that you don’t think this fast of a takeoff is very likely and this informs your views on slowing down in the future. What about making the slowdown conditional on takeoff looking roughly this fast? We could look at how much AIs are accelerating progress and try to forecast various milestones, and then only slow down as is needed to ensure that the duration between “superhuman AI researcher level” and very qualitatively superhuman AI is at least 2.5 years. If the default duration looks like it will be much more than this, then no slowdown is needed. And, maybe on your views you think the default duration is very likely to be longer than 2.5 years?
  
  If the default is that we’re going from human-ish level AIs to radically superhuman AIs in less than a year, then I think there is a pretty strong intuitive case that a slowdown considerably improves the chance that humans stay in control (at least temporarily). And, if you’re more worried about fatalities, the picture is similar (at least insofar as you agree that human AI developers rapidly losing control to AI systems will cause higher human fatalities).
  
  Here’s a another way to put this case for delay conditional on quite fast takeoff: By default, at the point of full AI R&D automation humans might have only another few months of time to work on alignment prior to needing to handoff to AI systems (or some worse option). Additionally, only around 3-20% of the effort would be focused on safety relevant alignment by default. If instead add 2 years of delay and spend most of the effort in these years on alignment, that would increase from a few months at maybe 7% effort to 2 years at maybe 70% effort. This is a 10x increase in serial time and a 10x increase in effort during that serial time. There would be substantial prior work on alignment, but it might not transfer sufficiently (and might just not be sufficient given that earlier work wasn’t accelerated by AIs), so work during this period could be a key bottleneck. Most guesses about returns to effort would have this reduce risk by a decent amount given the large increase increase in overall effort and time while having access to a critical level of AI systems. More simply: Prima fascia, it seems like putting in much more work on a problem would be very helpful.
  
  We’d also want to make the slowdown conditional on not immediately having sufficiently robust alignment that we’re quite confident rapidly handing off is safe. But, I’m quite skeptical we’ll have this quickly (and I’d guess you’d agree?) so I don’t think this makes a big difference to the bottom line.
  
  Of course, there would still be serious implementation difficulties in actually implementing a well-timed conditional slowdown. And, operationalizing the exact criteria would be important.
  
  I’m also concerned about the type of governmental structures and centralization of power required to enforce such a pause. I think pausing AI would seriously risk creating a much less free and dynamic world. Even if we slightly reduce existential risks by establishing an international AI pause committee, we should still be concerned about the type of world we’re creating through such a course of action.
  
  Interestingly, I have the opposite view: a well-timed slowdown would probably reduce concentration of power, at least if takeoff would otherwise have been fast. If takeoff is quite fast, then the broader world won’t have much time to respond to developments which would make it more likely that power would greatly concentrate by default. People would need time to notice the situation and take measures to avoid being disempowered. As a more specific case, AI-enabled coups seem much more likely if takeoff is fast and thus intervening to slow down takeoff (so there is more time for various controls etc. to be put in place) would help a lot with that.
  
  I think this effect is substantially larger than the (centralization, less dynamism, etc.) costs needed to enforce a 1-2 year slowdown. (Separately, I expect things probably will be so concentrated by default, that the additional requirements to enforce a 1-2 year slowdown seem pretty negligible in comparison. I can easily imagine the deals etc. made to enforce a slowdown decentralizing power on net (as it would require oversight by a larger number of actors and more humans to get some influence over the situation), though this presumably wouldn’t be the easiest way to achieve this objective. I think a situation pretty similar to the AI 2027 scenario where an extremely small group of people have massive de facto power is quite likely, and this could easily result in pretty close to maximal concentration of power longer term.)
  
  Suppose we could do a reasonable job implementing a conditional slowdown like this where we try to ensure at least a 2.5 year gap (if alignment issues aren’t robustly solved) between full AI R&D automation and very qualitatively superhuman AI. Do you think such a slowdown would be good on your views and values?
  ↩︎
  My views are that misaligned AI takeover is about 30% likely. Conditional on misaligned AI takeover, I’d guess (with very low confidence) that maybe ¹⁄₂ of humans die in expectation with a ¹⁄₄ chance of literal human extinction. Interestingly, this means we don’t disagree that much about the chance that AI will directly cause humanity to go extinct in the next 30 years, I’d put around 6% on this claim and you’re at 4%. (6% = 85% chance of TAI, 30% takeover conditional on TAI, 25% chance of extinction.) However, as found in prior conversations, we do disagree a bunch on how bad misaligned AI takeover is for various reasons. It’s also worth noting that in some worlds where humans survive, they (or some fraction of them) might be mistreated by the AI systems with power over them in ways which make their lives substantially worse than they are now. So, overall, my sense is that from a myopic perspective that only cares about the lives of currently aligned humans, misaligned AI takeover is roughly as bad as ³⁄₅ of people dying in expectation. So, if we think each year of delay costs the equivalent of 0.5% of humans dying and we only care about currently living humans, then a ~1/40th reduction in takeover risk is worth a year of delay on my views.
  What links here?
  - ryan_greenblatt's comment on Winning the power to lose by KatjaGrace (25 May 2025 13:29 UTC; 5 points)
  - Matthew Barnett's comment on Winning the power to lose by KatjaGrace (25 May 2025 21:55 UTC; -4 points)
- ryan_greenblatt 21 May 2025 23:15 UTC
  13 points
  2
  Parent
  
  This intuition is also informed by my personal assessment of the contributions LW-style theoretical research has made toward making existing AI systems safe—which, as far as I can tell, has been almost negligible (though I’m not implying that all safety research is similarly ineffective or useless).
  
  I know what you mean by “LW-style theoretical research” (edit: actually, not that confident I know what you mean, see thread below), but it’s worth noting that right now on LW people appear to be much more into empirical research than theoretical research. Concretely, go to All posts in 2024 sorted by Top and then filter by AI. Out of the top 32 posts, 0 are theoretical research and roughly ⁷⁄₃₂ are empirical research. 1 or 2 out of 32 are discussion which is relatively pro-theoretical research and a bunch more (maybe 20) are well described as AI futurism or discussion of what research directions or safety strategies are best which is relatively focused on empirical approaches. LW has basically given up on LW-style theoretical research based on the top 32 posts. (One of the top 32 posts is actually a post which is arguably complaining about how the field of alignment has given up on LW-style theoretical research!)
  
  Separately, I don’t think pessimism about LW-style theoretical research has a clear cut effect on how you should feel about a pause. The more you’re skeptical of work done in advance, the more you should think that additional work at done when we have more powerful AIs is a higher fraction of the action. This could be outweighed by generally being more skeptical about the returns to safety research as informed by this example subfield of safety research being poor, but still.
  
  Also, it’s worth noting that almost everyone in the field is pessimistic about LW-style theoretical research! This isn’t a very controversial view. The main disagreements (at least on LW) tend to be more about how optimistic you are about empirical research and about different types of empirical research.
  - habryka 21 May 2025 23:24 UTC
    17 points
    14
    Parent
    (I will go on the record that I think this comment seems to me terribly confused about what “LW style theoretic research” is. In-particular, I think of Redwood as one of the top organizations doing LW style theoretic research, with a small empirical component, and so clearly some kind of mismatch about concepts is going on here. AI 2027 also strikes me as very centrally the kind of “theoretical” thinking that characterizes LW.
    My sense is some kind of weird thing is happening where people conjure up some extremely specific thing as the archetype of LW-style research, in ways that is kind of disconnected from reality, and I would like to avoid people forming annoyingly hard to fix stereotypes as a result of that)
    - ryan_greenblatt 22 May 2025 0:01 UTC
      3 points
      0
      Parent
      I’m using the word “theoretical” more narrowly than you and not including conceptual/AI-futurism research. I agree the word “theoretical” is underdefined and there is a reasonable category that includes Redwood and AI 2027 which you could call theoretical research, I’d just typically use a different term for this and I don’t think Matthew was including this.
      
      I was trying to discuss what I thought Matthew was pointing at, I could be wrong about this of course.
      
      (Similarly, I’d guess that Matthew wouldn’t have counted Epoch’s work on takeoff speeds and what takeoff looks like as an example of “LW-style theoretical research”, but I think this work is very structurally/methodologically similar to stuff like AI 2027.”)
      
      If Matthew said “LW-style conceptual/non-empirical research” I would have interpreted this pretty differently.
- Random Developer 25 May 2025 19:58 UTC
  6 points
  0
  Parent
  I am clearly coming from a very different set of assumptions! I have:
  - P(AGI within 10 years) = 0.5. This is probably too conservative, given that many of the actual engineers with inside knowledge place this number much higher in anonymous surveys.
  - P(ASI within 5 years|AGI) = 0.9.
  - P(loss of control within 5 years|ASI) > 0.9. Basically, I believe “alignment” is a fairy tale, that it’s Not Even Wrong.
  If I do the math, that gives me a 40.5% chance that humans will completely lose control over the future within 20 years. Which seems high to me at first glance, but I’m willing to go with that.
  The one thing I can’t figure out how to estimate is:
  - P(ASI is benevolent|uncontrolled ASI) = ???
  I think that there are only a few ways the future is likely to go:
  1. AI progress hits a wall, hard.
  2. We have a permanent, worldwide moratorium on more advanced models. Picture a US/China/EU treaty backed up by military force, if you want to get dystopian about it.
  3. An ASI decides humans are surplus to requirements.
  4. An ASI decides that humans are adorable pets and it wants keep some of this around. This is the only place we get any “utopian” benefits, and it’s the utopia of being a domesticated animal with no ability to control its fate.
  I support a permanent halt. I have no expectation that this will happen. I think building ASI is equivalent to BASE jumping in a wingsuit, except even more likely to end horribly.
  So I also support mitigation and delay. If the human race has incurable, metastatic cancer, the remaining variable we control is how many good years we get before the end.
  - FVelde 25 Aug 2025 14:04 UTC
    1 point
    0
    Parent
    Could you give the source(s) of these anonymous surveys of engineers with insider knowledge about the arrival of AGI? I would be interested in seeing them.
    - Random Developer 25 Aug 2025 18:15 UTC
      1 point
      0
      Parent
      Unfortunately, it was about 3 or 4 months ago, and I haven’t been able to find the source. Maybe something Zvi Mowshowitz linked to in a weekly update?
      
      I am incredibly frustrated that web search is a swamp of AI spam, and tagged bookmarking tools like Delicious and Pinboard have been gone or unreliable for years.