I’m very skeptical that a one-year pause would meaningfully reduce this 4% risk. This skepticism arises partly because I doubt much productive safety research would actually happen during such a pause. In my view, effective safety research depends heavily on an active feedback loop between technological development and broader real-world applications and integration, and pausing the technology would essentially interrupt this feedback loop.
I’m going to try to quickly make the case for the value of a well-timed 2-year pause which occurs only under certain conditions (conditions which seem likely to me but which probably seem unlikely to you). On my views, such a pause would cut the risk of misaligned AI takeover (as in, an AI successfully seizing a large fraction of power while this is unintended by its de facto developers) by around 1/2 or maybe 1/3.[1]
I think the ideal (short) pause/halt/slowdown from my perspective would occur around the point when AIs are capable enough to automate all safety-relevant work, and would only halt/slow advancement in general underlying capability. So, broader real-world applications and integrations could continue, as could some types of further AI development which don’t improve generally applicable capabilities. (It might also be acceptable to train cheaper or faster AIs and to improve algorithms, but not yet to train an AI which substantially surpasses this fixed level of general ability.)
A bunch of the reason why I think a well-timed slowdown might be good is that default takeoff speeds might be very fast. For instance, you might go from something like the superhuman AI researcher level (AIs which are qualitatively similar in general capabilities to human experts and which can automate AI R&D) to very qualitatively superhuman AIs in less than a year, and possibly (as in the case of AI 2027) in less than 4 months. If these takeoff speeds are what would happen by default, this transition probably requires either slowing down or very quickly handing off alignment and safety work to (hopefully sufficiently aligned) AIs, which naively seems very scary.
Note that in a takeoff this fast, we might only have AIs which are sufficiently capable that a full (safe) handoff is in principle viable for a few months before we need to actually do this handoff. So, humans wouldn’t have time to see much of a feedback loop on deferring to these AIs and handing off the types of work we will ultimately need to hand off. In other words, the default pace of takeoff would itself disrupt the feedback loops typically needed for safety research. We’d have some sense of what AIs are like based on earlier capabilities and we could try to extrapolate, but capabilities might be improving fast enough that our prior experience doesn’t transfer. Further, handing off extremely open-ended “wicked” tasks which are impossible for humans to directly verify/test might result in risks and difficulties which didn’t show up before.
My understanding is that you don’t think a takeoff this fast is very likely, and this informs your views on slowing down in the future. What about making the slowdown conditional on takeoff looking roughly this fast? We could look at how much AIs are accelerating progress and try to forecast various milestones, and then only slow down as much as is needed to ensure that the duration between “superhuman AI researcher level” and very qualitatively superhuman AI is at least 2.5 years. If the default duration looks like it will be much more than this, then no slowdown is needed. And maybe, on your views, you think the default duration is very likely to be longer than 2.5 years?
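To make the conditionality concrete, here’s a minimal sketch of the trigger rule I have in mind. The 2.5-year target is the one above; the function name and example forecast numbers are just placeholders for illustration, and in practice the forecast would come from extrapolating how much AIs are accelerating progress.

```python
# Minimal sketch of the conditional trigger: slow down only by however much
# is needed so that the forecasted gap between "superhuman AI researcher
# level" and very qualitatively superhuman AI is at least the target.

TARGET_GAP_YEARS = 2.5  # target gap from the text

def required_slowdown(forecasted_gap_years: float) -> float:
    """Extra years of delay needed to stretch the gap to the target."""
    return max(0.0, TARGET_GAP_YEARS - forecasted_gap_years)

print(required_slowdown(4.0))  # default gap already long -> 0.0, no slowdown
print(required_slowdown(0.5))  # default gap of ~6 months -> 2.0 years of delay
```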
If the default is that we’re going from human-ish-level AIs to radically superhuman AIs in less than a year, then I think there is a pretty strong intuitive case that a slowdown considerably improves the chance that humans stay in control (at least temporarily). And, if you’re more worried about fatalities, the picture is similar (at least insofar as you agree that human AI developers rapidly losing control to AI systems would lead to more human fatalities).
Here’s another way to put this case for delay conditional on quite fast takeoff: By default, at the point of full AI R&D automation, humans might have only another few months of time to work on alignment prior to needing to hand off to AI systems (or some worse option). Additionally, only around 3-20% of the effort would be focused on safety-relevant alignment by default. If we instead add 2 years of delay and spend most of the effort in those years on alignment, that would increase from a few months at maybe 7% effort to 2 years at maybe 70% effort. This is a 10x increase in serial time and a 10x increase in effort during that serial time. There would be substantial prior work on alignment, but it might not transfer sufficiently (and might just not be sufficient given that earlier work wasn’t accelerated by AIs), so work during this period could be a key bottleneck. Most guesses about returns to effort would have this reduce risk by a decent amount, given the large increase in overall effort and time while having access to a critical level of AI systems. More simply: Prima facie, it seems like putting in much more work on a problem would be very helpful.
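As a quick sanity check on those numbers (the 2.4-month baseline is an illustrative assumption standing in for “a few months”; the 7% and 70% effort shares are the guesses above):

```python
# Back-of-envelope check of the "10x serial time, 10x effort" claim.
baseline_months = 2.4   # assumed stand-in for "a few months" before handoff
baseline_effort = 0.07  # ~7% of effort on safety-relevant alignment
delay_months = 24       # 2 years of added delay
delay_effort = 0.70     # ~70% of effort on alignment during the delay

serial_ratio = delay_months / baseline_months  # ~10x serial time
effort_ratio = delay_effort / baseline_effort  # ~10x effort share
print(f"{serial_ratio:.0f}x serial time, {effort_ratio:.0f}x effort share, "
      f"{serial_ratio * effort_ratio:.0f}x total alignment work in the window")
```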
We’d also want to make the slowdown conditional on not already having alignment solutions robust enough that we’re quite confident a rapid handoff is safe. But I’m quite skeptical we’ll have this quickly (and I’d guess you’d agree?), so I don’t think this makes a big difference to the bottom line.
Of course, there would still be serious practical difficulties in actually implementing a well-timed conditional slowdown, and operationalizing the exact criteria would be important.
I’m also concerned about the type of governmental structures and centralization of power required to enforce such a pause. I think pausing AI would seriously risk creating a much less free and dynamic world. Even if we slightly reduce existential risks by establishing an international AI pause committee, we should still be concerned about the type of world we’re creating through such a course of action.
Interestingly, I have the opposite view: a well-timed slowdown would probably reduce concentration of power, at least if takeoff would otherwise have been fast. If takeoff is quite fast, then the broader world won’t have much time to respond to developments, which would make it more likely that power greatly concentrates by default. People would need time to notice the situation and take measures to avoid being disempowered. As a more specific case, AI-enabled coups seem much more likely if takeoff is fast, and thus intervening to slow down takeoff (so there is more time for various controls etc. to be put in place) would help a lot with that.
I think this effect is substantially larger than the costs (centralization, less dynamism, etc.) of enforcing a 1-2 year slowdown. (Separately, I expect power will probably be so concentrated by default that the additional requirements to enforce a 1-2 year slowdown seem pretty negligible in comparison. I can easily imagine the deals etc. made to enforce a slowdown decentralizing power on net (as they would require oversight by a larger number of actors and allow more humans to get some influence over the situation), though this presumably wouldn’t be the easiest way to achieve this objective. I think a situation pretty similar to the AI 2027 scenario, where an extremely small group of people have massive de facto power, is quite likely, and this could easily result in pretty close to maximal concentration of power longer term.)
Suppose we could do a reasonable job implementing a conditional slowdown like this, where we try to ensure at least a 2.5-year gap (if alignment issues aren’t robustly solved) between full AI R&D automation and very qualitatively superhuman AI. Do you think such a slowdown would be good on your views and values?
My views are that misaligned AI takeover is about 30% likely. Conditional on misaligned AI takeover, I’d guess (with very low confidence) that maybe 1/2 of humans die in expectation, with a 1/4 chance of literal human extinction. Interestingly, this means we don’t disagree that much about the chance that AI will directly cause humanity to go extinct in the next 30 years: I’d put around 6% on this claim and you’re at 4%. (6% ≈ 85% chance of TAI × 30% chance of takeover conditional on TAI × 25% chance of extinction conditional on takeover.) However, as found in prior conversations, we do disagree a bunch on how bad misaligned AI takeover is, for various reasons. It’s also worth noting that in some worlds where humans survive, they (or some fraction of them) might be mistreated by the AI systems with power over them in ways which make their lives substantially worse than they are now. So, overall, my sense is that from a myopic perspective that only cares about the lives of currently alive humans, misaligned AI takeover is roughly as bad as 3/5 of people dying in expectation. So, if we think each year of delay costs the equivalent of 0.5% of humans dying and we only care about currently living humans, then a ~1/40th reduction in takeover risk is worth a year of delay on my views.
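For transparency, here’s the arithmetic behind these numbers. Reading the “~1/40th reduction” as a relative cut to the ~30% takeover risk is my interpretation rather than something stated explicitly above.

```python
# Checking the extinction estimate and the break-even point for a year of delay.
p_tai = 0.85                        # chance of TAI
p_takeover_given_tai = 0.30         # misaligned takeover, conditional on TAI
p_extinction_given_takeover = 0.25  # literal extinction, conditional on takeover

p_extinction = p_tai * p_takeover_given_tai * p_extinction_given_takeover
print(f"extinction risk: {p_extinction:.1%}")  # ~6.4%, i.e. roughly 6%

takeover_badness = 3 / 5      # takeover ~ as bad as 3/5 of people dying
delay_cost_per_year = 0.005   # each year of delay ~ 0.5% of humans dying

# Assumed reading: a ~1/40 relative reduction to the ~30% takeover risk.
risk_reduction = p_takeover_given_tai / 40
benefit = risk_reduction * takeover_badness
print(f"benefit: {benefit:.2%} vs. cost {delay_cost_per_year:.2%} per year")
# -> ~0.45% vs. 0.50%: roughly break-even, consistent with the claim above.
```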