ryan_greenblatt comments on I enjoyed most of IABIED

ryan_greenblatt 19 Sep 2025 17:14 UTC
24 points
3
I have never heard of a remotely realistic-seeming story for how things will be OK, without something that looks like coordination to not build ASI for quite a while.

I wonder if we should talk about this at some point. This perspective feels pretty wild to me and I don’t immediately understand where the crux lives.
- Do you think it will be extremely hard to avoid scheming in human-ish level AIs?
- Do you think it will be extremely hard to get not-scheming-against-us human-ish level AIs aligned enough that handing over safety work to them is competitive with emulated humans (at the same speed and cost)?
- Do you think that AIs trying hard to ongoingly ensure alignment and given some lead time will fail because alignment is much, much harder than capabilites? (E.g., full AGI can’t align +1 SD AGI given 1 wall clock year, or +1 SD AGI can’t align +2 SD AGI given 6 wall clock months, or so on.)
- Maybe you’re including “>1 years of lead time spent on safety” under “coordination to not build ASI for quite a while” and you think this is extremely unlikely?
- habryka 19 Sep 2025 18:06 UTC
  8 points
  2
  Parent
  I wonder if we should talk about this at some point.
  I would definitely be interested!
  Some things to respond now:
  One of the key thing is just that you end up with recursive self-improvement/automated AI software development and then everything happens much quicker. I think at the very least you need to intervene to stop that feedback loop. Like, we aren’t then talking about “scheming in human-ish level AIs”, we are then talking about “scheming in galaxy-brain level AIs”, and yes, I think it’s extremely unlikely that if you take anything remotely close to current AI systems and let them recursively self-improve/automate AI development at 1000x human speeds, you end up with a system that is aligned with humanity.
  Do you think it will be extremely hard to get not-scheming-against-us human-ish level AIs aligned enough that handing over safety work to them is competitive with emulated humans (at the same speed and cost)?
  This does seem very hard! But most importantly, it seems to me that among the first thing such human-ish level AIs would do is to coordinate to not have any subset of them build a much smarter system than themselves (including other actors that might be at different companies or datacenters).
  Like, yeah, their work on facilitating a slowdown/pause might be at a kind of similar level to emulated humans. It again seems extremely unlikely that they would succeed at aligning a runaway intelligence explosion.
  Maybe you’re including “>1 years of lead time spent on safety” under “coordination to not build ASI for quite a while” and you think this is extremely unlikely?
  Wall clock time and subjective time come apart a good amount here, and it’s a bit confusing which one to care about. A few thoughts:
  - A lead time of >1 year does seem pretty unlikely at this point. My guess would be like 25% likely? So already this isn’t going to work in 75% of worlds.
  - Again, the key thing here is you need some force that prevents people and AIs from building vastly superhuman ASI systems.
  - It’s not implausible to me that then, with lots of AI assistance that isn’t at a critical level, you can start inching in slowly into the strongly superhuman domain without everything going badly. I think it’s very unlikely you can do it very quickly.
  - This might allow us to get to something like safe ASI on the scale of single-digit years, but man, this just seems like such an insane risk to take, that I really hope we instead use the AI systems to coordinate a longer pause, which seems like a much easier task.
  - ryan_greenblatt 19 Sep 2025 23:48 UTC
    6 points
    1
    Parent
    
    I think at the very least you need to intervene to stop that feedback loop.
    
    There’s probably at least some disagreement here. I think even if you let takeoff proceed at the default rate with a small fraction (e.g. 5%) explicitly spent on reasonably targeted alignment work at each point (as in, 5% beyond what is purely commercially expedient), you have a reasonable chance of avoiding AI takeover (maybe 50% chance of misaligned AI takeover?). Some of this is due to the possibility of takeoff being relatively slower and more compute constrained (which you might think is very unlikely?). I also think that there is a decent chance that you get a higher fraction spent on safety after handing off to AIs or after getting advice from highly capable AIs even if this doesn’t happen before this.
    
    It again seems extremely unlikely that they would succeed at aligning a runaway intelligence explosion.
    
    I don’t feel so confident—these AIs might have a decent amount of subjective time and total cognitive labor between each unit of increase in capabilities as the intelligence explosion continues such that they can keep things on track. Intuitively, capabilities might be more compute bottlenecked than alignment, so it should pull ahead if we can start with actually aligned (and wise) AIs (which is not easy to achieve to be clear!).
    
    A lead time of >1 year does seem pretty unlikely at this point. My guess would be like 25% likely? So already this isn’t going to work in 75% of worlds.
    
    I agree with around 25% likely.
    
    This might allow us to get to something like safe ASI on the scale of single-digit years, but man, this just seems like such an insane risk to take, that I really hope we instead use the AI systems to coordinate a longer pause, which seems like a much easier task.
    
    I agree that coordinating a longer pause looks pretty good, but I’m not so sure about the relative feasibility given only the use of AIs that are somewhat more capable than top human experts (regardless of whether these AIs are running things). I think it might be much harder to buy 10 years of time than 2 years given the constraints at the time (including limited political will) and I’m not so sure aligning somewhat more powerful AIs will be harder (and then these somewhat more powerful AIs can align even more powerful AIs and this either bottoms out in a scalable solution to alignment or in powerful enough capabilities that they actually can buy more time).
    
    One general note: I do think that “buying time along the way” (either before handing off to AIs or after) is quite helpful for making the situation go well. However, I can also imagine worlds where things go fine and we didn’t buy much/any time (especially if takeoff is naturally on the slower side).
    What links here?
    Thane Ruthenis's comment on faul_sname’s Shortform by faul_sname (18 Oct 2025 13:20 UTC; 8 points)
    ryan_greenblatt's comment on faul_sname’s Shortform by faul_sname (19 Oct 2025 4:32 UTC; 6 points)
    Alignment progress doesn’t compensate for higher capabilities by Joe Rogero (9 Oct 2025 16:06 UTC; 2 points)
- Hastings 19 Sep 2025 17:39 UTC
  3 points
  0
  Parent
  Do you have a realistic seeming story in mind?
  - ryan_greenblatt 19 Sep 2025 18:05 UTC
    5 points
    5
    Parent
    Sure, several. E.g.:
    
    USG cares a decent amount and leading AI companies are on board, so they try to buy several additional years to work on safety.
    We scale to roughly top human expert level while ensuring control.
    Over time, we lower the risk of scheming at this level of capability through a bunch of empirical experiments and new interventions developed using a bunch of AI labor.
    We relax our control measures and increasingly work on making AIs generally very trustworthy, including on hard-to-check open ended tasks. We do a bunch of studies of this.
    This ends up not being that hard due to somewhat favorable generalization.
    We handoff to AIs and they align their successors and so on.
    What links here?
    4 Scenarios & 2 Modifiers: My Current AI-Progress Models by Thane Ruthenis (3 Oct 2025 21:33 UTC; 9 points)
    - habryka 19 Sep 2025 18:19 UTC
      16 points
      0
      Parent
      USG cares a decent amount and leading AI companies are on board, so they try to buy several additional years to work on safety.
      Your first paragraph is an example of “something that looks like coordination to not build ASI for quite a while”! “Several additional years” is definitely “quite a while”!
      I am not sure whether the other bullet lists are all supposed to take place within those few years, or whether you are expecting further cautious actions that slow things down. It sounds like at least within the USG we are coordinating to not build ASI, and generally are successfully establishing going carefully and slowly.
      And then even after these bullet lists are over, my best guess is the AIs we “handed over” to would still decide to go quite slowly themselves, probably establishing some global coordination to go sufficiently slowly. My best guess is we also will have just wanted to do that earlier in collaboration with those AI systems.
      - ryan_greenblatt 20 Sep 2025 0:02 UTC
        5 points
        0
        Parent
        
        Your first paragraph is an example of “something that looks like coordination to not build ASI for quite a while”! “Several additional years” is definitely “quite a while”!
        
        Ok, if you count a several additional years as quite a while, then we’re probably closer to agreement.
        
        For this scenario, I was imagining all these actions happen within 2 years of lead time. In practice, we should keep trying to buy additional lead time prior to it making sense to handoff to AIs and the AIs we handoff to will probably want to try to buy lead time (especially if there are strategies which are easier post handoff, e.g. due to leveraging labor from more powerful systems).
        
        I’m unsure about the difficulty of buying different amounts of lead time and it seems like it might be harder to buy lead time than to ongoingly ensure the alignment of later AIs. Eventually, we have to do some kind of a handoff and I think it’s safer to do this handoff to AIs that aren’t substantially more capable than top human experts in general purpose qualitative capabilties (like I think you want to handoff at roughly the minimum level of capability where the AIs are clearly capable enough to dominate humans, including at conceptually tricky work).