Just FWIW, I have never heard of a remotely realistic-seeming story for how things will be OK, without something that looks like coordination to not build ASI for quite a while. The only difference in belief I might have from some of the vibes in the book is that I believe that maybe[1] we get lucky and we can use the AIs to coordinate humanity to not build ASI for quite a while.
But even from my epistemic vantage point, I think it’s kind of OK for the book to not talk through that specific plan, because it’s ex-ante pretty crazy. Hoping that the mildly superhuman AIs will turn out to be useful enough to prevent us from building ASI is really not a good plan. We might get lucky, but I do think you should really try very hard to not rely on getting that lucky.
AI is almost guaranteed to cause an existential catastrophe.
Relatedly, the book argues pretty clearly for “ASI, if built using the current methods we know about, is almost guaranteed to cause an existential catastrophe”, which is very importantly different from the quote above, and this still seems just really straightforwardly true to me. My guess is even with lots of AI assistance we will first end up doing something that looks more like “really slowing down a lot” if things end up fine.
I personally do think it would have been good for the book to say some more about how there is some chance we might end up being able to coordinate a pause using somewhat superhuman systems, but it does seem really hard to get across while also being clear that this would be a really bad and terrible plan to rely on, and the first and obvious thing to do is to not build these systems in the first place. Like, I feel like the reasonable reaction most people reading the book would have to such a section would be “why go into this in the first place if it seems like such a bad idea? I feel like you should have used this space fleshing out a plan that isn’t crazily risky”.
[1] Like 5-15%?
I think the discussion wouldn’t have to be like “here’s a crazy plan”.
I think there could have been something more like: “Important fact to understand about the situation: Even if superintelligence comes within the next 10 years, it’s pretty likely that sub-ASI systems will have had a huge impact on the world by then — changing the world in a few-year period more than any technology ever has changed the world in a few-year period. It’s hard to predict what this would look like [easy calls, hard calls, etc]. Some possible implications could be: [long list: …, automated alignment research, AI-enabled coordination, people being a lot more awake to the risks of ASI, lots of people being in relationships with AIs and being supportive of AI rights, not-egregiously-misaligned AIs that are almost as good at bio/cyber/etc as the superintelligences...]. Some of these things could be helpful, some could be harmful. Through making us more uncertain about the situation, this lowers our confidence that everyone will die. In particular, some chance that X, Y, Z turns out really helpful. But obviously, if we see humanity as an agent, it would be a dumb plan for humanity to just assume that this crazy, hard-to-predict mess will save the whole situation.”
I.e. it could be presented as an important thing to understand about the strategic situation rather than as a proposed plan.
I agree that a section like this would be good!
Modulo the sentence “this lowers our confidence that everyone will die”, since I don’t think it’s what they believe, or what I believe, though it’s messy. My guess is this period will also be majorly responsible for increasing risk by creating tons of economic momentum that then makes it hard to stop when you get to really risky AI, and so my best guess is the overall technological diffusion will make things riskier rather than less risky, though I don’t have a strong take either way.
Having the economic incentives plus other things explained, and being like “and yep, this seems like it might make things worse or better, it makes it harder to be confident about how things go, though the core difficulty remains”, would be good.
Through making us more uncertain about the situation, this lowers our confidence that everyone will die.
This seems to rely on the assumption that “there’s nowhere to go but up”: that we’re pretty certain of doom, so wildcards in the future can only make us less certain. But I don’t think that works. Wildcards in the future can also increase s-risks, and there’s no limit to how bad things can get, potentially much worse than extinction.
I have never heard of a remotely realistic-seeming story for how things will be OK, without something that looks like coordination to not build ASI for quite a while.
I wonder if we should talk about this at some point. This perspective feels pretty wild to me and I don’t immediately understand where the crux lives.
Do you think it will be extremely hard to avoid scheming in human-ish level AIs?
Do you think it will be extremely hard to get not-scheming-against-us human-ish level AIs aligned enough that handing over safety work to them is competitive with emulated humans (at the same speed and cost)?
Do you think that AIs trying hard to ongoingly ensure alignment and given some lead time will fail because alignment is much, much harder than capabilities? (E.g., full AGI can’t align +1 SD AGI given 1 wall clock year, or +1 SD AGI can’t align +2 SD AGI given 6 wall clock months, or so on.)
Maybe you’re including “>1 year of lead time spent on safety” under “coordination to not build ASI for quite a while” and you think this is extremely unlikely?
I wonder if we should talk about this at some point.
I would definitely be interested!
Some things to respond to now:
One of the key things is just that you end up with recursive self-improvement/automated AI software development and then everything happens much quicker. I think at the very least you need to intervene to stop that feedback loop. Like, we aren’t then talking about “scheming in human-ish level AIs”, we are then talking about “scheming in galaxy-brain level AIs”, and yes, I think it’s extremely unlikely that if you take anything remotely close to current AI systems and let them recursively self-improve/automate AI development at 1000x human speeds, you end up with a system that is aligned with humanity.
Do you think it will be extremely hard to get not-scheming-against-us human-ish level AIs aligned enough that handing over safety work to them is competitive with emulated humans (at the same speed and cost)?
This does seem very hard! But most importantly, it seems to me that among the first things such human-ish level AIs would do is coordinate to not have any subset of them build a much smarter system than themselves (including other actors that might be at different companies or datacenters).
Like, yeah, their work on facilitating a slowdown/pause might be at a kind of similar level to emulated humans. It again seems extremely unlikely that they would succeed at aligning a runaway intelligence explosion.
Maybe you’re including “>1 year of lead time spent on safety” under “coordination to not build ASI for quite a while” and you think this is extremely unlikely?
Wall clock time and subjective time come apart a good amount here, and it’s a bit confusing which one to care about. A few thoughts:
A lead time of >1 year does seem pretty unlikely at this point. My guess would be like 25% likely? So already this isn’t going to work in 75% of worlds.
Again, the key thing here is you need some force that prevents people and AIs from building vastly superhuman ASI systems.
It’s not implausible to me that then, with lots of AI assistance that isn’t at a critical level, you can start slowly inching into the strongly superhuman domain without everything going badly. I think it’s very unlikely you can do it very quickly.
This might allow us to get to something like safe ASI on the scale of single-digit years, but man, this just seems like such an insane risk to take, that I really hope we instead use the AI systems to coordinate a longer pause, which seems like a much easier task.
I think at the very least you need to intervene to stop that feedback loop.
There’s probably at least some disagreement here. I think even if you let takeoff proceed at the default rate with a small fraction (e.g. 5%) explicitly spent on reasonably targeted alignment work at each point (as in, 5% beyond what is purely commercially expedient), you have a reasonable chance of avoiding AI takeover (maybe 50% chance of misaligned AI takeover?). Some of this is due to the possibility of takeoff being relatively slower and more compute-constrained (which you might think is very unlikely?). I also think that there is a decent chance that you get a higher fraction spent on safety after handing off to AIs or after getting advice from highly capable AIs, even if this doesn’t happen earlier.
It again seems extremely unlikely that they would succeed at aligning a runaway intelligence explosion.
I don’t feel so confident: these AIs might have a decent amount of subjective time and total cognitive labor between each unit of increase in capabilities as the intelligence explosion continues, such that they can keep things on track. Intuitively, capabilities might be more compute-bottlenecked than alignment, so alignment should pull ahead if we can start with actually aligned (and wise) AIs (which is not easy to achieve, to be clear!).
A lead time of >1 year does seem pretty unlikely at this point. My guess would be like 25% likely? So already this isn’t going to work in 75% of worlds.
I agree; around 25% seems right.
This might allow us to get to something like safe ASI on the scale of single-digit years, but man, this just seems like such an insane risk to take, that I really hope we instead use the AI systems to coordinate a longer pause, which seems like a much easier task.
I agree that coordinating a longer pause looks pretty good, but I’m not so sure about the relative feasibility given only the use of AIs that are somewhat more capable than top human experts (regardless of whether these AIs are running things). I think it might be much harder to buy 10 years of time than 2 years given the constraints at the time (including limited political will), and I’m not so sure that aligning somewhat more powerful AIs will be harder (and then these somewhat more powerful AIs can align even more powerful AIs, and this either bottoms out in a scalable solution to alignment or in powerful enough capabilities that they actually can buy more time).
One general note: I do think that “buying time along the way” (either before handing off to AIs or after) is quite helpful for making the situation go well. However, I can also imagine worlds where things go fine and we didn’t buy much/any time (especially if takeoff is naturally on the slower side).
Do you have a realistic-seeming story in mind?
Sure, several. E.g.:
USG cares a decent amount and leading AI companies are on board, so they try to buy several additional years to work on safety.
We scale to roughly top human expert level while ensuring control.
Over time, we lower the risk of scheming at this level of capability through a bunch of empirical experiments and new interventions developed using a bunch of AI labor.
We relax our control measures and increasingly work on making AIs generally very trustworthy, including on hard-to-check, open-ended tasks. We do a bunch of studies of this.
This ends up not being that hard due to somewhat favorable generalization.
We hand off to AIs and they align their successors and so on.
USG cares a decent amount and leading AI companies are on board, so they try to buy several additional years to work on safety.
Your first paragraph is an example of “something that looks like coordination to not build ASI for quite a while”! “Several additional years” is definitely “quite a while”!
I am not sure whether the other bullet points are all supposed to take place within those few years, or whether you are expecting further cautious actions that slow things down. It sounds like at least within the USG we are coordinating to not build ASI, and generally are successfully establishing a norm of going carefully and slowly.
And then even after these bullet points are over, my best guess is the AIs we “handed over” to would still decide to go quite slowly themselves, probably establishing some global coordination to go sufficiently slowly. My best guess is that we also will have just wanted to do that earlier, in collaboration with those AI systems.
Your first paragraph is an example of “something that looks like coordination to not build ASI for quite a while”! “Several additional years” is definitely “quite a while”!
Ok, if you count several additional years as quite a while, then we’re probably closer to agreement.
For this scenario, I was imagining all these actions happen within 2 years of lead time. In practice, we should keep trying to buy additional lead time prior to it making sense to hand off to AIs, and the AIs we hand off to will probably want to try to buy lead time (especially if there are strategies which are easier post-handoff, e.g. due to leveraging labor from more powerful systems).
I’m unsure about the difficulty of buying different amounts of lead time, and it seems like it might be harder to buy lead time than to ongoingly ensure the alignment of later AIs. Eventually, we have to do some kind of a handoff, and I think it’s safer to do this handoff to AIs that aren’t substantially more capable than top human experts in general purpose qualitative capabilities (like I think you want to hand off at roughly the minimum level of capability where the AIs are clearly capable enough to dominate humans, including at conceptually tricky work).
a remotely realistic-seeming story for how things will be OK, without something that looks like coordination to not build ASI for quite a while
My mainline scenario is something like:
LLM scaling and tinkering peters out in the next few years without reaching capacity for autonomous R&D. LLMs end up being good enough to displace some entry-level jobs, but the hype bubble bursts and we enter a new AI winter for at least a couple of decades.
The “intelligence” thingie turns out to be actually hard and not amenable to a bag of simple tricks with a mountain of compute, for reasons gestured at in Realism about rationality. Never mind ASI, we’re likely very far from being able to instantiate an AGI worthy of the name, which won’t happen while we remain essentially clueless about this stuff.
I also expect that each subsequent metaphorical AI “IQ point” will be harder to achieve, not easier, so no foom or swift takeover. Of course, even assuming all that, it still doesn’t guarantee that “things will be OK”, but I’m sufficiently uncertain either way.