It does seem likely to me that a large fraction of all “doom from unaligned AGI” comes relatively soon after the first AGI that is better at improving AGI than humans are. I tend to think of the question in terms of several bundles of scenarios:
1. AGI is actually not something we can build. Even in timelines where we advance the technology for a long time, we only get systems that are not as smart as us in the ways that matter for control of the future. Alignment is irrelevant, and P(doom) is approximately 0.
2. Alignment turns out to be relatively easy and reliable. The only risk comes from AGI arriving before anyone has a chance to find the easy and safe solution. In worlds where the first AGIs are aligned, they can quite safely self-improve and remain aligned, and with their capabilities they can easily spot and deal with the few unaligned AGIs that do come up before they become a problem. P(doom) is relatively low and stays low.
3. Alignment is difficult, but it turns out that once you’ve solved it, it’s solved: you can scale the same principles up to any level of capability. P(doom by year X) rises higher than in scenario 2, because there is a greater chance that powerful AGI arrives before alignment is solved, but then it plateaus just as rapidly.
4. Alignment is both difficult and risky. AGIs that self-improve by orders of magnitude face new alignment problems, so the most highly capable AGIs are much more likely to be misaligned with humanity than less capable ones. P(doom by year X) keeps increasing for every year in which AGI plausibly exists, though the remaining probability mass is weighted more and more heavily toward worlds in which civilization never develops AGI.
5. Alignment is essentially impossible. If we get superhuman AGIs at all, one of the earliest almost certainly kills everyone one way or another. P(doom by year X) goes quickly toward 1 in every possible future in which AGI plausibly exists.
Only in scenario 4 do you see a steady increase in P(doom) over long time spans, and even in that bundle the surviving probability mass fairly rapidly concentrates in timelines in which no AGI ever exists for some reason or other.
This is why I think it’s meaningful to ask for P(doom) without a specified time span. If we somehow found out that scenario 4 was actually true, then it might be worth asking in more detail about time scales.
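For concreteness, here is a minimal toy sketch (my own illustration, with made-up per-year hazard numbers, not a forecast) of how these bundles produce different shapes for P(doom by year X): risk confined to an early transition window plateaus, recurring risk from repeated self-improvement keeps climbing, and near-certain early failure saturates almost immediately.

```python
# Toy sketch only: made-up per-year hazard numbers chosen to illustrate the
# qualitative shapes of P(doom by year X) in the scenario bundles above.

def cumulative_p_doom(yearly_hazards):
    """Convert per-year doom hazards into cumulative P(doom by year X)."""
    p_survive = 1.0
    curve = []
    for h in yearly_hazards:
        p_survive *= 1.0 - h           # survive this year only if no doom event
        curve.append(1.0 - p_survive)  # cumulative probability of doom so far
    return curve

YEARS = 50

scenarios = {
    # Scenarios 2/3: risk confined to an early transition window, then ~0.
    "plateau (2/3)":   [0.02] * 10 + [0.00] * (YEARS - 10),
    # Scenario 4: each new round of self-improvement brings fresh alignment risk.
    "recurring (4)":   [0.02] * YEARS,
    # Scenario 5: near-certain doom soon after the first superhuman AGI.
    "early spike (5)": [0.50] * 5 + [0.00] * (YEARS - 5),
}

for name, hazards in scenarios.items():
    curve = cumulative_p_doom(hazards)
    print(f"{name:15s} P(doom) by year 10: {curve[9]:.2f}, by year {YEARS}: {curve[-1]:.2f}")
```

The exact numbers are arbitrary; the point is only that the “plateaus vs. keeps climbing” distinction falls out of whether the per-year hazard ever goes to zero.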