I’m not sure whether your reasoning is “well, in the scenarios with what I’ve heard Yudkowsky call a “fast takeoff”, we’re dead, so let’s think about other scenarios where we have an easier time having an impact”. Like, no, we live in some specific world; if this is a world where surviving is hard, better to actually try to save everyone than to flinch away and hope for a “slow takeoff”.
Yes, we live in some specific world, but we don’t know which one. We can only guess. To simplify: if I have an 80% belief that we live in a slow takeoff world and a 20% belief that we live in a fast takeoff world, and I think that strategy A has a 50% chance of working in the former kind of world and none in the latter, whereas strategy B has a 5% chance of working regardless of the world, I’ll still go with strategy A, because that gives me a 40% overall chance of getting out of this alive. And yes, in this case it helps that I place a higher probability on this being a slow takeoff kind of world.
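To make the arithmetic explicit, here’s a minimal sketch of that expected-success comparison (the probabilities are the illustrative ones above, not real estimates):

```python
# Illustrative expected-success comparison using the made-up numbers above.
p_slow, p_fast = 0.8, 0.2  # credence in slow vs. fast takeoff worlds

# P(strategy succeeds | world type)
strategy_a = {"slow": 0.50, "fast": 0.00}
strategy_b = {"slow": 0.05, "fast": 0.05}

def expected_success(strategy):
    return p_slow * strategy["slow"] + p_fast * strategy["fast"]

print(expected_success(strategy_a))  # 0.40 -> strategy A wins overall
print(expected_success(strategy_b))  # 0.05
```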
What’s your model of what you call “slow takeoff” after the AI is smarter than humans, including being better than humans at finding and exploiting zero-days? Or what’s your model for how we don’t get to that point?
Honestly, I think it’s a tad more complicated than that. Don’t get me wrong, any world with smarter-than-human AGI in it is already incredibly dangerous and teetering on the brink of a bunch of possible disasters, but I don’t expect it to end in instant paperclipping either. There are two main thrusts behind this belief:
the means problem: to me, the big distinction between a slow and a fast takeoff world is how much technological low-hanging fruit exists for such a smarter-than-human AGI to reap and use. Whatever its goals, killing us is unlikely to be a terminal one. It can be instrumental, but then it’s only the optimal policy if the AI can exist independently of us. In a world in which its intelligence is easily translated into effective replacements for us for all maintenance and infrastructural needs (e.g. repair nanobots and the like), we’re dead. In a world in which that’s not the case, we experience a period in which the AI behaves nicely and feeds us robotics technology, putting all the pieces in place before it can kill us;
the motive problem: I think that different kinds of AGI would also differ wildly in their drive to kill us at all. There are still inherent dangers in having them around, but not every AGI would be a paperclipper that very deliberately aims at removing us from the board as one of its first steps. I’d expect that from a general AlphaZero-like RL agent trained on a specific goal. I wouldn’t expect it from a really, really smart LLM, because those are less focused (the goal the LLM was trained on isn’t the same as the goal you give to the simulacra) and more shaped specifically by human content. This doesn’t make them aligned, but it does, I think, make them less alien than the former example, to the point that they’d probably take a different approach. Again, I would still be extremely wary of them; I just don’t think they’d get murder-y as their very first step.
If your competence on other AI-related risks has already been confirmed by the experts who work for/with the government on e.g. biorisks, does talking about x-risk before warning shots happen make it harder to persuade them about x-risk later?
Ok, so to be clear, I’m not saying we should NOT talk about x-risk, or that we shouldn’t push the fact that it’s an important possibility to always keep in mind. But I see this more as preparing the terrain. IF we ever get a warning shot, then, if the seed has already been planted, we get a more coherent and consistent response. But I don’t think anyone would commit to anything sufficiently drastic on purely theoretical grounds. So I expect that right now the achievable wins are much more limited, and many of them are still good in how they might, e.g., shift incentives to make alignment and interpretability more desirable and valuable than blind capability improvement. But yes, by all means, x-risk should be put on the table right away.