Because LLMs are already avoiding being shut down: https://arxiv.org/abs/2509.14260 .
Very interesting, thanks. As I said in the review, I wish there was more of this kind of thing in the book.
If your terminal goal is to enjoy watching a good movie, you can’t achieve it if you’re dead/shut down.
If your terminal goal is for you to watch the movie, then sure. But if your terminal goal is that the movie be watched, then shutting you down might well be perfectly consistent with it.
Ok, let’s say there is an “in between” period, and let’s say we win the fight against a misaligned AI. After the fight, we will still be left with the same alignment problems, as other people in this thread pointed out. We will still need to figure out how to make safe, benevolent AI, because there is no guarantee that we will win the next fight, or the fight after that, or the one after that, etc.
At that point, the shut-down argument is no longer speculative, and you can probably actually do it.
To be clear, I’m not saying that’s a good plan if you can foresee all the developments in advance. But if you’re uncertain about all of it, then it seems likely that there will be a period of time, before it’s necessarily too late, when a lot of the uncertainty is resolved.
It’s a bit of both.
Suppose there are no warning shots. A hypothetical AI that’s a bit weaker than humanity but still awfully impressive doesn’t do anything at all that manifests an intent to harm us. That could mean:
1. The next, somewhat more capable version of this AI will not have any intent to harm us, because through either luck or design we’ve ended up with a non-threatening AI.
2. This version of the AI is biding its time to strike and is sufficiently good at deception that we miss that fact.
3. This AI is fine, but making it a little smarter/more capable will somehow lead to the emergence of malign intent.
I take Yudkowsky and Soares to put all the weight on #2 and #3 (with, based on their scenario, perhaps more of it on #2).
I don’t think that’s right. I think if we have reached the point where an AI really could plausibly start and win a war with us, and it doesn’t do anything nasty, there’s a fairly good chance we’re in #1. We may not even really understand how we got into #1, but sometimes things just work out.
I’m not saying this is some kind of great strategy for dealing with the risk; the scenario I’m describing is one where there’s a real chance we all die, and I don’t think you get a strong signal until you get into the range where the AI might win, which is a bad range. But it’s still very different from imagining that the AI will inherently wait to strike until it has ironclad advantages.