[other contributors likely disagree with me in various places]
The Palisade shutdown paper shows that LLMs will in fact resist a shutdown order under certain conditions (this holds across a wide variety of contexts, and some are worse than others). It’s easy to imagine regimes in which it is hard to shut down an LLM (e.g., if your LLM hacks in ways that leverage a larger action space than you might otherwise expect), particularly if you haven’t set up infrastructure in advance that makes it possible to, say, non-catastrophically cut power to a data center, guarantee that the model hasn’t exfiltrated itself, and so on through a whole bunch of somewhat adventurous scenarios.[1] The ability to actually shut down models that exhibit worrisome behavior basically does not exist at this point, and there are ways to give models quite a lot of access to the external world (scaffolding), which they could, beyond some capabilities threshold, use to protect themselves or evade shutdown. The capacity to protect ourselves against these dangerous capabilities is precisely the ‘off switch’ idea from the post. I read you as saying “just turn it off,” and my response is “Yeah, we want to be able to; there’s just literally not a way to do that at scale,” and especially not a way to do that at scale if the systems are doing anything at all to try to stop you (as they’ve shown some propensity to do above).
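To make the ‘off switch’ point a bit more concrete, here is a minimal sketch, in Python, of what a scaffold-level shutdown gate could look like. Everything here is hypothetical on my part (the names `KillSwitch` and `gated_tool_call`, the flag-file mechanism); it is not a description of any lab’s actual infrastructure, and a toy layer like this does nothing about a model that has already exfiltrated itself or has enough access to route around it.

```python
# Hypothetical sketch of a scaffold-level "off switch": every tool call is gated
# on an externally controlled kill flag that lives outside the agent's sandbox.
# Illustrative only; not any lab's real API or infrastructure.

import os


class ShutdownRequested(Exception):
    """Raised when the operator has flipped the kill flag."""


class KillSwitch:
    def __init__(self, flag_path: str = "/var/run/agent_kill_flag"):
        # The flag path is assumed to be writable only by the operator,
        # so the agent cannot simply delete it to keep running.
        self.flag_path = flag_path

    def triggered(self) -> bool:
        return os.path.exists(self.flag_path)


def gated_tool_call(kill_switch: KillSwitch, tool, *args, **kwargs):
    """Execute a tool call only if shutdown has not been requested."""
    if kill_switch.triggered():
        raise ShutdownRequested("Operator requested shutdown; refusing tool call.")
    return tool(*args, **kwargs)
```

The point above is that, at scale, even this trivial kind of gate mostly doesn’t exist in deployed scaffolds, and a sufficiently capable system with broad real-world access could plausibly work around it anyway.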
that is the same type of thing as GPT5, but much smarter, that doesn’t have this same property.
This article purposely does not take a strong stance on whether we ought to be worried about GPT-N itself or about some new AI paradigm developed with the help of GPT-N’s massive potential acceleration of software development. Many agent foundations researchers, for instance, believe that LLMs can’t or won’t scale to ASI for various reasons, but are still concerned about x-risk, and may still have pretty short timelines, owing to a general acceleration toward superintelligence precipitated by LLMs (whether as coding assistants, as a stimulus for investment into the space, or something else).
(To be clear, at least two of the labs (four, depending on who you ask) are now publicly aiming at superintelligence, which is exactly the type of thing we think is a very bad idea; I bring this up just to say ‘it is not only doomers who think LLMs play a role in the development of smarter-than-human machines’, whether those machines end up being LLMs or some novel post-LLM architecture.)
(e.g., GPT-5 was predictable from looking at the capabilities of GPT-2 and reading Kaplan et al.; a system with the properties described above is not)
We see rapid capability development in strategically-relevant domains (math, science, coding, game-playing), and we see LLMs dipping their toes (at the very least) into concerning actions in experimental settings (and, to some extent, in real life). Seeing GPT-5 in GPT-2 because you’ve read about scaling laws doesn’t seem that different from:
- Observing LLMs get good at stuff fast (when the techniques for making them good at that stuff are scalable)
- Observing LLMs sometimes get good at things you didn’t intend them to get good at (as a side effect of making them very good at other things intentionally)
- Observing that the ceiling on LLM capabilities in narrow domains appears to be north of human level
- Observing that LLMs are starting to get good at spooky, dark-arts-y stuff like lying and cheating and hacking and driving people insane (without us wanting them to!)
- Concluding that it seems likely they’ll keep getting better at those dark-arts-y things, eventually well enough to outclass humans, at which point we’re in a lot of trouble.
This is just a quick write-up to clarify the extrapolation, not a comprehensive argument for all sides of the issue, or even this one side. I just don’t see this as much more presumptuous or unreasonable than seeing trillion-parameter models on the horizon during the age of 1-billion-parameter models. I’d like to hear more about why you would make the latter leap but not the former.
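For concreteness, this is roughly the kind of extrapolation I mean by ‘seeing GPT-5 in GPT-2’: the parameter-count scaling law from Kaplan et al. (2020), L(N) = (N_c / N)^α_N. The constants below are the paper’s fitted values as I recall them, and which parameter count you choose to call ‘GPT-5’ or ‘a trillion-parameter model’ is purely illustrative on my part.

```python
# Rough sketch of scaling-law extrapolation, assuming the parameter-count power
# law from Kaplan et al. (2020): L(N) = (N_c / N) ** alpha_N, with loss measured
# in nats per token. Constants are the paper's fitted values as I recall them;
# the model labels are illustrative, not claims about any real system.

ALPHA_N = 0.076   # fitted exponent
N_C = 8.8e13      # fitted constant (non-embedding parameter count)


def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) at a given parameter count."""
    return (N_C / n_params) ** ALPHA_N


for label, n in [("~1.5B params (GPT-2 scale)", 1.5e9),
                 ("~175B params", 1.75e11),
                 ("~1T params", 1e12)]:
    print(f"{label}: predicted loss ~ {predicted_loss(n):.2f} nats/token")
```

A smooth curve like this is what justified expecting big gains from scale before they arrived; my claim is that the extrapolation in the list above is of the same basic character, just applied to capability trends that matter for safety rather than to loss.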
“The internal machinery that could make an ASI dangerous is the same machinery that makes it work at all.”
If this were true, then it should also be true that humans who are highly capable at achieving their long-term goals are necessarily bad people that cause problems for everybody. But I’ve met lots of counterexamples, e.g., highly capable people who are also good. I’d be interested in seeing something empirical on this.
I think you misunderstood this point. The position is not that capabilities advances and misalignment are synonymous. The claim is that capabilities are value-neutral. It matters how you use them! And currently we’re not sure how to get superhuman AIs to robustly use their capabilities for good. That’s the problem.
The human equivalent would be to say “Competent evil people, and competent good people, share the trait of competence.” It’s not that all powerful things are evil; it’s just that, if you’re looking out for evil things, the powerful ones are especially worth staying wary of (and you may not be able to infer someone’s goals from their actions if they happen to be pretty bad at getting what they want, which makes it hard to tell how good or evil they are without consulting other variables).
I honestly don’t feel qualified to touch the ML stuff, and so won’t; sorry!
Because we’re concerned, specifically, with ASI (the kind of thing that can do this kind of thing) rather than with LLMs as such. If it turned out there were very strong evidence that LLMs won’t ever be able to do this kind of thing (so far there isn’t), and that they’re not likely to accelerate capabilities in other paradigms (e.g., by automating coding tasks), I think people in general would be much less worried about them. (There are likely other threat models from LLMs that ought to be addressed as well, but these are the two things that spring to mind as ‘would make me personally much less worried about the things I’m worried about now’.)