I take ‘backfire’ to mean ‘get more of the thing you don’t want than you would otherwise, as a direct result of your attempt to get less of it.’ If you mean it some other way, then the rest of my comment isn’t really useful.
Change of the winner
Secret projects under the moratorium are definitely on the list of things to watch out for, and the tech gov team at MIRI has a huge suite of countermeasures they’re considering for this, some of which are sketched out or gestured toward here.
It actually seems to me that an underground project is more likely under the current regime, because there aren’t really any meaningful controls in place (you might even consider DeepSeek just such a project, given that there’s some evidence [I really don’t know what I believe here, and it doesn’t seem useful to argue; just using them as an example] that they stole IP and smuggled chips).
The better your moratorium is, the less likely you are to get wrecked by a secret project (because the fewer resources they’ll be able to gather) before you can satisfy your exit conditions.
So p(undergroundProjectExists) goes down as a result of moratorium legislation, but p(undergroundProjectWins) may go up if your moratorium sucks. (I actually think this is still pretty unclear, owing to the shape of the classified research ecosystem, which I talk more about below.)
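To make the decomposition explicit, here is a toy sketch (the numbers are mine and purely illustrative, not anything MIRI has modeled): overall risk from a secret project is roughly p(exists) × p(wins | exists), a moratorium pushes the first factor down, and its quality determines what happens to the second.

```python
# Toy decomposition of secret-project risk (illustrative numbers only).
def secret_project_risk(p_exists: float, p_wins_given_exists: float) -> float:
    """Probability that a secret project both exists and ends up 'winning'."""
    return p_exists * p_wins_given_exists

# No moratorium: projects face no controls, but must outrun well-resourced open labs.
no_moratorium = secret_project_risk(p_exists=0.5, p_wins_given_exists=0.3)      # 0.15

# Strong moratorium: far fewer secret projects, and the ones that exist are resource-starved.
strong_moratorium = secret_project_risk(p_exists=0.1, p_wins_given_exists=0.2)  # 0.02

# Weak moratorium: fewer secret projects, but any that do exist face no open competition.
weak_moratorium = secret_project_risk(p_exists=0.2, p_wins_given_exists=0.6)    # 0.12

print(no_moratorium, strong_moratorium, weak_moratorium)
```

The sign of the net effect depends on how leaky the moratorium is, which is what I mean by ‘still pretty unclear’.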
This is, imo, your strongest point, and is a principal difference between the various MIRI plans and the plans of other people we talk to (“Once you get the moratorium, do you suppose there must be a secret project and resolve to race against them?” MIRI answers no; some others answer yes).
Intensified race...
You say: “a number of AI orgs would view the threat of prohibition on par with the threat of a competitor winning”. I don’t think this effect is stronger in the moratorium case than in the ‘we are losing the race and believe the finish line is near’ case, and if this kind of behavior is going to happen at all, sooner is better (assuming we don’t expect the safety situation to improve), because the systems themselves are less powerful, the risks aren’t as big, the financial stakes aren’t as palpable, etc. I agree with something like “looming prohibition will cause some reckless action to happen sooner than it would otherwise”, but not with the stronger claim that this action would be created by the prohibition.
I also think the threat of a moratorium could cause companies to behave more sanely in various ways, so that they’re not caught on the wrong side of the law in the future worlds where some ‘moderate’ position wins the political debate. I don’t think voluntary commitments are trustworthy/sufficient, but I could absolutely see RSPs growing teeth as a way for companies to generate evidence of effective self-regulation, then deploy that evidence to argue against the necessity of a moratorium.
It’s just really not clear to me how this set of interrelated effects would net out, much less that it’s an obvious way pushing through a moratorium might backfire. My best guess is that these pre-moratorium cooling effects basically win out and compress the period of greatest concern, while also reducing its intensity.
Various impairments for AI safety research
Huge amounts of classified research exist. There are entire parallel academic ecosystems for folks working on military and military-adjacent technologies. These include work on game theory, category theory, Conway’s Game of Life, genetics, corporate governance structures, and other relatively-esoteric things beyond ‘how make bomb go boom’. Scientists in ~every field, and mathematicians in any branch of the discipline, can become DARPA PMs, and access to (some portion of) this separate academic canon is considered a central perk of the job. I expect gaining the ability to work on the classified parts of AI safety under a moratorium will be about as difficult as qualifying to work at Los Alamos and the like.
As others have pointed out, not all kinds of research would need to be classified-by-default, under the plan. Mostly this would be stuff regarding architecture, algorithms, and hardware.
There are scarier worlds where you would want to classify more of the research, and there are reasonable disagreements about what should/shouldn’t be classified, but even then, you’re in a Los Alamos situation, and not in a Butlerian Jihad.
This all makes sense. I do think underground orgs exist; right now their chances of beating the leaders are not too high.
There are even “intermediate status orgs”; e.g. we know about Ilya’s org because he told us, but we don’t know much about it.
The post is for people to ponder how likely all this is to backfire in these (or other) fashions. I am not optimistic personally (all these estimates depend on the difficulty of achieving ASI; my personal estimates are that ASI is relatively easy to achieve and that timelines are really short, and the trickier the task of creating ASI is, the better the chances that such a plan doesn’t backfire).
If one thinks about Eliezer’s original threat model, someone launching true recursive self-improvement in their garage… the only reason we don’t think about that much at the moment is that the high-compute orgs are moving fast; the moment they stop, that original threat model will start coming back into play more prominently.
I think positions here come out in four quadrants (though it is of course spectral), based on how likely you think Doom is and how easy or hard (that is, how resource-intensive) you expect ASI development to be.
ASI Easy/Doom Very Unlikely: Plan obviously backfires; you could have had nice things, but were too cautious!
ASI Hard/Doom Very Unlikely: Unlikely to backfire, but you might have been better off pressing ahead, because there was nothing to worry about anyway.
ASI Easy/Doom Very Likely: We’re kinda fucked anyway in this world, so I’d want to have pretty high confidence it’s the world we’re in before attempting any plan optimized for it. But yes, here it looks like the plan backfires (in that we’re selecting even harder than in default-world for power-seeking, willingness to break norms, and non-transparency in coordinating around who gets to build it). My guess is this is the world you think we’re in. I think this is irresponsibly fatalistic and also unlikely, but I don’t think it matters to get into it here.
ASI Hard/Doom Very Likely: Plan plausibly works.
I expect near-term ASI development to be resource-intensive, or to rely on not-yet-complete resource-intensive research. I remain concerned about the brain-in-a-box scenarios, but it’s not obvious to me that they’re much more pressing in 2025 than they were in 2020, except in ways that are downstream of LLM development (I haven’t looked super closely), which is more tractable to coordinate action around anyway, and that action plausibly leads to decreased risk on the margin even if the principal threat is from a brain-in-a-box. I assume you disagree with all of this.
I think your post is just aimed at a completely different threat model than the book even attempts to address, and I think you would be having more of the discussion you want to have if you had opened by talking explicitly about your actual crux (which you seemed to know was the real crux ahead of time), rather than inciting an object-level discussion colored by such a powerful background disagreement. As-is, it feels like you just disagree with the way the book is scoped, and would rather talk to Eliezer about brain-in-a-box than talk to William about the tentative-proposals-to-solve-a-very-different-problem.
Yeah, I think my position is ASI is easy/Doom likelihood is medium.
And, more specifically, I think that a good part of doom likelihood is due to people seeming to disagree super radically with each other about the details of anything related to AI existential safety.
This is an empirical observation; I don’t have a good model for where this radical disagreement comes from. But this seeming inability to even start narrowing the disagreements down is not a good sign; first of all, it means that people are likely to keep disagreeing sharply with each other even when working together under the umbrella of a possible ban treaty.
So, without a ban, this seems to suggest that the views of the “race winners” on AI safety are very unpredictable (that’s not good; it is very difficult to predict what would happen). And with a ban, this seems to suggest a high likelihood of ideologically driven rebellions against the ban (perhaps covert rather than overt, given the threat of armed force).
If people were able to talk to each other and converge on something, I would expect the doom likelihood to be reducible to more palatable levels. But without being able to narrow the disagreements down somewhat, the doom chances have to be quite significant.
I think targeting specific policy points rather than highlighting your actual crux makes this worse, not better
The book is neutral about timelines, or whether ASI is easy or not. It specifically calls those “hard issues” which it is going to sidestep.
So it would be difficult to make this a crux. The book implies that its recommendations don’t depend on one’s position on those “hard issues”.
If, instead, its recommendations depended on specific views about timelines and about the difficulty of achieving ASI, that would be different.
As it is, the only specific dependency is the assumption that AI safety is very hard (in this sense, I did mention that I am speaking from the viewpoint of people who think it’s hard, but not very hard: the “10%-90% doom probability” range).
I agree that humans are “disaster monkeys” poorly equipped to handle this, but I think they are also poorly equipped to handle the ban. They barely handle nuclear controls, and this one is way more difficult. I doubt the P(doom) conditional on the ban will be lower than P(doom) conditional on no ban. (Here I very much agree with Eliezer’s methodology of not talking of absolute values of P(doom), but of comparing conditional P(doom) values.)
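Writing that comparison out (my notation, just restating the paragraph above), the decision-relevant quantity is

$$\Delta = P(\text{doom} \mid \text{ban}) - P(\text{doom} \mid \text{no ban}),$$

and my claim is that I don’t see a strong reason to expect $\Delta < 0$, given how poorly we handle even nuclear controls.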
For most, LLMs are the salient threat vector at this time, and the practical recommendations in the book are oriented toward that. You did not say in your post ‘I believe that brain-in-a-box is the true concern; the book’s recommendations don’t work for this, because Chapter 13 is mostly about LLMs.’ That would be a different post (and redundant with a bunch of Steve Byrnes stuff).
Instead, you completely buried the lede and made a post inviting people to talk in circles with you unless they magically divined your true objection (which is only distantly related to the topic of the post). That does not look like a good faith attempt to get people on the same page.
I am agnostic. I don’t think humans necessarily need to modify the GPT architecture; I think GPT-6 would be perfectly capable of doing that for them.
But I also think that those “brains-in-the-box” systems will use (open-weights or closed-weights) LLMs as important components. It’s a different ball game now; we are more than halfway toward “true self-improvement”, because one can incorporate open-weight LLMs (until the system progresses so much that they can be phased out).
The leading LLM systems are starting to provide real assistance in AI research, and even their open versions are pretty good at being components of the next big thing. So yes, this is purely from LLM trends (lab insiders are tweeting that chatbots are saturated and that their current focus is how much LLMs can assist in AI research, and plenty of people express openness to modifying the architecture at will). I don’t know if we are going to continue to call them LLMs, but it does not matter; there is no strict boundary between LLMs and what is being built next.
I don’t want to elaborate further on the technical details (what I am saying above is reasonably common knowledge, but if I start giving more technical detail, I might say something actually useful for acceleration efforts).
But yes, I am saying that one should expect the trends to be faster than what follows from previous LLM trends, because of how the labs are using them more and more for AI research. METR doubling periods should start shrinking soon.
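To illustrate what shrinking doubling periods would look like quantitatively, here is a toy extrapolation (all numbers are assumptions for illustration; none of them come from METR’s actual data):

```python
# Toy comparison: task-horizon growth with a fixed vs. a shrinking doubling period.
# All parameters are illustrative assumptions, not METR estimates.
def horizon_after(months: float, start_hours: float, doubling_months: float,
                  shrink: float = 1.0) -> float:
    """Task horizon (hours) after `months`, where the doubling period is multiplied
    by `shrink` after each completed doubling (shrink=1.0 gives ordinary exponential growth)."""
    hours, period, t = start_hours, doubling_months, 0.0
    while t + period <= months:
        t += period
        hours *= 2
        period *= shrink
    return hours

print(horizon_after(24, start_hours=1, doubling_months=6))              # 16 hours: fixed 6-month doubling
print(horizon_after(24, start_hours=1, doubling_months=6, shrink=0.8))  # 128 hours: period shrinks 20% per doubling
```

Even a modest per-doubling shrink compounds quickly, which is why the shape of the curve matters so much for these timelines.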