In a world with an AI pause, good management is much more important than the pause length. If the pause is government-mandated and happens late, I would prefer a 1-year pause with, say, Buck Shlegeris in charge over a 3-year pause with the average AI lab lead in charge.
The reason is that most of the safety research during the pause will be done by AIs. In a pause scenario where >50% of resources are temporarily dedicated to safety and >10x uplift is technically plausible, the speed of research is mainly limited by the level of AI we trust and our ability to understand the AI’s findings. The world probably can’t pause indefinitely, and it may not be desirable to do so. So what is the basic strategy for a finite AI pause?
Use superhuman AIs that are as smart as possible to do safety research, but not so capable that they’re misaligned. This may require avoiding the latest-gen models, or advancing capabilities and using the next generation only for safety.
Use an elaborate monitoring framework, and incrementally apply the safety research generated to raise the maximum capability level we can run at without being sabotaged if models are misaligned.
Prevent capability advances from being stolen, so that no rogue actor can break out and threaten a decisive strategic advantage (DSA), forcing the pause to end.
The variance in AI speed will be huge as models get smarter and more efficient. Conditional on us needing a pause, I imagine that the situation will look something like: Agent-2 speeds up work by 10x and incurs 0.2%/year x-risk, Agent-3 has 20x speed and 1.0%/year x-risk, Agent-4 has 40x speed and 5.0%/year x-risk, and so on. This means there is huge value in advancing capabilities in ways that don’t reduce safety. For example, maybe alignment-specific training data doubles speed while preserving alignment, or a new architecture gets most of the benefits of “neuralese” without compromising monitorability.
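To make the tradeoff concrete, here is a minimal back-of-the-envelope sketch in Python using the hypothetical Agent-2/3/4 figures above (the speedups and per-year x-risk rates are illustrative assumptions from this paragraph, not estimates of any real system). It compares the effective research-years gained and the cumulative risk incurred by each tier over 1-year and 3-year pauses.

```python
# Illustrative back-of-the-envelope calculation using the hypothetical
# speedup / x-risk numbers from the paragraph above; all figures are
# assumptions for the sake of the example, not estimates of real systems.

# (name, research speedup vs. human-only, x-risk per year while running)
agents = [
    ("Agent-2", 10, 0.002),
    ("Agent-3", 20, 0.010),
    ("Agent-4", 40, 0.050),
]

def evaluate(pause_years):
    print(f"Pause length: {pause_years} year(s)")
    for name, speedup, risk_per_year in agents:
        effective_years = speedup * pause_years                    # human-researcher-equivalent years of work
        cumulative_risk = 1 - (1 - risk_per_year) ** pause_years   # yearly risk compounded over the pause
        print(f"  {name}: ~{effective_years} effective research-years, "
              f"~{cumulative_risk:.1%} cumulative x-risk")

evaluate(1)
evaluate(3)
```

On these made-up numbers, Agent-2 buys an effective research-year at roughly 0.02% x-risk while Agent-4 costs roughly 0.13% per effective research-year; the management problem is essentially choosing which point on this curve to target, and finding ways to shift the curve itself.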
I expect that in a pause, good vs bad management could make a >10x difference in rate of safety research progress, with some of this coming from targeting the appropriate risk level, some from better handling ambiguous evidence about safety of current models, and some from differentially advancing safe capabilities. In comparison, the amount of political will and diplomacy that will be required to make a pause 10x longer is immense.
Pause length is still valuable inasmuch as it gives more time for management to understand the situation, but is not the most important factor in whether the pause actually succeeds at solving alignment.
In a world with a fixed length pause initiated when AI is powerful enough to “help” with alignment research, I expect that immediately after the pause ends, everyone dies. This is the perspective from which a 3 year pause is 3 times as valuable as a 1 year pause, and doesn’t change much with management.
That is, you think alignment is so difficult that keeping humanity alive for 3 years is more valuable than the possibility of us solving alignment during the pause? Or that the AIs will sabotage the project in a way undetectable by management even if management is very paranoid about being sabotaged by any model that has shown prerequisite capabilities for it?
Conditional on the world deciding on a fixed length pause instead of a pause till the model is aligned, absolutely yes. Unconditionally, yes but with less confidence.
Ways this could be wrong:
We can pause early (before AIs pose significant risk) at little cost
We must pause early (AIs pose significant risk before they speed up research much). I think this is mostly ruled out by current evidence
Safety research inherently has to be done by humans because it’s less verifiable, even when capabilities research is automated
AI lab CEOs are good at managing safety research because their capabilities experience transfers (in this case I’d still much prefer Buck Shlegeris or Sam Altman over the US or Chinese governments)
It’s easy to pause indefinitely once everyone realizes AIs are imminently dangerous, kind of like the current situation with nuclear
Probably others I’m not thinking of
“We must pause early (AIs pose significant risk before they speed up research much). I think this is mostly ruled out by current evidence”

FYI, I currently would mainline guess that this is true. Also, I don’t get why current evidence says anything about it: current AIs aren’t dangerous, but that doesn’t really say anything about whether an AI that’s capable of speeding up superalignment or pivotal-act-relevant research by even 2x would be dangerous.
My view is that AIs are improving faster at research-relevant skills like SWE and math than they are increasing in misalignment (the rate of bad behaviors like reward hacking, the ease of eliciting an alignment faker, etc.) or covert sabotage ability, such that we would need a discontinuity in both to get serious danger by the time AIs provide a 2x speedup. There is as yet no scaling law for misalignment showing that it predictably gets worse in practice as capabilities improve.
The situation is not completely clear because we don’t have good alignment evals and could get neuralese any year, but the data are pointing in that direction. I’m not sure about research taste as the benchmarks for that aren’t very good. I’d change my mind here if we did see stagnation in research taste plus misalignment getting worse over time (not just sophistication of the bad things AIs do, but also frequency or egregiousness).
What benchmarks for research taste exist? If I understand correctly, Epoch AI’s evaluation showed that the AIs lack creativity in a sense, since the combinatorial problems at IMO 2024 and the hard problem at IMO 2025 weren’t solved by SOTA AI systems. In a similar experiment of mine, Grok 4 resorted to a BFS to do a construction. Unfortunately, the ARC-AGI-1 and ARC-AGI-2 benchmarks may be more about visual intelligence, and the promising ARC-AGI-3 benchmark[1] has yet to be finished.
Of course, the worst-case scenario is that misalignment comes hand-in-hand with creativity (e.g. because the AIs create some moral code which doesn’t adhere to the ideals of their human hosts).
[1] Which is known to contain puzzles.
“I would prefer a 1-year pause with, say, Buck Shlegeris in charge”

I think the best we can realistically do is something like a 1-year pause, and if done well it gives a good chance of success. As you say, 1 year with ~ASI will get a lot done. Perhaps some grand bargain where everyone pauses for one year in return for no more pauses.
Unfortunately, it will be extremely hard for some org not to push the “self optimize” button during this time. That’s why I would rather have as few leading AI labs as possible during this time.
I would go so far as to say I would rather have 1 year like that than 100 years with AI capabilities paused at current levels while alignment research progresses.