A group of people from MIRI have published a mostly good introduction to
the dangers of AI: The
Problem.
It is a step forward in improving the discussion of catastrophic risks
from AI.
I agree with much of what MIRI writes there. I strongly agree with their
near-term policy advice of prioritizing the creation of an off switch.
I somewhat disagree with their advice to halt (for a long time) progress
toward ASI. We ought to make preparations in case a halt turns out to be
important. But most of my hopes route through strategies that don’t
need a halt.
A halt is both expensive and risky.
My biggest difference with MIRI is about how hard it is to adequately
align an AI. Some related differences involve the idea of a pivotal act,
and the expectation of a slippery slope between human-level AI and ASI.
Important Agreement
There isn’t a ceiling at human-level capabilities.
This is an important truth, one that many people reject because they want
it not to be true.
It would be lethally dangerous to build ASIs that have the wrong
goals.
The default outcome if we’re careless about those goals might well be
that AIs conquer humans.
If you train a tiger not to eat you, you haven’t made it share your
desire to survive and thrive, with a full understanding of what that
means to you.
This is a good way to frame a key part of MIRI’s concern. We should be
worried that current AI company strategies look somewhat like this. But
the way that we train dogs seems like a slightly better analogy for how
AI training is likely to work in a few years. That’s not at all
sufficient by itself for us to be safe, but it has a much better track
record for generalized loyalty than training tigers.
Can We Stop Near Human-Level?
The development of systems with human-level generality is likely to
quickly result in artificial superintelligence (ASI)
This seems true for weak meanings of “likely” or “quickly”. That is
enough to scare me. But MIRI hints at a near-inevitability (or slippery
slope) that I don’t accept.
I predict that it will become easier to halt AI development as AI
reaches human levels, and continue to get easier for a bit after that.
(But probably not easy enough that we can afford to become complacent.)
Let’s imagine that someone produces an AI that is roughly as
intellectually capable as Elon Musk. Is it going to prioritize building
a smarter AI? I expect it will be more capable of evaluating the risks
than MIRI is today, due in part to it having better evidence than is
available today about the goals of that smarter AI. If it agrees with
MIRI’s assessment of the risk, wouldn’t it warn (or sabotage?)
developers instead? Note that this doesn’t require the Musk-level AI to
be aligned with humans—it could be afraid that the smarter AI would be
unaligned with the Musk-level AI’s goals.
There are a number of implicit ifs in that paragraph, such as whether
progress produces a Musk-level AI before producing an ASI. But I don’t
think my weak optimism here requires anything far-fetched. Even if the
last AI before we reach ASI is less capable than Musk, it will have
significant understanding of the risks, and will likely have a good
enough track record that developers will listen to its concerns.
[How hard would it be to require AI companies to regularly ask their
best AIs how risky it is to build their next AI?]
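As a rough illustration of how little machinery such a requirement would need, here is a minimal sketch in Python. The model interface (query_frontier_model), the prompt, and the logging format are hypothetical stand-ins rather than any real lab’s API; the point is only that a recurring risk-elicitation query with an auditable log is cheap to implement.

```python
# Hypothetical sketch of a periodic risk-elicitation requirement.
# query_frontier_model() is a stand-in for whatever interface a lab
# uses to query its current best model; it is not a real API.
import json
from datetime import datetime, timezone

RISK_PROMPT = (
    "You are the most capable model this lab currently deploys. "
    "Assess the risks of training our next, more capable model. "
    "Give a rough probability of catastrophic misalignment, your main "
    "reasons, and any precautions you would recommend."
)

def query_frontier_model(prompt: str) -> str:
    """Stand-in for a real model call; a lab would wire in its own client."""
    raise NotImplementedError("connect this to the lab's internal model API")

def record_risk_assessment(path: str = "risk_assessments.jsonl") -> dict:
    """Ask the current best model about next-model risk and log the answer."""
    report = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": RISK_PROMPT,
        "response": query_frontier_model(RISK_PROMPT),
    }
    with open(path, "a") as f:
        f.write(json.dumps(report) + "\n")
    return report
```

The substantive part of the requirement would be procedural rather than technical: running a query like this before each major training run and preserving the logged answers for auditors.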
I suspect that some of the sense of inevitability comes from the
expectation that the arguments for a halt are as persuasive now as they
will ever be.
On the contrary, I see at least half of the difficulty in slowing progress
toward ASI as coming from the average voter and average politician believing
that AI progress is mostly hype. Even superforecasters have tended to
dismiss AI progress as
hype.
I’m about 85% confident that before we get an AI capable of world
conquest, we’ll have an AI that is capable of convincing most voters
that AI is powerful enough to be a bigger concern than nuclear weapons.
MIRI is focused here on dispelling the illusion that it will be
technologically hard to speed past human intelligence levels. The main
point of my line of argument is that we should expect some changes in
willingness to accelerate, hopefully influenced by better analyses of
the risks.
I’m unsure whether this makes much difference for our strategy. It’s
hard enough to halt AI progress that we’re more likely to achieve it
just in the nick of time than too early. The main benefit of thinking
about a halt that starts when AI is slightly better than human-level is that it
opens up better possibilities for enforcing the halt than we’ll
envision if we imagine that the only time for a halt is before AI
reaches human levels.
I’m reminded of the saying “You can always count on Americans to do
the right thing—after they’ve tried everything else.”
Alignment Difficulty
MIRI’s advice depends somewhat heavily on the belief that we’re not at
all close to solving alignment, whereas I’m about 70% confident that we
already have the basic ideas needed for alignment, and that a large
fraction of the remaining difficulty involves distinguishing the good
ideas from the bad ones and assembling as many of the good ideas as we
can afford into an organized strategy. (I don’t think this is out of
line with expert opinion on the subject. However, the large range of
expert opinions worries me a good deal.)
[The Problem delegates most discussion of alignment difficulty to the
AGI Ruin page, which is a slightly
improved version of Eliezer’s AGI Ruin: A List of
Lethalities. This
section of my post is mostly a reply to that.]
Corrigibility is anti-natural to consequentialist reasoning
No! It only looks that way because you’ve tried to combine
corrigibility with a conflicting utility function.
The second course is to build corrigible AGI which doesn’t want
exactly what we want, and yet somehow fails to kill us and take over
the galaxies despite that being a convergent incentive there.
You’re trying to take a system implicitly trained on lots of
arithmetic problems until its machinery started to reflect the common
coherent core of arithmetic, and get it to say that as a special case
222 + 222 = 555.
That describes some attempts at corrigibility, in particular those which
give the AI additional goals that are not sub-goals of corrigibility.
Max Harms’ CAST avoids
this mistake.
Corrigibility creates a basin of
attraction
that increases the likelihood of getting a good enough result on the
first try, and mitigates MIRI’s concerns about generalizing out of
distribution.
There are still plenty of thorny implementation details, and concerns
about who should be allowed to influence a corrigible AGI. But it’s
hard to see how a decade of further research would produce new insights
that can’t be found sooner.
Another way that we might be close to understanding how to create a safe
ASI is Drexler’s
CAIS, which roughly means keeping AI goals very short-term and tool-like.
I’m guessing that MIRI’s most plausible objection is that AIs created
this way wouldn’t be powerful enough to defend us against more agentic
AIs that are likely to be created. MIRI is probably wrong about that
defense, due to some false assumptions about some of the relevant
coordination problems.
MIRI often talks about pivotal acts such as melting all GPUs. I expect
defense against bad AIs to come from pivotal
processes
that focus on persuasion and negotiation, and to require weaker
capabilities than what’s needed for melting GPUs. Such pivotal
processes should be feasible earlier than I’d expect an AI to be able
to melt GPUs.
How does defending against bad AI with the aid of human-level CAIS
compare to MIRI’s plan to defend by halting AI progress earlier? Either
way, I expect the solution to involve active enforcement by leading
governments.
The closer the world gets to ASI, the better surveillance is needed to
detect and respond to dangers. And maybe more regulatory power is
needed. But I expect AI to increasingly help with those problems, such
that pivotal processes which focus on global agreements to halt certain
research become easier. I don’t see a clear dividing line between
proposals for a halt now, and the pivotal processes that would defend us
at a later stage.
I’ll guess that MIRI disagrees, likely due to assigning a much higher
probability than I do to a large leap in AI capabilities that produces a
world-conquering agent before human-level CAIS has enough time to
implement defenses.
The CAIS strategy is still rather tricky to implement. CAIS development
won’t automatically outpace the development of agentic AI. So we’ll
need either some regulation, or a further fire alarm that causes AI
companies to become much more cautious.
It is tricky to enforce a rule that prohibits work on more agentic AIs,
but I expect that CAIS systems of 2027 will be wise enough to do much of
the needed evaluation of whether particular work violates such a rule.
Corrigibility and CAIS are the two clearest reasons why I’m cautiously
optimistic that non-catastrophic ASI is no harder than the Manhattan and
Apollo projects. Those two reasons make up maybe half of my reasoning
here. I’ve focused on them because the other reasons involve a much
wider range of weaker arguments that are harder to articulate.
Alas, there’s a large gap between someone knowing the correct pieces of
a safe approach to AI, and AI companies implementing them. Little in
current AI company practices inspires confidence in their ability to
make the right choices.
Conclusion
Parts of The Problem are unrealistically pessimistic. Yet the valid
parts of their argument are robust enough to justify being half as
concerned as they are. My policy advice overlaps a fair amount with
MIRI’s advice:
Creating an off switch should be the most urgent policy task.
Secondly, require AI companies to regularly ask their best AIs how risky
it is to create their next AI. Even if it only helps a little, the
cost/benefit ratio ought to be great.
Policy experts ought to be preparing for ways to significantly slow or
halt a subset of AI development for several years. Ideally this should
focus on restricting agentic AI, while exempting CAIS. The political
climate has a decent chance of becoming ripe for this before the end of
the decade. The timing is likely to depend heavily on what accidents AIs
cause.
The details of such a halt should depend somewhat on advice given by AIs
near the time of the halt.
None of these options are as safe as I would like.
A halt carries its own serious risks: black market development without
safety constraints, and the possibility that when development resumes it
will be faster and less careful than continued cautious progress would
have been. [These concerns deserve their own post, but briefly: halts
are unstable equilibria that may make eventual development more
dangerous rather than less.]
When I started to write this post, I planned to conclude that I mostly
agreed with MIRI’s policy advice. But now I’ve decided that the
structural similarities are masking a dramatic difference in expected
cost. I anticipate that the tech industry will fight MIRI’s version
much more strongly than they will resist mine. That leaves me with
conflicting feelings about whether to treat MIRI’s position as allied
versus opposed to mine.
I expect that as we get more experience with advanced AIs, we will get
more information that is relevant to deciding whether a halt is
desirable. Let’s not commit ourselves so strongly on any particular
policy that we can’t change our minds in response to new evidence.
P.S. I asked Gemini 2.5 Pro to guess how Eliezer would react to Max
Harms’ CAST. It was sufficiently confused about CAST that I gave up;
it imagined that the key advantage was that the AI had a narrow goal.
Claude Opus 4.1 did better—I needed to correct one misunderstanding of
CAST, then it gave some non-embarrassing guesses.
[Cross-posted from my blog.]