This is a continuation of my review of IABIED. It’s intended for
audiences who already know a lot about AI risk debates. Please at least
glance at my main layman-oriented
review
before reading this.
Eliezer and Nate used to argue about AI risk using a paradigm that
involved a pretty sudden foom, and which viewed values through a utility
function lens. I’ll call that the MIRI paradigm (note: I don’t have a
comprehensive description of the paradigm). In IABIED, they’ve tried to
adopt a much broader paradigm that’s somewhat closer to that of more
mainstream AI researchers. Yet they keep sounding to me like they’re
still thinking within the MIRI paradigm.
Predicting Progress
More broadly, IABIED wants us to believe that progress has been slow in
AI safety. From IABIED’s online
resources:
When humans try to solve the AI alignment problem directly … the
solutions discussed tend to involve understanding a lot more about
intelligence and how to craft it, or craft critical components of it.
That’s an endeavor that human scientists have made only a small amount
of progress on over the past seventy years. The kinds of AIs that can
pull off a feat like that are the kinds of AIs that are smart enough
to be dangerous, strategic, and deceptive. This high level of
difficulty makes it extremely unlikely that researchers would be able
to tell correct solutions from incorrect ones, or tell honest
solutions apart from traps.
I’m not at all convinced that the MIRI paradigm enables a useful view
of progress, or a useful prediction of the kinds of insights that are
needed for further progress.
I asked a few other folks at UAI who had been in the field for twenty
years to estimate the same things, and they roughly agreed—about
5-10% of the distance has been covered in that time, without
noticeable acceleration. It would be useful to survey senior experts
in other areas of AI, to get related estimates for their areas. If
this 5-10% estimate is typical, as I suspect it is, then an outside
view calculation suggests we probably have at least a century to go,
and maybe a great many centuries, at current rates of progress.
Robin wrote that just as neural network enthusiasts, who mostly gathered
at a different conference, were noticing an important
acceleration.
I expect that Robin is correct that the research on which he based his
outside view would have taken at least a century to reach human levels.
Or as Claude puts it:
The real problem wasn’t just selection bias but paradigm
blindness. UAI folks were measuring progress toward human-level
reasoning via elegant probabilistic models. They didn’t anticipate
that “brute force” neural networks would essentially bypass their
entire research program.
The research that the IABIED authors consider to be very hard has a
similar flavor to the research that Robin was extrapolating.
Robin saw Pearl’s
Causality as an
important AI advance, whereas I don’t see it as part of AI. IABIED
seems to want more advances such as Logical
Induction, which I’m
guessing are similarly marginal to AI.
Superalignment
The parts of IABIED that ought to be most controversial are pages 189 to
191, where they reject the whole idea of superalignment that AI
companies are pursuing.
IABIED claims that AI companies use two different versions of the
superalignment idea: a weak one that only involves using AI to
understand whether an AI is aligned, and a strong one which uses AI to
do all the alignment work.
I’m disturbed to hear that those are what AI companies have in mind. I
hope there’s been some communication error here, but it seems fairly
plausible that the plans of AI companies are this sketchy.
Weak Superalignment
In the case of weak superalignment: We agree that a relatively
unintelligent AI could help with “interpretability research,” as
it’s called. But learning to read some of an AI’s mind is not a plan
for aligning it, any more than learning what’s going on inside atoms
is a plan for making a nuclear reactor that doesn’t melt down.
I find this analogy misleading. It reflects beliefs about what
innovations are needed for safety that are likely mistaken. I expect
that AI safety today is roughly at the stage that AI capabilities were
at around 2000 to 2010: much more than half of the basic ideas that are
needed are somewhat well known, but we’re missing the kind of evidence
that we need to confirm it, and a focus on the wrong paradigms makes it
hard to notice the best strategy.
In particular, it looks like we’re close enough to being able to
implement corrigibility that the largest obstacle involves being able to
observe how corrigible an AI is.
Interpretability is part of what we might need to distinguish paradigms.
Strong Superalignment
I’m not going to advocate strong superalignment as the best strategy,
but IABIED dismisses it too confidently. They say that we “can’t trust
an AI like that, before you’ve solved the alignment problem”.
There are a wide variety of possible AIs and methods of determining what
they can be trusted to do, in much the same way that there are a wide
variety of contexts in which humans can be trusted. I’m disappointed
that IABIED asserts an impossibility without much of an argument.
I searched around, and the MIRI paper Misalignment and
Catastrophe
has an argument that might reflect IABIED’s reasoning:
We argue that AIs must reach a certain level of goal-directedness and
general capability in order to do the tasks we are considering, and
that this is sufficient to cause catastrophe if the AI is misaligned.
The paper has a coherent model of why some tasks might require
dangerously powerful AI. Their intuitions about what kind of tasks are
needed bear little resemblance to my intuitions. Neither one of us seems
to be able to do much better than reference class tennis at explaining
why we disagree.
Another Version of Superalignment
Let me propose a semi-weak version of superalignment.
I like how IABIED divides intelligence into two concepts: prediction and
steering.
I see strong hints, from how AI has been developing over the past couple
of years, that there’s plenty of room for increasing the predictive
abilities of AI, without needing much increase in the AI’s desire to do anything other than predict.
Much of the progress (or do I mean movement in the wrong direction?)
toward more steering abilities has come from efforts that are
specifically intended to involve steering. My intuition says that AI
companies are capable of producing AIs that are better at prediction but
still as myopic as the more myopic of the current leading AIs.
What I want them to aim for is myopic AIs with prediction abilities that
are superhuman in some aspects, while keeping their steering abilities
weak.
I want to use those somewhat specialized AIs to make better than human
predictions about which AI strategies will produce what results. I
envision using AIs that have wildly superhuman abilities to integrate
large amounts of evidence, while being at most human-level in other
aspects of intelligence.
I foresee applying these AIs to problems of a narrower scope than
inventing something comparable to nuclear physics.
If we already have the ideas we need to safely reach moderately
superhuman AI (I’m maybe 75% confident that we do), then better
predictions plus interpretability are roughly what we need to
distinguish the most useful paradigm.
IABIED likens our situations to alchemists who are failing due to not
having invented nuclear physics. What I see in AI safety efforts
doesn’t look like the consistent failures of alchemy. It looks much
more like the problems faced by people who try to create an army that
won’t stage a coup. There are plenty of tests that yield moderately
promising evidence that the soldiers will usually obey civilian
authorities. There’s no big mystery about why soldiers might sometimes
disobey. The major problem is verifying how well the training
generalizes out of distribution.
IABIED sees hope in engineering humans to be better than Pearl or von
Neumann at generating new insights. Would Pearl+-level insights create
important advances in preventing military coups? I don’t know. But
it’s not at all the first strategy that comes to mind.
One possible crux with superalignment is whether AI progress can be
slowed at around the time that recursive improvement becomes possible.
IABIED presumably says no. I expect AIs of that time will provide
valuable advice. Maybe that advice will just be about how to shut down
AI development. But more likely it will convince us to adopt a good
paradigm that has been lingering in neglect.
In sum, IABIED dismisses superalignment much too quickly.
Heading to ASI?
I have some disagreement with IABIED’s claim about how quickly an ASI
could defeat humans. But that has little effect on how scared we ought
to be. The main uncertainty about whether an ASI could defeat humans is
about what kind of AI would be defending humans.
Our concern is for what comes after: machine intelligence that is
genuinely smart, smarter than any living human, smarter than humanity
collectively.
They define superintelligence (ASI) as an AI that is “a mind much more
capable than any human at almost every sort of steering and prediction
problem”. They seem to write as if “humanity collectively” and “any
human” were interchangeable. I want to keep those concepts distinct.
If they just meant smarter than the collective humanity of 2025, then I
agree that they have strong reason to think we’re on track to get there
in the 2030s. Prediction markets are forecasting a 50% chance of the
weaker meaning of ASI before
2035. I
find it easy to imagine that collective humanity of 2025 is sufficiently
weak that an ASI would eclipse it a few years after reaching that weak
meaning of ASI.
But IABIED presumably wants to predict an ASI that is smarter than the
collective humanity at the time the ASI is created. I predict that
humanity then will be significantly harder to defeat than the collective
humanity of 2025, due to defenses built with smarter-than-a-single-human
AI.
So I don’t share IABIED’s confidence that we’re on track to get
world-conquering AI. I expect it to be a close call as to whether
defenses improve fast enough.
IABIED denies that their arguments depend on fast takeoff, but I see
takeoff speed as having a large influence on how plausible their
arguments are. An ASI that achieves god-like powers within days of
reaching human-level intelligence would almost certainly be able to
conquer humanity. If AI takes a decade to go from human level to clearly
but not dramatically smarter than any human at every task, then it looks
unlikely that that ASI will be able to conquer humanity.
I expect takeoff to be fast enough that IABIED’s perspective here
won’t end up looking foolish, but will look overconfident.
I expect the likelihood of useful fire alarms to depend strongly on
takeoff speed. I see a better than 50% chance that a fire alarm will
significantly alter AI policy. Whereas if I believed in MIRI’s version
of foom, I’d put that at more like 5%.
Shard Theory vs Utility Functions
The MIRI paradigm models values as being produced by a utility function,
whereas some other researchers prefer shard theory. IABIED avoids
drawing attention to this, but it still has subtle widespread influences
on their reasoning.
Neither of these models are wrong. Each one is useful for a different
set of purposes. They nudge us toward different guesses about what
values AIs will have.
The utility function model of searching through the space of possible
utility functions increases the salience of alien minds.
Whereas shard theory comes closer to modeling the process that generated
human minds and which generates the values of existing AIs.
Note that there are strong arguments that AIs will want to adopt value
systems that qualify as utility functions as they approach god-like
levels of rationality. I don’t see that being very relevant to what I
see as the period of acute risk, when I expect the AIs to have
shard-like values that aren’t coherent enough to be usefully analyzed
as utility functions. Coherence might be
hard.
IABIED wants us to be certain that a small degree of misalignment will
be fatal. As Will MacAskill put
it:
I think Y&S often equivocate between two different concepts under the
idea of “misalignment”:
Imperfect alignment: The AI doesn’t always try to do what the
developer/user intended it to try to do. 2. Catastrophic misalignment:
The AI tries hard to disempower all humanity, insofar as it has the
opportunity.
I see a broader uncertainty. Many approaches to AI, including current
AIs, create conflicting goals. If such an AI becomes superhuman, it
seems likely to resolve those conflicts in ways that depend on many
details of how the AI works. Some of those resolution methods are likely
to work well. We should maybe pay more attention to whether we can
influence which methods an AI ends up using.
Asimov’s Three Laws illustrate a mistake that makes imperfect alignment
more catastrophic than it needs to be. He provided a clear rule for
resolving conflicts. Why did he explicitly give corrigibility a lower
priority than the First Law?
My guess is that imperfect alignment of current AIs would end up working
like a rough approximation of the moral
parliament.
I’m guessing, with very low confidence, that that means humans get a
slice of the lightcone.
Natural selection has difficulty finding the genes that cause a human
to keep deals only in the cases where our long-term reputation is a
major consideration. It was easier to just evolve an instinctive
distaste for lying and cheating.
All of the weird fiddly cases where humans sometimes keep a promise
even when it’s not actually beneficial to us are mainly evidence about
what sorts of emotions were most helpful in our tribal ancestral
environment while also being easy for evolution to encode into
genomes, rather than evidence about some universally useful cognitive
step.
This section convinced me that I’d been thoughtlessly overconfident
about making deals with AIs. I’ve now switched to being mildly
pessimistic about them keeping deals.
Yet the details of IABIED’s reasoning seem weird. I’m pretty sure that
human honor is mostly a cultural phenomenon, combined with a modest
amount of intelligent awareness of the value of reputation.
Cultural influences are more likely to be transmitted to AI values via
training data than are genetic influences.
But any such honor is likely to be at least mildly context dependent,
and the relevant context is novel enough to create much uncertainty.
More importantly, honor depends on incentives in complex ways. Once
again, IABIED’s conclusions seem pretty likely given foom (which
probably gives one AI a decisive advantage). Those conclusions seem
rather unlikely in a highly multipolar scenario resulting from a fairly
slow takeoff, where the AI needs to trade with a diverse set of other
AIs. (This isn’t a complete description of the controversial
assumptions that are needed in order to analyze this).
I predict fast enough takeoff that I’m worried that IABIED is correct
here. But I have trouble respecting claims of more than 80% confidence
here. IABIED sounds more than 80% confident.
I agree that if humans, with no further enhancement, build ASI, the
risks are unacceptably high. But we probably are on track for something
modestly different from that: ASI built by humans who have been
substantially enhanced by AIs that are almost superhuman.
I see AI companies as being reckless enough that we ought to be unsure
of their sanity. Whereas if I fully agreed with IABIED, I’d say
they’re insane enough that an ideal world would shut down and disband
those companies completely.
This post has probably sounded too confident. I’m pretty confident that
I’m pointing in the general direction of important flaws in IABIED’s
arguments. But I doubt that I’ve done an adequate job of clarifying
those flaws. I’m unsure how much of my disagreement is due to
communication errors, and how much is due to substantive confusion
somewhere.
IABIED is likely correct that following their paradigm would require
decades of research in order to produce much progress. I’m mildly
pessimistic that a shutdown which starts soon could be enforced for
decades. IABIED didn’t do much to explain why the MIRI paradigm is good
enough that we should be satisfied with it.
Those are some of the reasons why I consider it important to look for
other paradigms. Alas, I only have bits and pieces of a good paradigm in
mind.
It feels like any paradigm needs to make some sort of bet on the speed
of takeoff. Slow takeoff implies a rather different set of risks than
does foom. It doesn’t look like we can find a paradigm that’s fully
appropriate for all takeoff speeds. That’s an important part of why
it’s hard to find the right paradigm. Getting this right is hard, in
part because takeoff speed can be influenced by whether there’s a race,
and by whether AIs recommend slowing down at key points.
I expect that a good paradigm would induce more researchers to focus on
corrigibility, whereas the current paradigms seem to cause neglect via
either implying that corrigibility is too easy to require much thought,
or too hard for an unenhanced human to tackle.
I’m 60% confident that the path to safety involves a focus on
corrigibility. Thinking clearly about corrigibility seems at most
Pearl-level hard. It likely still involves lots of stumbling about to
ask the right questions and recognize good answers when we see them.
I’m disappointed that IABIED doesn’t advocate efforts to separate AIs
goals from their world models, in order to make it easier to influence
those goals. Yann LeCun has cost module that is separate from the AI’s
world
model. It
would be ironic if LeCun ended up helping more than the AI leaders who
say they’re worried about safety.
I approve of IABIED’s attempt to write for a broader audience than one
that would accept the MIRI paradigm. It made the book a bit more
effective at raising awareness of AI risks. But it left a good deal of
confusion that an ideal book would have resolved.
P.S. - My AI-related investments did quite well in September, with the
result that my AI-heavy portfolio was up almost 20%. I’m unsure how
much of that is connected to the book—it’s not like there was a
sudden surprise on the day the book was published. But my intuition says
there was some sort of connection.
Eliezer continues to be more effective at persuading people that AI will
be powerful, than at increasing people’s p(doom). But p(powerful) is
more important to update than p(doom), as long as your p(doom) doesn’t
round to zero.
Or maybe he did an excellent job of timing the book’s publication for
when the world was ready to awaken to AI’s power.
IABIED: Paradigm Confusion and Overconfidence
Link post
This is a continuation of my review of IABIED. It’s intended for audiences who already know a lot about AI risk debates. Please at least glance at my main layman-oriented review before reading this.
Eliezer and Nate used to argue about AI risk using a paradigm that involved a pretty sudden foom, and which viewed values through a utility function lens. I’ll call that the MIRI paradigm (note: I don’t have a comprehensive description of the paradigm). In IABIED, they’ve tried to adopt a much broader paradigm that’s somewhat closer to that of more mainstream AI researchers. Yet they keep sounding to me like they’re still thinking within the MIRI paradigm.
Predicting Progress
More broadly, IABIED wants us to believe that progress has been slow in AI safety. From IABIED’s online resources:
I’m not at all convinced that the MIRI paradigm enables a useful view of progress, or a useful prediction of the kinds of insights that are needed for further progress.
IABIED’s position reminds me of Robin Hanson’s AI Progress Estimate:
Robin wrote that just as neural network enthusiasts, who mostly gathered at a different conference, were noticing an important acceleration. I expect that Robin is correct that the research on which he based his outside view would have taken at least a century to reach human levels. Or as Claude puts it:
The research that the IABIED authors consider to be very hard has a similar flavor to the research that Robin was extrapolating.
Robin saw Pearl’s Causality as an important AI advance, whereas I don’t see it as part of AI. IABIED seems to want more advances such as Logical Induction, which I’m guessing are similarly marginal to AI.
Superalignment
The parts of IABIED that ought to be most controversial are pages 189 to 191, where they reject the whole idea of superalignment that AI companies are pursuing.
IABIED claims that AI companies use two different versions of the superalignment idea: a weak one that only involves using AI to understand whether an AI is aligned, and a strong one which uses AI to do all the alignment work.
I’m disturbed to hear that those are what AI companies have in mind. I hope there’s been some communication error here, but it seems fairly plausible that the plans of AI companies are this sketchy.
Weak Superalignment
I find this analogy misleading. It reflects beliefs about what innovations are needed for safety that are likely mistaken. I expect that AI safety today is roughly at the stage that AI capabilities were at around 2000 to 2010: much more than half of the basic ideas that are needed are somewhat well known, but we’re missing the kind of evidence that we need to confirm it, and a focus on the wrong paradigms makes it hard to notice the best strategy.
In particular, it looks like we’re close enough to being able to implement corrigibility that the largest obstacle involves being able to observe how corrigible an AI is.
Interpretability is part of what we might need to distinguish paradigms.
Strong Superalignment
I’m not going to advocate strong superalignment as the best strategy, but IABIED dismisses it too confidently. They say that we “can’t trust an AI like that, before you’ve solved the alignment problem”.
There are a wide variety of possible AIs and methods of determining what they can be trusted to do, in much the same way that there are a wide variety of contexts in which humans can be trusted. I’m disappointed that IABIED asserts an impossibility without much of an argument.
I searched around, and the MIRI paper Misalignment and Catastrophe has an argument that might reflect IABIED’s reasoning:
The paper has a coherent model of why some tasks might require dangerously powerful AI. Their intuitions about what kind of tasks are needed bear little resemblance to my intuitions. Neither one of us seems to be able to do much better than reference class tennis at explaining why we disagree.
Another Version of Superalignment
Let me propose a semi-weak version of superalignment.
I like how IABIED divides intelligence into two concepts: prediction and steering.
I see strong hints, from how AI has been developing over the past couple of years, that there’s plenty of room for increasing the predictive abilities of AI, without needing much increase in the AI’s desire to do anything other than predict.
Much of the progress (or do I mean movement in the wrong direction?) toward more steering abilities has come from efforts that are specifically intended to involve steering. My intuition says that AI companies are capable of producing AIs that are better at prediction but still as myopic as the more myopic of the current leading AIs.
What I want them to aim for is myopic AIs with prediction abilities that are superhuman in some aspects, while keeping their steering abilities weak.
I want to use those somewhat specialized AIs to make better than human predictions about which AI strategies will produce what results. I envision using AIs that have wildly superhuman abilities to integrate large amounts of evidence, while being at most human-level in other aspects of intelligence.
I foresee applying these AIs to problems of a narrower scope than inventing something comparable to nuclear physics.
If we already have the ideas we need to safely reach moderately superhuman AI (I’m maybe 75% confident that we do), then better predictions plus interpretability are roughly what we need to distinguish the most useful paradigm.
IABIED likens our situations to alchemists who are failing due to not having invented nuclear physics. What I see in AI safety efforts doesn’t look like the consistent failures of alchemy. It looks much more like the problems faced by people who try to create an army that won’t stage a coup. There are plenty of tests that yield moderately promising evidence that the soldiers will usually obey civilian authorities. There’s no big mystery about why soldiers might sometimes disobey. The major problem is verifying how well the training generalizes out of distribution.
IABIED sees hope in engineering humans to be better than Pearl or von Neumann at generating new insights. Would Pearl+-level insights create important advances in preventing military coups? I don’t know. But it’s not at all the first strategy that comes to mind.
One possible crux with superalignment is whether AI progress can be slowed at around the time that recursive improvement becomes possible. IABIED presumably says no. I expect AIs of that time will provide valuable advice. Maybe that advice will just be about how to shut down AI development. But more likely it will convince us to adopt a good paradigm that has been lingering in neglect.
In sum, IABIED dismisses superalignment much too quickly.
Heading to ASI?
I have some disagreement with IABIED’s claim about how quickly an ASI could defeat humans. But that has little effect on how scared we ought to be. The main uncertainty about whether an ASI could defeat humans is about what kind of AI would be defending humans.
They define superintelligence (ASI) as an AI that is “a mind much more capable than any human at almost every sort of steering and prediction problem”. They seem to write as if “humanity collectively” and “any human” were interchangeable. I want to keep those concepts distinct.
If they just meant smarter than the collective humanity of 2025, then I agree that they have strong reason to think we’re on track to get there in the 2030s. Prediction markets are forecasting a 50% chance of the weaker meaning of ASI before 2035. I find it easy to imagine that collective humanity of 2025 is sufficiently weak that an ASI would eclipse it a few years after reaching that weak meaning of ASI.
But IABIED presumably wants to predict an ASI that is smarter than the collective humanity at the time the ASI is created. I predict that humanity then will be significantly harder to defeat than the collective humanity of 2025, due to defenses built with smarter-than-a-single-human AI.
So I don’t share IABIED’s confidence that we’re on track to get world-conquering AI. I expect it to be a close call as to whether defenses improve fast enough.
IABIED denies that their arguments depend on fast takeoff, but I see takeoff speed as having a large influence on how plausible their arguments are. An ASI that achieves god-like powers within days of reaching human-level intelligence would almost certainly be able to conquer humanity. If AI takes a decade to go from human level to clearly but not dramatically smarter than any human at every task, then it looks unlikely that that ASI will be able to conquer humanity.
I expect takeoff to be fast enough that IABIED’s perspective here won’t end up looking foolish, but will look overconfident.
I expect the likelihood of useful fire alarms to depend strongly on takeoff speed. I see a better than 50% chance that a fire alarm will significantly alter AI policy. Whereas if I believed in MIRI’s version of foom, I’d put that at more like 5%.
Shard Theory vs Utility Functions
The MIRI paradigm models values as being produced by a utility function, whereas some other researchers prefer shard theory. IABIED avoids drawing attention to this, but it still has subtle widespread influences on their reasoning.
Neither of these models are wrong. Each one is useful for a different set of purposes. They nudge us toward different guesses about what values AIs will have.
The utility function model of searching through the space of possible utility functions increases the salience of alien minds.
Whereas shard theory comes closer to modeling the process that generated human minds and which generates the values of existing AIs.
Note that there are strong arguments that AIs will want to adopt value systems that qualify as utility functions as they approach god-like levels of rationality. I don’t see that being very relevant to what I see as the period of acute risk, when I expect the AIs to have shard-like values that aren’t coherent enough to be usefully analyzed as utility functions. Coherence might be hard.
IABIED wants us to be certain that a small degree of misalignment will be fatal. As Will MacAskill put it:
Max Harms explains a reason for uncertainty that makes sense within the MIRI paradigm (see the Types of misalignment section).
I see a broader uncertainty. Many approaches to AI, including current AIs, create conflicting goals. If such an AI becomes superhuman, it seems likely to resolve those conflicts in ways that depend on many details of how the AI works. Some of those resolution methods are likely to work well. We should maybe pay more attention to whether we can influence which methods an AI ends up using.
Asimov’s Three Laws illustrate a mistake that makes imperfect alignment more catastrophic than it needs to be. He provided a clear rule for resolving conflicts. Why did he explicitly give corrigibility a lower priority than the First Law?
My guess is that imperfect alignment of current AIs would end up working like a rough approximation of the moral parliament. I’m guessing, with very low confidence, that that means humans get a slice of the lightcone.
AIs Won’t Keep Their Promises
From AIs won’t keep their promises:
This section convinced me that I’d been thoughtlessly overconfident about making deals with AIs. I’ve now switched to being mildly pessimistic about them keeping deals.
Yet the details of IABIED’s reasoning seem weird. I’m pretty sure that human honor is mostly a cultural phenomenon, combined with a modest amount of intelligent awareness of the value of reputation.
Cultural influences are more likely to be transmitted to AI values via training data than are genetic influences.
But any such honor is likely to be at least mildly context dependent, and the relevant context is novel enough to create much uncertainty.
More importantly, honor depends on incentives in complex ways. Once again, IABIED’s conclusions seem pretty likely given foom (which probably gives one AI a decisive advantage). Those conclusions seem rather unlikely in a highly multipolar scenario resulting from a fairly slow takeoff, where the AI needs to trade with a diverse set of other AIs. (This isn’t a complete description of the controversial assumptions that are needed in order to analyze this).
I predict fast enough takeoff that I’m worried that IABIED is correct here. But I have trouble respecting claims of more than 80% confidence here. IABIED sounds more than 80% confident.
Conclusion
See also my Comments on MIRI’s The Problem for more thoughts about MIRI’s overconfidence. IABIED’s treatment of corrigibility confirms my pessimism about their ability to recognize progress toward safety.
I agree that if humans, with no further enhancement, build ASI, the risks are unacceptably high. But we probably are on track for something modestly different from that: ASI built by humans who have been substantially enhanced by AIs that are almost superhuman.
I see AI companies as being reckless enough that we ought to be unsure of their sanity. Whereas if I fully agreed with IABIED, I’d say they’re insane enough that an ideal world would shut down and disband those companies completely.
This post has probably sounded too confident. I’m pretty confident that I’m pointing in the general direction of important flaws in IABIED’s arguments. But I doubt that I’ve done an adequate job of clarifying those flaws. I’m unsure how much of my disagreement is due to communication errors, and how much is due to substantive confusion somewhere.
IABIED is likely correct that following their paradigm would require decades of research in order to produce much progress. I’m mildly pessimistic that a shutdown which starts soon could be enforced for decades. IABIED didn’t do much to explain why the MIRI paradigm is good enough that we should be satisfied with it.
Those are some of the reasons why I consider it important to look for other paradigms. Alas, I only have bits and pieces of a good paradigm in mind.
It feels like any paradigm needs to make some sort of bet on the speed of takeoff. Slow takeoff implies a rather different set of risks than does foom. It doesn’t look like we can find a paradigm that’s fully appropriate for all takeoff speeds. That’s an important part of why it’s hard to find the right paradigm. Getting this right is hard, in part because takeoff speed can be influenced by whether there’s a race, and by whether AIs recommend slowing down at key points.
I expect that a good paradigm would induce more researchers to focus on corrigibility, whereas the current paradigms seem to cause neglect via either implying that corrigibility is too easy to require much thought, or too hard for an unenhanced human to tackle.
I’m 60% confident that the path to safety involves a focus on corrigibility. Thinking clearly about corrigibility seems at most Pearl-level hard. It likely still involves lots of stumbling about to ask the right questions and recognize good answers when we see them.
I’m disappointed that IABIED doesn’t advocate efforts to separate AIs goals from their world models, in order to make it easier to influence those goals. Yann LeCun has cost module that is separate from the AI’s world model. It would be ironic if LeCun ended up helping more than the AI leaders who say they’re worried about safety.
I approve of IABIED’s attempt to write for a broader audience than one that would accept the MIRI paradigm. It made the book a bit more effective at raising awareness of AI risks. But it left a good deal of confusion that an ideal book would have resolved.
P.S. - My AI-related investments did quite well in September, with the result that my AI-heavy portfolio was up almost 20%. I’m unsure how much of that is connected to the book—it’s not like there was a sudden surprise on the day the book was published. But my intuition says there was some sort of connection.
Eliezer continues to be more effective at persuading people that AI will be powerful, than at increasing people’s p(doom). But p(powerful) is more important to update than p(doom), as long as your p(doom) doesn’t round to zero.
Or maybe he did an excellent job of timing the book’s publication for when the world was ready to awaken to AI’s power.