IABIED: Paradigm Confusion and Overconfidence

Link post

This is a continuation of my review of IABIED. It’s intended for audiences who already know a lot about AI risk debates. Please at least glance at my main layman-oriented review before reading this.

Eliezer and Nate used to argue about AI risk using a paradigm that involved a pretty sudden foom, and which viewed values through a utility function lens. I’ll call that the MIRI paradigm (note: I don’t have a comprehensive description of the paradigm). In IABIED, they’ve tried to adopt a much broader paradigm that’s somewhat closer to that of more mainstream AI researchers. Yet they keep sounding to me like they’re still thinking within the MIRI paradigm.

Predicting Progress

More broadly, IABIED wants us to believe that progress has been slow in AI safety. From IABIED’s online resources:

When humans try to solve the AI alignment problem directly … the solutions discussed tend to involve understanding a lot more about intelligence and how to craft it, or craft critical components of it.

That’s an endeavor that human scientists have made only a small amount of progress on over the past seventy years. The kinds of AIs that can pull off a feat like that are the kinds of AIs that are smart enough to be dangerous, strategic, and deceptive. This high level of difficulty makes it extremely unlikely that researchers would be able to tell correct solutions from incorrect ones, or tell honest solutions apart from traps.

I’m not at all convinced that the MIRI paradigm enables a useful view of progress, or a useful prediction of the kinds of insights that are needed for further progress.

IABIED’s position reminds me of Robin Hanson’s AI Progress Estimate:

I asked a few other folks at UAI who had been in the field for twenty years to estimate the same things, and they roughly agreed—about 5-10% of the distance has been covered in that time, without noticeable acceleration. It would be useful to survey senior experts in other areas of AI, to get related estimates for their areas. If this 5-10% estimate is typical, as I suspect it is, then an outside view calculation suggests we probably have at least a century to go, and maybe a great many centuries, at current rates of progress.

Robin wrote that just as neural network enthusiasts, who mostly gathered at a different conference, were noticing an important acceleration. I expect that Robin is correct that the research on which he based his outside view would have taken at least a century to reach human levels. Or as Claude puts it:

The real problem wasn’t just selection bias but paradigm blindness. UAI folks were measuring progress toward human-level reasoning via elegant probabilistic models. They didn’t anticipate that “brute force” neural networks would essentially bypass their entire research program.

The research that the IABIED authors consider to be very hard has a similar flavor to the research that Robin was extrapolating.

Robin saw Pearl’s Causality as an important AI advance, whereas I don’t see it as part of AI. IABIED seems to want more advances such as Logical Induction, which I’m guessing are similarly marginal to AI.

Superalignment

The parts of IABIED that ought to be most controversial are pages 189 to 191, where they reject the whole idea of superalignment that AI companies are pursuing.

IABIED claims that AI companies use two different versions of the superalignment idea: a weak one that only involves using AI to understand whether an AI is aligned, and a strong one which uses AI to do all the alignment work.

I’m disturbed to hear that those are what AI companies have in mind. I hope there’s been some communication error here, but it seems fairly plausible that the plans of AI companies are this sketchy.

Weak Superalignment

In the case of weak superalignment: We agree that a relatively unintelligent AI could help with “interpretability research,” as it’s called. But learning to read some of an AI’s mind is not a plan for aligning it, any more than learning what’s going on inside atoms is a plan for making a nuclear reactor that doesn’t melt down.

I find this analogy misleading. It reflects beliefs about what innovations are needed for safety that are likely mistaken. I expect that AI safety today is roughly at the stage that AI capabilities were at around 2000 to 2010: much more than half of the basic ideas that are needed are somewhat well known, but we’re missing the kind of evidence that we need to confirm it, and a focus on the wrong paradigms makes it hard to notice the best strategy.

In particular, it looks like we’re close enough to being able to implement corrigibility that the largest obstacle involves being able to observe how corrigible an AI is.

Interpretability is part of what we might need to distinguish paradigms.

Strong Superalignment

I’m not going to advocate strong superalignment as the best strategy, but IABIED dismisses it too confidently. They say that we “can’t trust an AI like that, before you’ve solved the alignment problem”.

There are a wide variety of possible AIs and methods of determining what they can be trusted to do, in much the same way that there are a wide variety of contexts in which humans can be trusted. I’m disappointed that IABIED asserts an impossibility without much of an argument.

I searched around, and the MIRI paper Misalignment and Catastrophe has an argument that might reflect IABIED’s reasoning:

We argue that AIs must reach a certain level of goal-directedness and general capability in order to do the tasks we are considering, and that this is sufficient to cause catastrophe if the AI is misaligned.

The paper has a coherent model of why some tasks might require dangerously powerful AI. Their intuitions about what kind of tasks are needed bear little resemblance to my intuitions. Neither one of us seems to be able to do much better than reference class tennis at explaining why we disagree.

Another Version of Superalignment

Let me propose a semi-weak version of superalignment.

I like how IABIED divides intelligence into two concepts: prediction and steering.

I see strong hints, from how AI has been developing over the past couple of years, that there’s plenty of room for increasing the predictive abilities of AI, without needing much increase in the AI’s desire to do anything other than predict.

Much of the progress (or do I mean movement in the wrong direction?) toward more steering abilities has come from efforts that are specifically intended to involve steering. My intuition says that AI companies are capable of producing AIs that are better at prediction but still as myopic as the more myopic of the current leading AIs.

What I want them to aim for is myopic AIs with prediction abilities that are superhuman in some aspects, while keeping their steering abilities weak.

I want to use those somewhat specialized AIs to make better than human predictions about which AI strategies will produce what results. I envision using AIs that have wildly superhuman abilities to integrate large amounts of evidence, while being at most human-level in other aspects of intelligence.

I foresee applying these AIs to problems of a narrower scope than inventing something comparable to nuclear physics.

If we already have the ideas we need to safely reach moderately superhuman AI (I’m maybe 75% confident that we do), then better predictions plus interpretability are roughly what we need to distinguish the most useful paradigm.

IABIED likens our situations to alchemists who are failing due to not having invented nuclear physics. What I see in AI safety efforts doesn’t look like the consistent failures of alchemy. It looks much more like the problems faced by people who try to create an army that won’t stage a coup. There are plenty of tests that yield moderately promising evidence that the soldiers will usually obey civilian authorities. There’s no big mystery about why soldiers might sometimes disobey. The major problem is verifying how well the training generalizes out of distribution.

IABIED sees hope in engineering humans to be better than Pearl or von Neumann at generating new insights. Would Pearl+-level insights create important advances in preventing military coups? I don’t know. But it’s not at all the first strategy that comes to mind.

One possible crux with superalignment is whether AI progress can be slowed at around the time that recursive improvement becomes possible. IABIED presumably says no. I expect AIs of that time will provide valuable advice. Maybe that advice will just be about how to shut down AI development. But more likely it will convince us to adopt a good paradigm that has been lingering in neglect.

In sum, IABIED dismisses superalignment much too quickly.

Heading to ASI?

I have some disagreement with IABIED’s claim about how quickly an ASI could defeat humans. But that has little effect on how scared we ought to be. The main uncertainty about whether an ASI could defeat humans is about what kind of AI would be defending humans.

Our concern is for what comes after: machine intelligence that is genuinely smart, smarter than any living human, smarter than humanity collectively.

They define superintelligence (ASI) as an AI that is “a mind much more capable than any human at almost every sort of steering and prediction problem”. They seem to write as if “humanity collectively” and “any human” were interchangeable. I want to keep those concepts distinct.

If they just meant smarter than the collective humanity of 2025, then I agree that they have strong reason to think we’re on track to get there in the 2030s. Prediction markets are forecasting a 50% chance of the weaker meaning of ASI before 2035. I find it easy to imagine that collective humanity of 2025 is sufficiently weak that an ASI would eclipse it a few years after reaching that weak meaning of ASI.

But IABIED presumably wants to predict an ASI that is smarter than the collective humanity at the time the ASI is created. I predict that humanity then will be significantly harder to defeat than the collective humanity of 2025, due to defenses built with smarter-than-a-single-human AI.

So I don’t share IABIED’s confidence that we’re on track to get world-conquering AI. I expect it to be a close call as to whether defenses improve fast enough.

IABIED denies that their arguments depend on fast takeoff, but I see takeoff speed as having a large influence on how plausible their arguments are. An ASI that achieves god-like powers within days of reaching human-level intelligence would almost certainly be able to conquer humanity. If AI takes a decade to go from human level to clearly but not dramatically smarter than any human at every task, then it looks unlikely that that ASI will be able to conquer humanity.

I expect takeoff to be fast enough that IABIED’s perspective here won’t end up looking foolish, but will look overconfident.

I expect the likelihood of useful fire alarms to depend strongly on takeoff speed. I see a better than 50% chance that a fire alarm will significantly alter AI policy. Whereas if I believed in MIRI’s version of foom, I’d put that at more like 5%.

Shard Theory vs Utility Functions

The MIRI paradigm models values as being produced by a utility function, whereas some other researchers prefer shard theory. IABIED avoids drawing attention to this, but it still has subtle widespread influences on their reasoning.

Neither of these models are wrong. Each one is useful for a different set of purposes. They nudge us toward different guesses about what values AIs will have.

The utility function model of searching through the space of possible utility functions increases the salience of alien minds.

Whereas shard theory comes closer to modeling the process that generated human minds and which generates the values of existing AIs.

Note that there are strong arguments that AIs will want to adopt value systems that qualify as utility functions as they approach god-like levels of rationality. I don’t see that being very relevant to what I see as the period of acute risk, when I expect the AIs to have shard-like values that aren’t coherent enough to be usefully analyzed as utility functions. Coherence might be hard.

IABIED wants us to be certain that a small degree of misalignment will be fatal. As Will MacAskill put it:

I think Y&S often equivocate between two different concepts under the idea of “misalignment”:

Imperfect alignment: The AI doesn’t always try to do what the developer/user intended it to try to do. 2. Catastrophic misalignment: The AI tries hard to disempower all humanity, insofar as it has the opportunity.

Max Harms explains a reason for uncertainty that makes sense within the MIRI paradigm (see the Types of misalignment section).

I see a broader uncertainty. Many approaches to AI, including current AIs, create conflicting goals. If such an AI becomes superhuman, it seems likely to resolve those conflicts in ways that depend on many details of how the AI works. Some of those resolution methods are likely to work well. We should maybe pay more attention to whether we can influence which methods an AI ends up using.

Asimov’s Three Laws illustrate a mistake that makes imperfect alignment more catastrophic than it needs to be. He provided a clear rule for resolving conflicts. Why did he explicitly give corrigibility a lower priority than the First Law?

My guess is that imperfect alignment of current AIs would end up working like a rough approximation of the moral parliament. I’m guessing, with very low confidence, that that means humans get a slice of the lightcone.

AIs Won’t Keep Their Promises

From AIs won’t keep their promises:

Natural selection has difficulty finding the genes that cause a human to keep deals only in the cases where our long-term reputation is a major consideration. It was easier to just evolve an instinctive distaste for lying and cheating.

All of the weird fiddly cases where humans sometimes keep a promise even when it’s not actually beneficial to us are mainly evidence about what sorts of emotions were most helpful in our tribal ancestral environment while also being easy for evolution to encode into genomes, rather than evidence about some universally useful cognitive step.

This section convinced me that I’d been thoughtlessly overconfident about making deals with AIs. I’ve now switched to being mildly pessimistic about them keeping deals.

Yet the details of IABIED’s reasoning seem weird. I’m pretty sure that human honor is mostly a cultural phenomenon, combined with a modest amount of intelligent awareness of the value of reputation.

Cultural influences are more likely to be transmitted to AI values via training data than are genetic influences.

But any such honor is likely to be at least mildly context dependent, and the relevant context is novel enough to create much uncertainty.

More importantly, honor depends on incentives in complex ways. Once again, IABIED’s conclusions seem pretty likely given foom (which probably gives one AI a decisive advantage). Those conclusions seem rather unlikely in a highly multipolar scenario resulting from a fairly slow takeoff, where the AI needs to trade with a diverse set of other AIs. (This isn’t a complete description of the controversial assumptions that are needed in order to analyze this).

I predict fast enough takeoff that I’m worried that IABIED is correct here. But I have trouble respecting claims of more than 80% confidence here. IABIED sounds more than 80% confident.

Conclusion

See also my Comments on MIRI’s The Problem for more thoughts about MIRI’s overconfidence. IABIED’s treatment of corrigibility confirms my pessimism about their ability to recognize progress toward safety.

I agree that if humans, with no further enhancement, build ASI, the risks are unacceptably high. But we probably are on track for something modestly different from that: ASI built by humans who have been substantially enhanced by AIs that are almost superhuman.

I see AI companies as being reckless enough that we ought to be unsure of their sanity. Whereas if I fully agreed with IABIED, I’d say they’re insane enough that an ideal world would shut down and disband those companies completely.

This post has probably sounded too confident. I’m pretty confident that I’m pointing in the general direction of important flaws in IABIED’s arguments. But I doubt that I’ve done an adequate job of clarifying those flaws. I’m unsure how much of my disagreement is due to communication errors, and how much is due to substantive confusion somewhere.

IABIED is likely correct that following their paradigm would require decades of research in order to produce much progress. I’m mildly pessimistic that a shutdown which starts soon could be enforced for decades. IABIED didn’t do much to explain why the MIRI paradigm is good enough that we should be satisfied with it.

Those are some of the reasons why I consider it important to look for other paradigms. Alas, I only have bits and pieces of a good paradigm in mind.

It feels like any paradigm needs to make some sort of bet on the speed of takeoff. Slow takeoff implies a rather different set of risks than does foom. It doesn’t look like we can find a paradigm that’s fully appropriate for all takeoff speeds. That’s an important part of why it’s hard to find the right paradigm. Getting this right is hard, in part because takeoff speed can be influenced by whether there’s a race, and by whether AIs recommend slowing down at key points.

I expect that a good paradigm would induce more researchers to focus on corrigibility, whereas the current paradigms seem to cause neglect via either implying that corrigibility is too easy to require much thought, or too hard for an unenhanced human to tackle.

I’m 60% confident that the path to safety involves a focus on corrigibility. Thinking clearly about corrigibility seems at most Pearl-level hard. It likely still involves lots of stumbling about to ask the right questions and recognize good answers when we see them.

I’m disappointed that IABIED doesn’t advocate efforts to separate AIs goals from their world models, in order to make it easier to influence those goals. Yann LeCun has cost module that is separate from the AI’s world model. It would be ironic if LeCun ended up helping more than the AI leaders who say they’re worried about safety.

I approve of IABIED’s attempt to write for a broader audience than one that would accept the MIRI paradigm. It made the book a bit more effective at raising awareness of AI risks. But it left a good deal of confusion that an ideal book would have resolved.

P.S. - My AI-related investments did quite well in September, with the result that my AI-heavy portfolio was up almost 20%. I’m unsure how much of that is connected to the book—it’s not like there was a sudden surprise on the day the book was published. But my intuition says there was some sort of connection.

Eliezer continues to be more effective at persuading people that AI will be powerful, than at increasing people’s p(doom). But p(powerful) is more important to update than p(doom), as long as your p(doom) doesn’t round to zero.

Or maybe he did an excellent job of timing the book’s publication for when the world was ready to awaken to AI’s power.