From the piece:
Earlier this year I decided to take a few weeks to figure out what I think about the existential risk from Artificial Superintelligence (ASI xrisk). It turned out to be much more difficult than I thought. After several months of reading, thinking, and talking with people, what follows is a discussion of a few observations arising during this exploration, including:
Three ASI xrisk persuasion paradoxes, which make it intrinsically difficult to present strong evidence either for or against ASI xrisk. The lack of such compelling evidence is part of the reason there is such strong disagreement about ASI xrisk, with people often (understandably) relying instead on prior beliefs, self-interest, and tribal reasoning to decide their opinions.
The alignment dilemma: should someone concerned with xrisk contribute to concrete alignment work, since it’s the only way we can hope to build safe systems; or should they refuse to do such work, as contributing to accelerating a bad outcome? This is part of a broader discussion of the accelerationist character of much AI alignment work, which shows that capabilities / alignment is a false dichotomy.
The doomsday question: are there recipes for ruin—simple, easily executed, immensely destructive recipes that could end humanity, or wreak catastrophic world-changing damage?
What bottlenecks are there on ASI speeding up scientific discovery? And, in particular: is it possible for ASI to discover new levels of emergent phenomena, latent in existing theories?
Here are the passages I thought were interesting enough to tweet about:
“So, what’s your probability of doom?” I think the concept is badly misleading. The outcomes humanity gets depend on choices we can make. We can make choices that make doom almost inevitable, on a timescale of decades – indeed, we don’t need ASI for that, we can likely arrange it in other ways (nukes, engineered viruses, …). We can also make choices that make doom extremely unlikely. The trick is to figure out what’s likely to lead to flourishing, and to do those things. The term “probability of doom” began frustrating me after I started routinely hearing people at AI companies use it fatalistically, ignoring the fact that their choices can change the outcomes. “Probability of doom” is an example of a conceptual hazard – a case where merely using the concept may lead to mistakes in your thinking. Its main use seems to be as marketing: if widely-respected people say forcefully that they have a high or low probability of doom, that may cause other people to stop and consider why. But I dislike concepts which are good for marketing, but bad for understanding; they foster collective misunderstanding, and are likely to eventually lead to collective errors in action.
With all that said: practical alignment work is extremely accelerationist. If ChatGPT had behaved like Tay, AI would still be getting minor mentions on page 19 of The New York Times. These alignment techniques play a role in AI somewhat like the systems used to control when a nuclear bomb goes off. If such bombs just went off at random, no-one would build nuclear bombs, and there would be no nuclear threat to humanity. Practical alignment work makes today’s AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist. There’s an extremely thoughtful essay by Paul Christiano, one of the pioneers of both RLHF and AI safety, where he addresses the question of whether he regrets working on RLHF, given the acceleration it has caused. I admire the self-reflection and integrity of the essay, but ultimately I think, like many of the commenters on the essay, that he’s only partially facing up to the fact that his work will considerably hasten ASI, including extremely dangerous systems.
Over the past decade I’ve met many AI safety people who speak as though “AI capabilities” and “AI safety/alignment” work form a dichotomy. They talk in terms of wanting to “move” capabilities researchers into alignment. But most concrete alignment work is capabilities work. It’s a false dichotomy, and another example of how a conceptual error can lead a field astray. Fortunately, many safety people now understand this, but I still sometimes see the false dichotomy misleading people, sometimes even causing systematic effects through bad funding decisions.
“Does this mean you oppose such practical work on alignment?” No! Not exactly. Rather, I’m pointing out an alignment dilemma: do you participate in practical, concrete alignment work, on the grounds that it’s only by doing such work that humanity has a chance to build safe systems? Or do you avoid participating in such work, viewing it as accelerating an almost certainly bad outcome, for a very small (or non-existent) improvement in the chances the outcome will be good? Note that this dilemma isn’t the same as the by-now common assertion that alignment work is intrinsically accelerationist. Rather, it’s making a different-albeit-related point, which is that if you take ASI xrisk seriously, then alignment work is a damned-if-you-do-damned-if-you-don’t proposition.
“What are those intrinsic reasons it’s hard to make a case either for or against xrisk?” There are three xrisk persuasion paradoxes that make it difficult. Very briefly, these are:
The most direct way to make a strong argument for xrisk is to convincingly describe a detailed concrete pathway to extinction. The more concretely you describe the steps, the better the case for xrisk. But of course, any “progress” in improving such an argument actually creates xrisk. “Here’s detailed instructions for how almost anyone can easily and inexpensively create an antimatter bomb: [convincing, verifiable instructions]” makes a more compelling argument for xrisk than speculating that: “An ASI might come up with a cheap and easy recipe by which almost anyone can easily create antimatter bombs.” Or perhaps you make “progress” by filling in a few of the intermediate steps an ASI might have to do. Maybe you show that antimatter is a little easier to make than one might a priori have thought. Of course, you should avoid making such arguments entirely, or working on them. It’s a much more extreme version of your boss asking you to make a case for why you should be fired: that’s a very good time to exhibit strategic incompetence. The case for ASI xrisk is often made in very handwavy and incomplete ways; critics of xrisk then dismiss those vague arguments. I recently heard an AI investor complain: “The doomers never describe in convincing detail how things will go bad”. I certainly understand their frustration; at the same time, that vagueness is something to celebrate and preserve.
Any sufficiently strong argument for xrisk will likely alter human actions in ways that avert xrisk. The stronger the argument, paradoxically, the more likely it is to avert xrisk. Suppose that in the late 1930s or early 1940s someone had found and publicized a truly convincing argument that nuclear weapons would set the world’s atmosphere on fire. Had that been the case, the bombs would have been much less likely to be developed and used. Similarly, as our understanding of human-caused climate change has improved it has gradually caused enormous changes in human action. As one of many examples, in recent years our use of renewable energy has typically grown at a rate of about 15% per year, whereas the rate of fossil fuel growth is no more than a few percent per year, and sometimes much less. That relative increase is not due entirely to climate fears, but those fears have certainly helped drive investment and progress in renewables.
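To get a feel for what those differing growth rates imply, here is a toy compound-growth calculation. The starting values are hypothetical (renewables at a tenth of fossil output), chosen only to illustrate how quickly a 15%-vs-2% growth gap compounds; the real energy mix is of course messier.

```python
# Toy compound-growth comparison. Starting values are assumed for
# illustration: renewables at 10 units, fossil fuels at 100 units.
renewables, fossil = 10.0, 100.0
r_growth, f_growth = 1.15, 1.02  # ~15%/yr vs ~2%/yr, per the text

years = 0
while renewables < fossil:
    renewables *= r_growth
    fossil *= f_growth
    years += 1

# Even starting at a tenth the size, the faster-growing source
# overtakes the slower one in about two decades at these rates.
print(years)  # → 20
```

The point of the sketch is just that sustained differences in growth rates dominate differences in starting size, which is why a shift in investment driven by changed beliefs can reshape outcomes within decades.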
By definition, any pathway to xrisk which we can describe in detail doesn’t require superhuman intelligence. A variation of this problem shows up often in fiction: it is difficult for authors to convincingly write characters who are far smarter than the author. I love Vernor Vinge’s description of this as a general barrier making it hard to write science fiction, “an opaque wall across the future” in Vinge’s terms. Of course, it’s not an entirely opaque wall, in that any future ASI will be subject to many constraints we can anticipate today. They won’t be able to prove that 2+2=5; they won’t be able to violate the laws of physics; they likely (absent very unexpected changes) won’t be able to solve the halting problem or to solve NP-complete problems in polynomial time.
“So, when will ASI be able to think its way to new discoveries?” There’s a flipside to the above, which is that ASI can be expected to excel in situations where we already have extremely accurate predictive theories; the contingencies are already known and incorporated into the theory, in detail. Indeed, there are already cases where humanity has used such theories to great advantage to make substantial further “progress”, mostly through thinking (“theory and/or simulation”) alone, perhaps augmented with a little experiment:
The first atomic bombs were designed in considerable part using theory and simulation, built on models of neutrons and the nucleus which had been developed in the 1930s. Experiment was still required, but it’s remarkable the extent to which theorists’ thinking drove the design of the bomb.
The first hydrogen bombs relied even more heavily on theory, in part because the conditions inside an atomic bomb – used to trigger the hydrogen bomb – were so unusual that they were difficult to study experimentally. Indeed, the very first calculation carried out on the ENIAC computer was a theoretical simulation of the hydrogen bomb.
The first stealth fighter was designed principally through theory and simulation, though it was done in conjunction with some physical experimentation.
Bose-Einstein condensation was predicted in advance through theory alone. I am told that in a seminar Phil Anderson, originator of the term “emergence”, once said that emergent phenomena were so surprising as to be impossible to predict in advance. Someone in the audience at the seminar said “what about Bose-Einstein condensation?” Anderson shot back that that didn’t count, since Einstein was a genius.
LIGO was driven by and relied upon theory and simulation to a staggering extent; a lot of experimental input was still necessary to characterize and suppress the noise sources, which were not fully understood or characterized in advance.
Many remarkable phenomena have been predicted in advance based on extant theories, including lasing, the quantum Hall (but not fractional quantum Hall) effect, quantum teleportation, topological quantum computing, and others.
It’s intriguing to think of striking new phenomena within computers as examples of this general pattern. Something like, for example, public-key cryptography seems to me an instance of a scientific discovery made entirely through theory, grounded in some extant theoretical system (in that case, the Turing machine theory of computation, as well as some broad notions of cryptography).
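The public-key example can be made concrete with a toy sketch. Below is textbook RSA with tiny, deliberately insecure primes (the standard worked example with p=61, q=53), which illustrates the point in the text: the scheme’s correctness is a theorem of number theory, discovered and verified entirely within an existing theoretical framework, with no experiment needed.

```python
# Toy RSA with tiny primes -- illustration only, utterly insecure.
p, q = 61, 53            # small primes for the worked example
n = p * q                # public modulus: 3233
phi = (p - 1) * (q - 1)  # Euler's totient of n: 3120
e = 17                   # public exponent, coprime to phi
d = pow(e, -1, phi)      # private exponent: modular inverse of e mod phi

message = 65
ciphertext = pow(message, e, n)    # encrypt with the public key (e, n)
recovered = pow(ciphertext, d, n)  # decrypt with the private key (d, n)

# Decryption recovering the message is guaranteed by Euler's theorem;
# no physical experiment is involved at any step.
assert recovered == message
```

(The three-argument `pow` for modular inverses requires Python 3.8+.) The design choice worth noticing is that every line follows from pure theory: once the number theory was in place, the phenomenon of public-key encryption was, in a sense, waiting to be noticed.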