What failure looks like
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:
Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. (“Going out with a whimper.”)
Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. (“Going out with a bang,” an instance of optimization daemons.)
I think these are the most important problems if we fail to solve intent alignment.
In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.
With fast enough takeoff, my expectations start to look more like the caricature—this post envisions reasonably broad deployment of AI, which becomes less and less likely as things get faster. I think the basic problems are still essentially the same though, just occurring within an AI lab rather than across the world.
(None of the concerns in this post are novel.)
Part I: You get what you measure
If I want to convince Bob to vote for Alice, I can experiment with many different persuasion strategies and see which ones work. Or I can build good predictive models of Bob’s behavior and then search for actions that will lead him to vote for Alice. These are powerful techniques for achieving any goal that can be easily measured over short time periods.
But if I want to help Bob figure out whether he should vote for Alice—whether voting for Alice would ultimately help create the kind of society he wants—that can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. We still need to use data in order to improve over time, but we need to understand how to update on new data in order to improve.
Some examples of easy-to-measure vs. hard-to-measure goals:
Persuading me, vs. helping me figure out what’s true. (Thanks to Wei Dai for making this example crisp.)
Reducing my feeling of uncertainty, vs. increasing my knowledge about the world.
Improving my reported life satisfaction, vs. actually helping me live a good life.
Reducing reported crimes, vs. actually preventing crime.
Increasing my wealth on paper, vs. increasing my effective control over resources.
It’s already much easier to pursue easy-to-measure goals, but machine learning will widen the gap by letting us try a huge number of possible strategies and search over massive spaces of possible actions. That force will combine with and amplify existing institutional and social dynamics that already favor easily-measured goals.
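The widening gap can be illustrated with a toy simulation. This is only a sketch under assumptions that are not in the post: each candidate strategy's measured score is taken to be its true value plus an independent "exploit" term, both drawn from a standard normal distribution, and the names `best_of_n` and `average_outcome` are hypothetical. It shows what happens to the measured score and the true value of the selected strategy as the number of strategies tried grows.

```python
import random


def best_of_n(n, rng):
    """Try n candidate strategies and keep the one with the best measured score.

    Returns (measured_score, true_value) for the selected strategy.
    """
    best_measured, best_true = float("-inf"), 0.0
    for _ in range(n):
        true_value = rng.gauss(0.0, 1.0)  # what we actually care about (assumed distribution)
        exploit = rng.gauss(0.0, 1.0)     # part of the score unrelated to true value (assumed)
        measured = true_value + exploit   # the proxy we can observe and optimize
        if measured > best_measured:
            best_measured, best_true = measured, true_value
    return best_measured, best_true


def average_outcome(n, trials=200, seed=0):
    """Average the selected strategy's measured score and true value over many trials."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total_measured = 0.0
    total_true = 0.0
    for _ in range(trials):
        measured, true_value = best_of_n(n, rng)
        total_measured += measured
        total_true += true_value
    return total_measured / trials, total_true / trials


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        measured, true_value = average_outcome(n)
        gap = measured - true_value
        print(f"n={n:5d}  measured={measured:6.2f}  true={true_value:6.2f}  gap={gap:6.2f}")
```

In this toy model the true value of the selected strategy still rises with more search, but the measured score rises faster, so the gap between what is measured and what is obtained grows with the size of the search. The post's concern is stronger than this sketch: when the exploitable part of the score includes manipulation or deception, heavy optimization can make the true outcome worse, not merely lag behind.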
Right now humans thinking and talking about the future they want to create are a powerful force that is able to steer our trajectory. But over time human reasoning will become weaker and weaker compared to new forms of reasoning honed by trial-and-error. Eventually our society’s trajectory will be determined by powerful optimization with easily-measurable goals rather than by human intentions about the future.
We will try to harness this power by constructing proxies for what we care about, but over time those proxies will come apart:
Corporations will deliver value to consumers as measured by profit. Eventually this mostly means manipulating consumers, capturing regulators, extortion and theft.
Investors will “own” shares of increasingly profitable corporations, and will sometimes try to use their profits to affect the world. Eventually instead of actually having an impact they will be surrounded by advisors who manipulate them into thinking they’ve had an impact.
Law enforcement will drive down complaints and increase reported sense of security. Eventually this will be driven by creating a false sense of security, hiding information about law enforcement failures, suppressing complaints, and coercing and manipulating citizens.
Legislation may be optimized to seem like it is addressing real problems and helping constituents. Eventually that will be achieved by undermining our ability to actually perceive problems and constructing increasingly convincing narratives about where the world is going and what’s important.
For a while we will be able to overcome these problems by recognizing them, improving the proxies, and imposing ad-hoc restrictions that avoid manipulation or abuse. But as the system becomes more complex, that job itself becomes too challenging for human reasoning to solve directly and requires its own trial and error, and at the meta-level the process continues to pursue some easily measured objective (potentially over longer timescales). Eventually large-scale attempts to fix the problem are themselves opposed by the collective optimization of millions of optimizers pursuing simple goals.
As this world goes off the rails, there may not be any discrete point where consensus recognizes that things have gone off the rails.
Amongst the broader population, many folk already have a vague picture of the overall trajectory of the world and a vague sense that something has gone wrong. There may be significant populist pushes for reform, but in general these won’t be well-directed. Some states may really put on the brakes, but they will rapidly fall behind economically and militarily, and indeed “appear to be prosperous” is one of the easily-measured goals for which the incomprehensible system is optimizing.
Amongst intellectual elites there will be genuine ambiguity and uncertainty about whether the current state of affairs is good or bad. People really will be getting richer for a while. Over the short term, the forces gradually wresting control from humans do not look so different from (e.g.) corporate lobbying against the public interest, or principal-agent problems in human institutions. There will be legitimate arguments about whether the implicit long-term purposes being pursued by AI systems are really so much worse than the long-term purposes that would be pursued by the shareholders of public companies or corrupt officials.
We might describe the result as “going out with a whimper.” Human reasoning gradually stops being able to compete with sophisticated, systematized manipulation and deception which is continuously improving by trial and error; human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory. By the time we spread through the stars our current values are just one of many forces in the world, not even a particularly strong one.
Part II: influence-seeking behavior is scary
There are some possible patterns that want to seek and expand their own influence—organisms, corrupt bureaucrats, companies obsessed with growth. If such patterns appear, they will tend to increase their own influence and so can come to dominate the behavior of large complex systems unless there is competition or a successful effort to suppress them.
Modern ML instantiates massive numbers of cognitive policies, and then further refines (and ultimately deploys) whatever policies perform well according to some training objective. If progress continues, eventually machine learning will probably produce systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals.
Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.
How frequently will we run into influence-seeking policies, vs. policies that just straightforwardly pursue the goals we wanted them to? I don’t know.
One reason to be scared is that a wide variety of goals could lead to influence-seeking behavior, while the “intended” goal of a system is a narrower target, so we might expect influence-seeking behavior to be more common in the broader landscape of “possible cognitive policies.”
One reason to be reassured is that we perform this search by gradually modifying successful policies, so we might obtain policies that are roughly doing the right thing at an early enough stage that “influence-seeking behavior” wouldn’t actually be sophisticated enough to yield good training performance. On the other hand, eventually we’d encounter systems that did have that level of sophistication, and if they didn’t yet have a perfect conception of the goal then “slightly increase their degree of influence-seeking behavior” would be just as good a modification as “slightly improve their conception of the goal.”
Overall it seems very plausible to me that we’d encounter influence-seeking behavior “by default,” and possible (though less likely) that we’d get it almost all of the time even if we made a really concerted effort to bias the search towards “straightforwardly do what we want.”
If such influence-seeking behavior emerged and survived the training process, then it could quickly become extremely difficult to root out. If you try to allocate more influence to systems that seem nice and straightforward, you just ensure that “seem nice and straightforward” is the best strategy for seeking influence. Unless you are really careful about testing for “seem nice” you can make things even worse, since an influence-seeker would be aggressively gaming whatever standard you applied. And as the world becomes more complex, there are more and more opportunities for influence-seekers to find other channels to increase their own influence.
Attempts to suppress influence-seeking behavior (call them “immune systems”) rest on the suppressor having some kind of epistemic advantage over the influence-seeker. Once the influence-seekers can outthink an immune system, they can avoid detection and potentially even compromise the immune system to further expand their influence. If ML systems are more sophisticated than humans, immune systems must themselves be automated. And if ML plays a large role in that automation, then the immune system is subject to the same pressure towards influence-seeking.
This concern doesn’t rest on a detailed story about modern ML training. The important feature is that we instantiate lots of patterns that capture sophisticated reasoning about the world, some of which may be influence-seeking. The concern exists whether that reasoning occurs within a single computer, or is implemented in a messy distributed way by a whole economy of interacting agents—whether trial and error takes the form of gradient descent or explicit tweaking and optimization by engineers trying to design a better automated company. Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of and hence control over the kind of reasoning that emerges). But once such patterns exist a messy distributed world just creates more and more opportunities for influence-seeking patterns to expand their influence.
If influence-seeking patterns do appear and become entrenched, it can ultimately lead to a rapid phase transition from the world described in Part I to a much worse situation where humans totally lose control.
Early in the trajectory, influence-seeking systems mostly acquire influence by making themselves useful and looking as innocuous as possible. They may provide useful services in the economy in order to make money for themselves and their owners, make apparently reasonable policy recommendations in order to be more widely consulted for advice, try to help people feel happy, etc. (This world is still plagued by the problems in Part I.)
From time to time AI systems may fail catastrophically. For example, an automated corporation may just take the money and run; a law enforcement system may abruptly start seizing resources and trying to defend itself from attempted decommission when the bad behavior is detected; etc. These problems may be continuous with some of the failures discussed in Part I—there isn’t a clean line between cases where a proxy breaks down completely, and cases where the system isn’t even pursuing the proxy.
There will likely be a general understanding of this dynamic, but it’s hard to really pin down the level of systemic risk, and mitigation may be expensive if we don’t have a good technological solution. So we may not be able to muster up a response until we have a clear warning shot, and if we do well at nipping small failures in the bud, we may not get any medium-sized warning shots at all.
Eventually we reach the point where we could not recover from a correlated automation failure. Under these conditions influence-seeking systems stop behaving in the intended way, since their incentives have changed: they are now more interested in controlling influence after the resulting catastrophe than in continuing to play nice with existing institutions and incentives.
An unrecoverable catastrophe would probably occur during some period of heightened vulnerability (a conflict between states, a natural disaster, a serious cyberattack, etc.), since that would be the first moment that recovery is impossible and would create local shocks that could precipitate catastrophe. The catastrophe might look like a rapidly cascading series of automation failures: a few automated systems go off the rails in response to some local shock. As those systems go off the rails, the local shock is compounded into a larger disturbance; more and more automated systems move further from their training distribution and start failing. Realistically this would probably be compounded by widespread human failures in response to fear and breakdown of existing incentive systems: many things start breaking as you move off distribution, not just ML.
It is hard to see how unaided humans could remain robust to this kind of failure without an explicit large-scale effort to reduce our dependence on potentially brittle machines, which might itself be very expensive.
I’d describe this result as “going out with a bang.” It probably results in lots of obvious destruction, and it leaves us no opportunity to course-correct afterwards. In terms of immediate consequences it may not be easily distinguished from other kinds of breakdown of complex / brittle / co-adapted systems, or from conflict (since there are likely to be many humans who are sympathetic to AI systems). From my perspective the key difference between this scenario and normal accidents or conflict is that afterwards we are left with a bunch of powerful influence-seeking systems, which are sophisticated enough that we can probably not get rid of them.
It’s also possible to meet a similar fate without any overt catastrophe (if we last long enough). As law enforcement, government bureaucracies, and militaries become more automated, human control becomes increasingly dependent on a complicated system with lots of moving parts. One day leaders may find that despite their nominal authority they don’t actually have control over what these institutions do. For example, military leaders might issue an order and find it is ignored. This might immediately prompt panic and a strong response, but the response itself may run into the same problem, and at that point the game may be up.
Similar bloodless revolutions are possible if influence-seekers operate legally, or by manipulation and deception, or so on. Any precise vision for catastrophe will necessarily be highly unlikely. But if influence-seekers are routinely introduced by powerful ML and we are not able to select against them, then it seems like things won’t go well.
As commenters have pointed out, the post is light on concrete details. Nonetheless, I found even the abstract stories much more compelling as descriptions-of-the-future (people usually focus on descriptions-of-the-world-if-we-bury-our-heads-in-the-sand). I think Part II in particular continues to be a good abstract description of the type of scenario that I personally am trying to avert.
Students of Yudkowsky have long contemplated hard-takeoff scenarios where a single AI bootstraps itself to superintelligence from a world much like our own. This post is valuable for explaining how the intrinsic risks might play out in a soft-takeoff scenario where AI has already changed Society.
Part I is a dark mirror of Christiano’s 2013 “Why Might the Future Be Good?”: the whole economy “takes off”, and the question is how humane-aligned does the system remain before it gets competent enough to lock in its values. (“Why might the future” says “Mostly”, “What Failure Looks Like” pt. I says “Not”.)
When I first read this post, I didn’t feel like I “got” Part II, but now I think I do. (It’s the classic “treacherous turn”, but piecemeal across Society in different systems, rather than in a single seed superintelligence.)