Collection of discussions of key cruxes related to AI safety/alignment
These are works that highlight disagreements, cruxes, debates, assumptions, etc. about the importance of AI safety/alignment, about which risks are most likely, about which strategies to prioritise, etc.
I’ve also included some works that attempt to clearly lay out a particular view in a way that could be particularly helpful for others trying to see where the cruxes are, even if the work itself doesn’t spend much time addressing alternative views. I’m not sure precisely where to draw the boundaries in order to make this collection maximally useful.
These are ordered from most to least recent.
I’ve put in bold the works that (very subjectively) seem to me especially worth reading.
General, or focused on technical work
Ben Garfinkel on scrutinising classic AI risk arguments − 80,000 Hours, 2020
Critical Review of ‘The Precipice’: A Reassessment of the Risks of AI and Pandemics—James Fodor, 2020; this received pushback from Rohin Shah, which resulted in a comment thread worth adding here in its own right
Fireside Chat: AI governance—Ben Garfinkel & Markus Anderljung, 2020
My personal cruxes for working on AI safety—Buck Shlegeris, 2020
What can the principal-agent literature tell us about AI risk? - Alexis Carlier & Tom Davidson, 2020
Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society—Carina Prunkl & Jess Whittlestone, 2020 (commentary here)
Interviews with Paul Christiano, Rohin Shah, Adam Gleave, and Robin Hanson—AI Impacts, 2019 (summaries and commentary here and here)
Brief summary of key disagreements in AI Risk—iarwain, 2019
A list of good heuristics that the case for AI x-risk fails—capybaralet, 2019
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More − 2019
Clarifying some key hypotheses in AI alignment—Ben Cottier & Rohin Shah, 2019
A shift in arguments for AI risk—Tom Sittler, 2019 (summary and discussion here)
The Main Sources of AI Risk? - Wei Dai & Daniel Kokotajlo, 2019
Current Work in AI Alignment—Paul Christiano, 2019 (key graph can be seen at 21:05)
What failure looks like—Paul Christiano, 2019 (critiques here and here; counter-critiques here; commentary here)
Disentangling arguments for the importance of AI safety—Richard Ngo, 2019
Reframing superintelligence—Eric Drexler, 2019 (I haven’t yet read this; maybe it should be in bold)
Prosaic AI alignment—Paul Christiano, 2018
How sure are we about this AI stuff? - Ben Garfinkel, 2018 (it’s been a while since I watched this; maybe it should be in bold)
AI Governance: A Research Agenda—Allan Dafoe, 2018
Some conceptual highlights from “Disjunctive Scenarios of Catastrophic AI Risk”—Kaj Sotala, 2018 (full paper here)
A model I use when making plans to reduce AI x-risk—Ben Pace, 2018
Interview series on risks from AI—Alexander Kruel (XiXiDu), 2011 (or 2011 onwards?)
Focused on takeoff speed/discontinuity/FOOM specifically
Discontinuous progress in history: an update—Katja Grace, 2020 (also some more comments here)
My current framework for thinking about AGI timelines (and the subsequent posts in the series) - zhukeepa, 2020
What are the best arguments that AGI is on the horizon? - various authors, 2020
The AI Timelines Scam—jessicat, 2019 (I also recommend reading Scott Alexander’s comment there)
Double Cruxing the AI Foom debate—agilecaveman, 2018
Quick Nate/Eliezer comments on discontinuity − 2018
Arguments about fast takeoff—Paul Christiano, 2018
Likelihood of discontinuous progress around the development of AGI—AI Impacts, 2018
The Hanson-Yudkowsky AI-Foom Debate—various works from 2008-2013
Focused on governance/strategy work
My Updating Thoughts on AI policy—Ben Pace, 2020
Some cruxes on impactful alternatives to AI policy work—Richard Ngo, 2018
Somewhat less relevant
A small portion of the answers here − 2020
I intend to add to this list over time. If you know of other relevant work, please mention it in a comment.
I agree that it’s valuable to note that information hazards can sometimes hurt the person who gets the information. And I agree that Bostrom’s sense of information hazards is definitely broader than just that, so if people are using “infohazards” to mean only information that specifically harms the person who knows it, then clearing up their confusion seems good.
But I don’t know if “memetic hazards” is a great term for that, because it seems most natural to use the label “memetic hazards” for a superset of information hazards, not a subset. “Memes” are ideas or units of culture, of which true information is just one type. So it seems most natural to use the term “memetic hazards” for something like “harms that result from ideas” (or perhaps “ideas that spread”, or “ideas that evolve”), rather than just from true information, and rather than just harms for the knower (or just for the holder of the idea).
I think the fact that “memetic hazards” is already used in some places in the way you propose is one reason to accept the term anyway. But I’m not sure it’s a strong enough reason, given 1) how unintuitive the term seems to be for what we want it to capture, and 2) the fact that the term seems intuitive for a separate concept that would also be worth talking about (so perhaps we should hesitate to use up the term for something else). And it seems somewhat hard to come up with alternative terms for that separate concept—in particular, “idea hazards” is already used in a different way by Bostrom, so that’s not a good candidate.
In fact, “meme hazards” has already been used in roughly the way I suggest above, and I’m currently helping revamp the ideas in the post that uses that term, and was hoping to use the term “memetic hazards” for that purpose. (And this was going to be published this week, ironically enough—we’ve been scooped!) We did notice that the term “memetic hazards” was already used in the way you suggest, but thought that that use was sufficiently non-mainstream and non-intuitive that it might make sense to stick with our proposed usage.
I don’t have great ideas for an alternative term for the concept you wish to point to, but perhaps something in the direction of “knower-harming infohazards”, “self-affecting infohazards”, or “internalised infohazards”?