TL;DR: Are there any works similar to Wei_Dai’s AI Safety “Success Stories” that provide a framework for thinking about the landscape of possible success stories & pathways humanity might take to survive misaligned AI?
I’ve been trying to think of systematic ways of assessing non-technical proposals for improving humanity’s odds of surviving misaligned AI.
Aside from the numerous frameworks for assessing technical alignment proposals, I haven’t seen many resources on non-technical proposals that provide a concrete framework for thinking about the question: “What technological/geopolitical/societal pathway will our civilization most likely take (or should ideally take) in order to survive AI?”
Having such a framework seems pretty valuable, since it would let us think about the exact alignment pathway & context in which [proposals that want to help with alignment] would be effective.
For example, a pretty clear dimension along which people’s opinions differ is the necessity of pivotal acts, i.e. “pivotal act vs gradual steering” (somewhat oversimplified)—here, any proposal’s theory of impact will necessarily depend on the proposer’s beliefs regarding (a) which position on the spectrum currently appears most likely by default, and (b) which position on the spectrum we should be aiming for.
If, say, my pessimistic containment strategy were about communicating AI risk to capabilities researchers in order to promote cooperation between AI labs, it would be incoherent for me to, at the same time, be ultra-pessimistic about humanity’s chances of enacting any cooperative regulation in the future.
Or if I thought a Pivotal Act was the best option humanity has, and wanted to suggest some proposal that would be a force-multiplier if that line of strategy does happen in the future, it would make sense for my proposal to consider the form this unilateralist org’s AI will take:
Where will it be developed?
Will it be a corrigible AI whose safety features depend on human operators?
Will it be a CEV-type AI whose safety features won’t depend on humans?
How likely is it that the first AI capable of enacting a Pivotal Act will need to rely on human infrastructure—for how long—and could interventions help?
I’ve seen a lot of similar frameworks for technical alignment proposals, but not much for the pathways our civilization will actually take to survive (Wei_Dai’s post is similar, but is mostly about the form the AI will end up taking, without discussing the pathways by which we’ll arrive at that outcome).
Any resources I might be missing? (if there aren’t any, I might write one)
You might find AI Safety Endgame Stories helpful—I wrote it last week to try to answer this exact question, covering a broad array of (mostly non-pivotal-act) success stories from technical and non-technical interventions.
Nate’s “how various plans miss the hard bits of the alignment challenge” might also be helpful as it communicates the “dynamics of doom” that success stories have to fight against.
One thing I would love is to have a categorization of safety stories by claims about the world. E.g. what does successful intervention look like in worlds where one or more of the following claims hold:
No serious global treaties on AI ever get signed.
Deceptive alignment turns out not to be a problem.
Mechanistic interpretability becomes impractical for large enough models.
CAIS turns out to be right, and AI agents simply aren’t economically competitive.
Multi-agent training becomes the dominant paradigm for AI.
Due to a hardware / software / talent bottleneck there turns out to be one clear AI capabilities leader with nobody else even close.
These all seem like plausible worlds to me, and it would be great if we had more clarity about what worlds different interventions are optimizing for. Ideally we should have bets across all the plausible worlds in which intervention is tractable, and I think that’s currently far from being true.
Thanks, I found your post very helpful, and I think this community would benefit from more posts like it.
I agree that we would need a clear categorization. Ideally, it would give us a way to explicitly quantify/make legible the claims of various proposals, e.g. “my proposal, under these assumptions about the world, may give us X years of time, changes the world in these ways, and interacts with proposals A, B, and C in these ways.”
The lack of such a categorization is perhaps one reason why I feel the pivotal act framing is still necessary. It seems to me that while proposals closer to the “gradual steering” end of the spectrum (e.g. regulation, culture change, AI lab communication) are usually aimed at buying humanity a couple more months/years of extra time, they fail to make legible claims as above, and yet (I might be wrong) proceed to implicitly claim, “therefore, if we do a lot of these, we’re safe—even without any pivotal acts!”
(Of course, pivotal acts aren’t guilt-free and many of their details are hand-wavy, but their claims of impact & assumptions about the world seem pretty straightforward. Are there non-pivotal-act proposals like that?)
I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:
Both aim to eventually end up in a state of existential security, where nobody can ever build an unaligned AI that destroys the world. Both have to deal with the fact that power is currently broadly distributed in the world, so most plausible stories in which we end up with existential security will involve the actions of thousands if not millions of people, distributed over decades or even centuries.
Pivotal acts have stronger claims of impact, but generally have weaker claims of the sign of that impact—actually realistic pivotal-seeming acts like “unilaterally deploy a friendly-seeming AI singleton” or “institute a stable global totalitarianism” are extremely, existentially dangerous. If someone identifies a pivotal-seeming act that is actually robustly positive, I’ll be the first to sign on.
In contrast, gradual steering proposals like “improve AI lab communication” or “improve interpretability” have weaker claims to impact, but stronger claims to being net positive across many possible worlds, and are much less subject to multi-agent problems like races and the unilateralist’s curse.
True, complete existential safety probably requires some measure of “solving politics” and locking in current human values, and hence may not be desirable. What if the Long Reflection decides that the negative utilitarians are right and the world should in fact be destroyed? I don’t put high credence on that, but there is some level of accidental existential risk that we should be willing to accept in order to not lock in our values.
Is it even possible for a non-pivotal act to ever achieve existential security? Even if we maxed out AI lab communication and had awesome interpretability, that doesn’t help in the long run, given that the minimum amount of resources required to build a misaligned AGI will probably keep dropping.
Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors are controlling 90% of AI-relevant compute then it seems plausible that they could defend against 10% of the compute being controlled by misaligned AGI or other bad actors—by denying them resources, by hardening core infrastructure, via MAD, etc.
The exact model the AI adopts seems to be confounding my picture when I try to imagine what an “existentially secure” world looks like. I’m currently thinking there are two possible existentially secure worlds:
The obvious one is where all human dependence is removed from setting/modifying the AI’s value system (like CEV, fully value-aligned)—this would look much more unipolar.
The alternative is for the well-intentioned-and-coordinated group to use a corrigible AI that is aligned with its human instructor. To me, whether this scenario looks existentially secure probably depends on whether small differences in capability can magnify into great power differences. If false, it would be much easier for capable groups to defect and make their own corrigible AIs push agendas that may not be in humanity’s interest (hence not so existentially secure). If true, then the world would again be more unipolar—and its existential security would depend on how value-aligned the humans operating the corrigible AI are (I’m guessing this is your offense-defense balance example?).
So it seems to me that the ideal endgame is for humanity to end up with a value-aligned AI, either by starting with one or by somehow going through the “dangerous period” of multipolar corrigible AIs and then transitioning to a value-aligned one. Possible pathways (non-exhaustive).
I’m not sure whether this is a good framing at all (it probably isn’t), but simply counting the number of dependencies (without considering how plausible each dependency is), it seems to me that humanity’s chances would be better with a unipolar takeover scenario—either using a value-aligned AI from the start or transitioning into one after a pivotal act.