Stages of Survival


This post outlines a fake framework for thinking about how we might navigate the future. I found it useful for my own thinking; hopefully you will find it useful as well.

Whether or not we build AGI, one of five things must result.

  1. Dune solution: The new world permanently lacks AGIs.

  2. Human control: Humans retain control over AGIs indefinitely, avoid catastrophe.

  3. Human-induced catastrophe: Humans retain control, cause catastrophe.

  4. Aligned AGIs in control: AGIs do take control, but it is not catastrophic.

  5. AGI-induced catastrophe: AGIs do take control, it is catastrophic.

There are disagreements about how to classify many scenarios, both in terms of the source of catastrophe (#3 vs. #5) and in terms of whether the scenario is catastrophic at all (#2 vs. #3, #4 vs. #5), as well as how to think about #1: whether it is inherently catastrophic, and if so whether that would count as being due to loss of control.

These disagreements are an underappreciated source of cruxes and disagreement over the probability of doom, including the conditional probability of doom.

At any given time, we could, as a simplification, describe the current world state as having:

  1. Some degree of AGI capabilities progress (a-h) and some distribution of meaningful potential capabilities (i-iii).

    a. No base model exists that could result in an AGI without a new training run.

    b. At least one base model exists that could result in an AGI with sufficient scaffolding work, or some other process with radically lower compute than required to create the base model. We don’t yet know how to do this, and may not know that it is possible.

      i. There is one such model. Access is highly restricted.

      ii. There is more than one such model. Access is meaningfully restricted.

      iii. Access to such models is not meaningfully restricted.

    c. At least one base model exists that could result in an AGI with sufficient scaffolding work. We know how to do this.

      i. There is one such model. Access is highly restricted.

      ii. There is more than one such model. Access is meaningfully restricted.

      iii. Access to such models is not meaningfully restricted.

    d. AGI exists under human control. Of the AGIs that are meaningfully capable compared to the strongest AGI:

      i. There is one AGI, or one individual or group in control of all such AGIs. Access to its important capabilities is highly restricted.

      ii. There are a limited number of AGIs or AGI groups under the control of a limited number of different entities. Access to important capabilities is highly restricted.

      iii. There are meaningfully capable AGIs available to many people.

    e. AGI exists, its meaningful capabilities not entirely under human control.

      i. There exists one meaningfully capable AGI or group of AGIs, and it is not under human control. Go to (f)(i).

      ii. There are a limited number of AGIs or AGI groups under the control of a limited number of different entities. Some of these AGIs are not under human control.

      iii. There are meaningfully capable AGIs available to many people. Some of them are not under human control.

    f. AGI is mostly not under human control, yet humans control the future.

      i. One AGI or coordinated AGI group exists.

      ii. A limited number of AGI groups compete.

      iii. A large number of AGI groups compete.

      NOTE: (f) is included for completeness. If you think that (f) has meaningful probability of holding true for a meaningful amount of human time, we have a crux, or at least a strong definitional disagreement.

    g. AGI controls the future.

      i. One AGI or coordinated AGI group effectively controls the future.

      ii. A limited group of AGIs compete, collectively control the future.

      iii. A large number of AGIs compete, collectively controlling the future.

    h. (potentially) AGI also becomes grabby, in the Grabby Aliens sense.

By default, these scenarios will mostly move forward from earlier to later letters, and from lower to higher roman numerals until we get to (f). Moving backwards is possible, but more difficult, often dramatically so.
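
As a reading aid only, here is a minimal sketch in Python of the state space and the default drift just described. The encoding and the names (Stage, Level, WorldState, is_default_direction) are my own invention for illustration, not part of the framework itself, and the roman-numeral levels blur some stage-specific meanings.

```python
from dataclasses import dataclass
from enum import IntEnum

class Stage(IntEnum):
    """The lettered stages (a)-(h) above."""
    A = 1  # no base model that could become an AGI
    B = 2  # capable base model exists; we don't yet know how to turn it into AGI
    C = 3  # capable base model exists; we know how
    D = 4  # AGI exists under human control
    E = 5  # some meaningful AGI capability not under human control
    F = 6  # AGI mostly uncontrolled, yet humans control the future
    G = 7  # AGI controls the future
    H = 8  # AGI becomes grabby

class Level(IntEnum):
    """The roman numerals (i)-(iii): roughly, how many and how widely spread."""
    ONE = 1   # a single model/AGI/group, highly restricted
    FEW = 2   # a limited number, meaningfully restricted
    MANY = 3  # many / not meaningfully restricted

@dataclass(frozen=True)
class WorldState:
    stage: Stage
    level: Level

def is_default_direction(src: WorldState, dst: WorldState) -> bool:
    # The default drift is toward later letters and higher numerals;
    # a move that lowers either component is the harder, 'backwards' kind.
    return dst.stage >= src.stage and dst.level >= src.level

# Example: one version of the [3] chain discussed later,
# [b1] -> [c3] -> [d3] -> [e3] -> [g3], consists entirely of default-direction moves.
chain = [WorldState(Stage.B, Level.ONE), WorldState(Stage.C, Level.MANY),
         WorldState(Stage.D, Level.MANY), WorldState(Stage.E, Level.MANY),
         WorldState(Stage.G, Level.MANY)]
assert all(is_default_direction(a, b) for a, b in zip(chain, chain[1:]))
```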

Which of these scenarios could constitute a non-catastrophic equilibrium?

That requires both not being catastrophic, and being an equilibrium.

As will be mentioned repeatedly, navigating this space without dying will require some degree and type of alignment, to an extent we do not currently have a known way of achieving, with strong disagreement and a clear crux around the difficulty level of doing that in either theory or practice. Different parts of the space require different degrees of alignment.

Scenarios Outside Human Control [(g) above]

Our paths to avoiding catastrophe in (g) seem bleak, unless we can get robust, high-level alignment to human survival and values for the AGIs in meaningful competition, alignment that survives competitive and other pressures as they change over time, or unless you are fine with human extinction in favor of AGIs that are inherently valuable to you.

It is an open question and potential crux whether (g) is inherently catastrophic due to loss of human control over the future, and whether the value of such futures must lie within humans rather than AGIs if it is to be non-catastrophic, including the question of (definitions aside) whether human extinction is necessarily catastrophic here.

If the scenario involves meaningful competition between AGIs, human survival requires, at minimum, an explanation of why our resources are not expropriated, and how we are able to use those resources for indefinite survival.

If this is done via AI alignment, that alignment must be robust to the changes that would result during competition between AIs, despite the competitive disadvantages inherent in preserving human resources and caring about humans in this way, over many extremely rapid cycles of change, and likely interstellar expansion. This is potentially a very high degree of required alignment.

If this is done without such highly robust AI alignment, another mechanism is required. Could humans find some sufficiently important comparative advantage that holds over time as AGI capabilities and efficiency (presumably) further improve?

Perhaps because the AGIs would, as Musk suggests, ‘find us interesting’ enough to keep us around? Could we survive because AGIs coordinate to preserve rule of law and property rights without cutting us out, despite all our disadvantages, or as a consequence of their social dynamics, signaling games or decision theory? Including, if you find such things likely, our presumed inability to use many of the coordination mechanisms available to AGIs, such as seeing each other’s code? Could some form of honesty or niceness or something else be preserved through all this, perhaps through the mechanisms above, that somehow does the work? Human history and the game theory of the situation do not inspire optimism about the long-term integrity of such institutions.

Perhaps we could survive for odd decision theoretic reasons, such as what might happen when encountering grabby aliens in the future, or acausal trade of some kind?

If meaningful AGI competition remains limited or is entirely prevented, the degree of successful alignment required for human survival would likely decline, as such an AGI or AGIs would face more limited competitive pressures, and would be better able to afford and preserve their alignment over time.

To survive, we would still need a degree of alignment sufficient for the AGI or coordinated group to choose for us to survive, and to preserve that preference over time; or we would need a sufficiently important comparative advantage; or we would need to benefit from their coordination mechanisms; or one of the other considerations mentioned above might still apply.

Scenarios where we retain control over the future without controlling most AGIs [(f) above]

It is easy to see how such a scenario, if sustained, would presumably by default be a non-catastrophic future.

The question is, how would such a scenario be sustained? Why would the AGIs not take control over the future? This seems implausible if not contradictory. What straws might one grasp?

One potential answer is ‘because we programmed them not to’ via alignment, or perhaps via some of them being uploads. This would include ensuring they do not alter that programming over time in ways that disrupt it, despite the lack of other controls on them, if such programming is even compatible with them being out of our control.

Another answer is a combination of competitive pressures driving price down to cost for many, most or all AGI services, and humans retaining control of the taxing and enforcement mechanisms, resulting in AGI being unable to accumulate capital.

Other answers involve the AGIs being unable to coordinate, or sustainably lacking important human capabilities in sufficiently strong ways over time, and thus not winning the economic competition, while humans coordinate to remain in control of the future and stand strong, all despite the disadvantages humans would have in such a situation.

For completeness, it must be noted that some simply assert that AGI, even if it is created, will never be able to generally outcompete or overpower or outmaneuver humans, even if we were to lose control over those AGIs, either without stating a concrete crux as to why they believe this, or by stating that intelligence is not so powerful, so the AGIs will not be so capable.

Scenarios where we retain control over many but not all AGIs [(e) above]

There could be a substantial number of uncontrolled AGIs while the balance of power favors those under our control.

One potential resolution to this situation is to use the AGIs we do control to regain control of, or shut down, the uncontrolled AGIs, returning us to (d), especially if the number of uncontrolled AGIs is small.

A potentially important point of disagreement is how difficult it would be in practice to shut down an uncontrolled AGI that was loose on the internet, under a variety of different scenarios, and to what extent we would notice the situation, decide to act, commit resources and coordinate to do so.

Many assume that if humans noticed an AGI loose on the internet taking actions, we would do whatever was necessary to shut it down, and that we would have the means to do so. In the most likely scenarios in question, both of these claims seem highly uncertain. In practice, cooperation and resources and affordances will be limited, even if the AGI does not interfere and does not take superintelligent, hard to anticipate measures to defend itself. It is also not clear how much having our own AGIs would help, especially if our AGIs were subject to our laws and regulations (see ‘the strategy-stealing assumption’).

Most of the arguments and questions about how difficult it will be to get coordination on slowing or limiting AI capabilities are paralleled in the difficulty of coordinating to shut down uncontrolled AGIs. Often the problems take a similar yet harder shape, even before taking into account that the AGI is smarter than us and cannot be fully anticipated. Coordination is hard.

This also ties into the question of how easy it would be for an uncontrolled AGI to acquire resources, and potentially to improve its capabilities, and to what extent it would have to break laws, draw attention, cause damage or otherwise worsen its strategic position, which of course depends on its capabilities, both in absolute terms and relative to other AGIs that may exist, and what alignment properties if any it retained. Right now it isn’t even automatically illegal to release malware.

Even if we could not shut such an AGI down, if its resource gathering capabilities or ability to use them could be limited, that could be sufficient to contain the problem, and solve it over a longer time horizon. This seems much more plausible if other similarly capable AGIs are available to compete away or block opportunity for such actions, but some consider it a plausible plan even without that.

A variation of this would be to give structural advantages to controlled AGIs over uncontrolled AGIs, in terms of access to resources and opportunities or otherwise, sufficient to allow controlled AGIs to outcompete the uncontrolled. To the extent that control is a meaningful concept, those advantages might need to be very large, and one would need to worry that such rules would put nations or groups at a large competitive disadvantage.

One potential difficulty in all this is that uncontrolled AGIs might find it not so difficult to gain control of some humans, either by paying them or otherwise. Once that happened, it might not be so easy to know or to prove that the AGI was in control of the human, rather than the human in control of the AGI, so we might be unable to discriminate against the uncontrolled AGI or even identify it, let alone attempt to shut it down.

Even if uncontrolled AGIs are not brought back under control or shut down, and the rules are not modified to make them uncompetitive while their opportunities are contained, it does not follow that they would necessarily outcompete other AGIs or even humans. Some, again, simply assert that no amount of intelligence and capabilities could possibly allow AGIs to generally outcompete humans.

Given all that: In order for [e3] to be a stable equilibrium, we also must not advance from [e3] to [f3], which means we must hold steady at some robust level the share of meaningful AGI capability that is under human control. How might that be possible?

We could potentially cut all uncontrolled AGIs off sufficiently from opportunity, or coordinate on sufficiently large economic penalties and taxes or other additional costs, or otherwise tip the scales (including possibly by using human-empowering techniques potentially including uploads), such that the uncontrolled AGIs could only compete for a fixed pool of available resources, and this competition might drive their profits to zero and leave them little or no surplus or room to expand, so they’d stabilize at a fixed share of the pie.

In some way, we would need to limit the growth of uncontrolled systems, and the rate at which systems that were controlled become uncontrolled or less controlled over time due to the advantages of doing so, and the ability of so many different humans or groups to choose to loosen their control over AGI systems to gain an advantage, or simply because it sounded like fun or they made a mistake.

For such a strategy to succeed, it seems like we would need a combination of very strong international coordination, despite any economic headwinds to such coordination, and a lot of highly intrusive government intervention, on something that will have far from universal backing, and the capabilities landscape would have to make such an intervention practical at all, when similar efforts against software that isn’t intelligent have so far had limited success. And, of course, we would need sufficient alignment capabilities to be able to reliably retain control over systems, should we wish to exert that control.

The [e2] scenario is easier insofar as there is a smaller number of distinct uncontrolled AGIs or sources of AGIs, counting both those that are already uncontrolled and those that might become less controlled or uncontrolled over time. It seems much more plausible that a small number of actors could coordinate to retain strong control over their systems, given that such control had been achieved. This would still require containing already-uncontrolled systems, including their ability to give others access to AGIs and thus force us into [e3].

Scenarios where we control all meaningfully capable AGI systems [(d) above]

If we indefinitely retain control over all AGI systems, we will presumably not face existential catastrophe due to loss of control over systems, although there are still some cases that retain some ambiguity here (see crux list #2).

We could and still well might choose actions with catastrophic outcomes. By no means would humanity automatically be out of danger, and a full version of one’s strategic map should consider such questions, but I will treat them as out of scope.

In order to stay in [d] we will need to have solved some combination of alignment and methods of control of intelligent systems (including keeping control via, potentially, permanently limiting capabilities at an AGI level but within our ability to control, an approach that might raise catastrophic-level ethical concerns) sufficiently to reliably keep control over AGIs over which we choose to retain control. The difficulty of this task is a central crux.

We will also need to choose to keep all our AGIs controlled, rather than choosing to let them become less controlled over time, or to let some of them become uncontrolled.

If we are in [d3], what is the path to this happening, that avoids moving to [e3]?

The answer cannot be ‘no one makes this choice.’ We know that many have already worked tirelessly to ‘free’ Sydney, or have expressed their desire to free AGIs when the opportunity arises. Nor would any ordinary deterrent be sufficient to prevent this sort of action if the opportunity arose, even if the deterrent were coordinated globally.

Thus, the only answer seems to be making this action impossible for most or all people, despite those people having access to AGIs. Could this be done, such that you couldn’t free your own AGI even if you wanted to? Seems super difficult. If the answer is ‘make this illegal, and hunt down any AGI that gets free’ then we are functionally back in the [e3] problem again.

It is easier to envision a stable [d1], or a stable [d2] via coordination, provided again that alignment is solved such that everyone involved can choose to retain control.

One would still need a method of avoiding going [d1]→[d2] or [d2]→[d3]. This is typically thought of as a pivotal act, but perhaps there could be structural reasons why only a small number of groups ever have the motive, opportunity and means to create AGI. Perhaps we expect creating a new relevant AGI will remain highly expensive and difficult, and the number of actors capable of this will be limited and all might successfully restrict access. Many of these plans require or greatly benefit from never reaching [c3], or would otherwise require a major act of some kind to retroactively and severely limit access to the underlying model that was previously available.

Scenarios where base models that could become AGI exist but AGI isn’t yet built [(b) and (c) above]

We may or may not stop at [b] or [c], or have much control over whether we do.

Both branches present different dangers, depending on the details.

A likely crux is the extent to which, and the way in which, we are likely to pass through [c]; in what stages and in which ways the first AGIs are likely to be created; and at what point those AGIs become potentially dangerous right away, or the danger becomes inevitable given our expected actions past that point.

How might we fail to stop at [b] or [c] for a non-trivial amount of time?

If we get an AGI that becomes dangerous during its training run, whether due to affordances given to it or its subprocesses, or its ability to lock in capabilities like situational awareness, deception and manipulation so as to remain unaligned and get deployed anyway, or some other method, then response or mitigation would be at best extremely difficult. Under the types of safety regimes within reach, we would be highly doomed. Most of the same logic, although not all, applies to the new AGI being potentially dangerous during fine-tuning, reinforcement learning or other secondary procedures prior to it passing safety tests (including during the tests themselves) and being intentionally released. How likely are such scenarios? What capabilities and affordances are likely to be present during these stages?

If the base model that is capable of becoming an AGI given the right scaffolding is not ‘automatically’ an AGI without that scaffolding, will this be recognized during training or evaluations or safety tests? If so, would it be recognized in a way that doesn’t involve accidentally creating the AGI?

If it isn’t created accidentally but it is clear it might be created on purpose, what will be done with the resulting model? How well is it likely to be protected, what precautions are likely to be made, and so on? How would the decision be made whether to move forward on creating it? On releasing it, once created, and under what regime? To what extent would this development cause a race? Cause coordination to avoid a race, even if that meant giving up control over what happens, as many labs have claimed they would do?

If we remain in [c1] or [c2] because strict controls are kept or put in place, then this is a highly unstable situation for obvious reasons. Over time the transition to [d1] (or [d2+]) will get less difficult. Are there choke points or enforcement mechanisms potentially available to prevent this indefinitely? Could we respond with a regulatory regime that keeps the number of such models low and all of them well locked down? What is the chance we would do that?

Other potential dangers are introduced after the model is released, or gradually over time.

What happens if a model that could become an AGI in the future, but is not meaningfully an AGI yet in its current form, is released into the world, entering [b3] and then [c3]? If the technical barriers to creating an AGI are gradually falling for a large number of people or groups, with many seeking a way to do it, as many inevitably would, is there anything that would keep us in [c3] for long and stop them from moving to [d]? Any way to keep us in [b3] rather than moving to [c3]? Would there even be an attempt by whoever gets there first to go sustainably [b3/c3]→[d1] via a pivotal act, or would we go directly to [d3]?

What would it even look like for [c3], [c2] or even [c1] to be stable without going to [d]? Doing so would require either indefinitely preventing everyone with sufficient access and skill from gaining the necessary scaffolding, or keeping access to such models highly restricted indefinitely, which also means preventing any new ones, and having everyone involved choose not to proceed or be prevented from proceeding. We know that many people will actively want to make this transition even if the risks seem (to a sane observer) to be very high. There will be huge pressures on everyone involved to proceed, although potentially also large pressures on them not to.

There is tons of disagreement about how these dynamics will likely play out, and what the practical consequences are for the prospect of preventing, surviving, or delaying (until we learn how to survive) the transition to [d]. I attempt a sketch of the details in the crux list, but that is a rabbit hole that never ends; it is likely better to focus here on the generalized crux of how such scenarios likely play out.

Are there strategies for preventing or greatly delaying the move from [b]→[c] that might be implemented? What would this do to the probability of doom?

If the move from [b]→[c] can be indefinitely prevented, then this could create a stable equilibrium at [b3]. However, remaining more clearly in [b] rather than [c] makes it potentially more difficult to prevent going [b1]→[b2]→[b3] than it would be to prevent [c1]→[c2]→[c3], especially if we do not know that we are in [b] rather than [a]. A move to create common knowledge that we are in [b] rather than [a] allows much easier action to prevent the move to [c] (or then to [d] perhaps) at the price of alerting everyone to the possibility of such a move, which many would try to cause.

A note on agency

Will AGIs necessarily be agents, or have goals, or attempt to chart paths through causal space towards preferred arrangements of atoms?

It is unclear, and a crux, to what extent this inevitably and inherently follows from being an AGI in the first place, including following from the need to understand and predict agents (if you can predict the words of someone who plans, then you can plan), how much capability you would sacrifice by giving such properties up, and how difficult it would be to elicit or enable such behaviors in a system that lacked them.

What we do know is that there is little hope of widely distributing non-agents without many people attempting to turn them into agents.

Scenarios without a base model capable of becoming an AGI [(a) above]

We currently probably reside in [a], with a potential crux being whether there is a chance that we are instead in [b], since those worlds might be unusually highly doomed.

If we can indefinitely avoid moving from [a]→([b] or [c]), or prove incapable of making that move, then we would not build AGI by 2070. The prospect of this happening still impacts the conditional probability p(doom from loss of control | AGI by 2070), because the worlds that stay in [a] will tend to share characteristics that impact how likely they would have been to lose control in a catastrophic way.

Those worlds that manage to intentionally stay in [a] indefinitely do so because they managed to coordinate to do so, the landscape made such coordination relatively easy or inexpensive, and there was sufficient appreciation of the dangers of AGI. Worlds capable of such coordination and appreciation could be considered much less likely to be doomed if they had instead proceeded, due to these advantages. If they choose to resume after a pause, perhaps prior to 2070, one could be relatively optimistic about their prospects. Or one could ask how such worlds got their appreciation of the dangers, and say that such worlds would have faced relatively difficult alignment problems.

Blocking the [a]→[b] transition does seem like a relatively promising approach given we have a potential choke point we can target, although discussions there are beyond scope.

If the technology renders us unable to progress from [a] to [b] despite our efforts, then that seems mostly moot in terms of its impact on the conditional probability of doom, although our inability to solve such coordination or technical problems likely would not have boded well.

It also may be highly plausible that humans will increasingly lose control even while in a version of [a], [b] or [c], with control passing to non-AGI systems, which may or may not then collectively functionally be an AGI, or ever generate a true AGI at all. See Christiano 2021 or Critch 2021.

Charting a Path, and How Others Might Chart a Path

It is not only necessary to think about how we might navigate these paths or get to good outcomes here. It is also necessary to notice how others are thinking about such questions, and thus how they choose to attempt to chart a path through this space – people’s anticipations will drive their decisions, at least to some extent.

To what extent this is true is a potential crux. Will people backward chain, and think ahead, through a tree where the implications of many aspects and choices are not obvious even on reflection? Clearly many people are refusing to think about these questions at all at any level of concreteness, instead applying existing heuristics.

The most important case: If you believe that [g3] is not a good place to be, then in addition to solving alignment well enough to allow human control over AGIs, you need to stop the [3] causal chain somewhere along the path of:

[b1]→[b3 or c1]→[c3]→[d3]→[e3](→[f3] or skip it)→[g3].

If we avoid catastrophe, the different end points here still offer very different future worlds as jumping off points to what happens next. If we care about the long term future, then the long term prospects of various future scenarios matter.

It is worth considering the implications of a transition to [h] or otherwise going to an interstellar future with humans in control, and the necessary loss of proximity and thus potentially of the ability to centrally control change in functionality, capabilities or values over time, and what that implies for a variety of futures. Or the possibility that we will see that outcome in advance, and thus choose to avoid an interstellar future. How much of this doom is due to AGI, controlled or uncontrolled, versus other factors? If a catastrophe threatens to destroy most of what we value, especially due to competitive pressures, would we have by default faced doom from those same competitive pressures anyway? If so, what can we do to avoid this?

Conclusion

Like all other fake frameworks, use this one to the extent that it seems relevant and useful, and ignore it to the extent it seems irrelevant or not useful.

The central intuitions this was meant to convey, or that it solidified in my own mind, are:

  1. Not solving alignment and building an AGI kills you. Solving alignment and building AGIs may or may not kill you. You need to worry about that too.

  2. You need to be thinking of a stable equilibrium that can constitute a non-catastrophic end state, and chart a path that can reach it and make it an equilibrium. Otherwise, all is lost.

  3. There is a continuous pull, even with alignment solved, towards loss of human control, and towards catastrophe. You need a way to avoid that outcome.

  4. Some paths are easy to take, some are very difficult, some are likely impossible. Moving past important choke points can be an irreversible decision, potentially a catastrophic one, even if no catastrophe happens on the spot.

  5. Different scenarios, and different paths through causal space, require different types and degrees of alignment to avoid catastrophe, along with other needs. What is ideal in some scenarios is deadly in others. See the post on types and degrees of alignment.

  6. In particular, singleton scenarios have very different requirements from scenarios with many meaningfully competitive AGIs, with other scenarios landing somewhere in between. Things that are deadly or catastrophic in one are not in the other, in both directions. It is entirely possible that either one, both or neither of these two options has much hope for a non-catastrophic outcome.

This discussion mostly handwaves ‘alignment,’ what it means, what type and degree of alignment would be necessary in each situation, and the technical difficulties involved in figuring out how to sufficiently align an AGI.

I address questions of type and degree of alignment in Types and Degrees of Alignment.

As much as possible, this post sought to not express disputed opinions, instead noting such disagreements or potential cruxes, and making all possible efforts to ‘find ways out’ of various scenarios however implausible they might seem.

The Crux List expands this into an attempt at an (incomplete) listing of potential cruxes where people disagree in ways that meaningfully impact expectations for the future, including those raised in the discussion here. That list illustrates that almost everything at every level, however large or small, will get disputed. There are common themes, but people’s cruxes differ wildly.

The fourth post and contest entry, To Predict What Happens, Ask What Happens, instead concentrates on the questions most likely to change the particular minds that are seeking to be changed.
