Ben Garfinkel on scrutinising classic AI risk arguments
There are three features of the ‘old arguments’ in favour of AI safety, which Ben identifies here:
A discontinuity premise (e.g. “fast takeoff”)
A premise about the relationship between capabilities and objectives (e.g. “orthogonality thesis”)
A premise about the proportion of systems of a certain kind that are deadly (e.g. “instrumental convergence thesis”)
I argued in a previous post that the ‘discontinuity premise’ comes from taking too literally a high-level argument that should be used simply to establish that sufficiently capable AI will produce very fast progress:
The old recursive self-improvement argument, by giving a significant condition for fast growth that seems feasible (Human baseline AI), leads naturally to an investigation of what will happen in the course of reaching that fast growth regime. Christiano and other current notions of continuous takeoff are perfectly consistent with the counterfactual claim that, if an already superhuman ‘seed AI’ were dropped into a world empty of other AI, it would undergo recursive self-improvement.
This in itself, in conjunction with other basic philosophical claims like the orthogonality thesis, is sufficient to promote AI alignment to attention. Then, following on from that, we developed different models of how progress will look between now and AGI.
In other words, for AGI to appear and quickly achieve a decisive strategic advantage (DSA), we need what Ben calls both the sudden emergence of some highly capable AI and an explosive aftermath caused by that AI undergoing rapid capability gain. Most argument and attention has been on the latter, because it is often not recognised that both are required for a discontinuity. We can add that the reason this was not recognised is that the claim ‘powerful AI will accelerate progress by being able to create still more powerful AI’, taken at face value, seems to imply that this will occur in a specific AI at some specific time. In other words, the (true) conclusion of an abstract argument is assumed (incorrectly) to apply directly to the real world.
Having listened to the podcast, I now think that this mistake, of directly applying a (correct) abstract argument (incorrectly) to the real world, also applies to the other two ‘old’ AI safety arguments. So we shouldn’t say these arguments are incorrect: they fulfil their purpose of promoting AI risk to our attention, but they have naturally been replaced by more specific arguments as the field has grown.
Ben directly explains this in the case of the orthogonality thesis (the counterintuitive observation that goals and capability are logically independent). The orthogonality thesis does not imply that goals and capability will in fact be independent (the ‘process orthogonality thesis’), but it does raise the issue that goals and capability do not necessarily go together, and leads us to ask what happens if they do not.
If we combine the orthogonality thesis with other premises (for example: if progress is sufficiently fast, we won’t be in a position to precisely align AGI with our values, as long as it’s possible to create unaligned AGI slightly more easily than aligned AGI), then we have a concrete risk. But often such further arguments are not made, because the abstract orthogonality thesis is assumed to apply directly to the real world.
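The abstract thesis can be made vivid with a toy sketch (purely illustrative; the functions and names here are hypothetical, not a model of any real system). The same search procedure pursues whatever goal it is handed, and pursues opposite goals equally competently, so the capability component and the goal component really are separate:

```python
def plan(goal, state, actions, depth=3):
    """Exhaustive depth-limited search: return the action sequence
    that best satisfies `goal`, whatever `goal` happens to be."""
    if depth == 0:
        return [], goal(state)
    best_seq, best_score = [], goal(state)
    for a in actions:
        seq, score = plan(goal, a(state), actions, depth - 1)
        if score > best_score:
            best_seq, best_score = [a] + seq, score
    return best_seq, best_score

def inc(s): return s + 1
def dec(s): return s - 1
actions = [inc, dec]

# Identical search capability, opposite final goals:
seq_up, _ = plan(lambda s: s, 0, actions)     # goal: maximise the state
seq_down, _ = plan(lambda s: -s, 0, actions)  # goal: minimise the state
```

The planner is exactly as good at driving the state down as up; nothing in the search code constrains which objective it serves.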
And the same applies for ‘instrumental convergence’ - the observation that most possible goals, especially simple goals, imply a tendency to produce extreme outcomes when ruthlessly maximised:
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.
We could see this as marking out a potential danger—a large number of possible mind-designs produce very bad outcomes if implemented. The existence of such designs ‘weakly suggest[s]’ (Ben’s words) that AGI poses an existential risk, since we might build them. If we add in other premises that imply we are likely to (accidentally or deliberately) build such systems, the argument becomes stronger. But usually the classic arguments simply note instrumental convergence and assume we’re ‘shooting into the dark’ in the space of all possible minds, because they take the abstract statement about possible minds to be speaking directly about the physical world. There are specific reasons to think this might occur (e.g. mesa-optimisation, or sufficiently fast progress preventing us from course-correcting if there is even a small initial divergence), but those are the reasons that combine with instrumental convergence to produce a concrete risk, and they have to be argued for separately.
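The quoted observation about unconstrained variables can be reproduced in a few lines. In this hypothetical sketch (all names and numbers are illustrative), a naive hill-climber maximises an objective that rewards only x[0], but a shared budget constraint couples x[0] to x[1], a variable the objective says nothing about, so the optimiser drives x[1] to an extreme as a side effect:

```python
import random

def hill_climb(objective, feasible, start, iters=20000, seed=0):
    """Naive stochastic hill climbing: accept any feasible step
    that strictly improves the objective."""
    rng = random.Random(seed)
    best = list(start)
    for _ in range(iters):
        cand = [v + rng.gauss(0, 1.0) for v in best]
        if feasible(cand) and objective(cand) > objective(best):
            best = cand
    return best

# The objective depends only on x[0] ("output"); x[1] (say, resources
# left untouched) is something we care about but the objective ignores.
objective = lambda x: x[0]

# Shared budget: producing more output necessarily consumes the resource.
feasible = lambda x: 0 <= x[0] <= 10 and 0 <= x[1] <= 10 and x[0] + x[1] <= 10

best = hill_climb(objective, feasible, start=[2.0, 5.0])
# The optimiser pushes x[0] toward its maximum, forcing x[1] toward zero.
```

Nothing in the objective “wants” x[1] to hit an extreme value; it is driven there purely because the search ruthlessly exploits every trade-off that raises the rewarded variable.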
So to sum up, I think that there are correct forms of the three classic arguments for AI safety, but these are somewhat weaker than usually believed, and that each one of them has often been interpreted as applying too directly to the real development of AI, rather than as raising a possibility that might be actualised, given other assumptions:
1A. Powerful AI can be used to develop better AI (amongst other things). This will lead to runaway growth.
1B. The first AI that is able to develop better AI will experience explosive growth, gaining a decisive strategic advantage.
2A. Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal, so misaligned AI is possible.
2B. The process of imbuing a system with capabilities and the process of imbuing a system with goals are orthogonal, so misaligned AI is likely without a major course-correction.
3A. Most or all systems that behave like sufficiently effective maximizers of “simple” utility functions are dangerous, so misaligned AI is possible.
3B. Such dangerous maximizers are common in the set of AIs we are likely to build, so any small error will probably produce a dangerous AI.
There is a weak, abstract version and a strong, empirical version of each of the three claims. The strong version of each may be (partially) true, but only if we accept other premises. The mistake in the classic arguments is to confuse the weak and strong versions and so not require additional premises or evidence.
In this framing, my earlier post on discontinuous progress argued that 1A and 1B are distinct and that 1B requires other assumptions than just 1A to be true. Then, after listening to Ben’s podcast, I realized that the same is true for 2A and 2B, and 3A and 3B.
It is possible in principle that increasing AI capability and aligning AI go together, rendering the orthogonality thesis inapplicable to actual AI development, so that 2B is false (this is the goal of many safety approaches, like CIRL or IDA). But suppose this is not the case and there is a slight divergence (2B is partly true). If we do have to take active measures to keep aligning AI with our goals, that might not be possible if AI develops too rapidly at some point (1A), and then it may be the case that some of the systems we are likely to build are dangerous (3B).
So to sum up, the correct conclusion from the old arguments should be: powerful AI can be used to make better AI, so progress should eventually be fast; there is no necessary reason for goals and capability to go together, especially if progress is very fast; and it is possible in principle for power-seeking behaviour to arise if we don’t course-correct. This doesn’t establish anything to the high degree of certainty that proponents of these arguments aimed for, but it is sufficient to promote AI safety to attention.
Whenever something seems obvious in retrospect, I tend to wonder if it has already been realised but just not explicitly written down until recently. Problems with assuming RSI leads to an intelligence explosion in a singleton have been noted since at least 2018. Compare what Ben said in the podcast about how the speed of progress affects the level of risk to what Paul Christiano said in 2019:
If you expect progress to be quite gradual, if this is a real issue, people should notice that this is an issue well before the point where it’s catastrophic. We don’t have examples of this so far, but if it’s an issue, then it seems intuitively one should expect some indication of the interesting goal divergence or some indication of this interesting phenomenon of this new robustness of distribution shift failure before it’s at the point where things are totally out of hand. If that’s the case, then people presumably or hopefully won’t plough ahead creating systems that keep failing in this horrible, confusing way. We’ll also have plenty of warning you need, to work on solutions to it.
With fast enough takeoff, my expectations start to look more like the caricature—this post envisions reasonably broad deployment of AI, which becomes less and less likely as things get faster. I think the basic problems are still essentially the same though, just occurring within an AI lab rather than across the world.
The post I wrote about takeoff speeds summarised many other criticisms of the scenario going back to 2008. These all recognise that to get a brain-in-a-box scenario you need a discontinuity on top of the assumption that progress will eventually be fast (1B), not just the assumption that progress will eventually be fast (1A).
Issues with directly applying the orthogonality thesis or instrumental convergence to the systems we are in fact likely to build were noticed by Paul Christiano in 2019 and Stuart Russell in Human Compatible (2019):
Modern ML instantiates massive numbers of cognitive policies, and then further refines (and ultimately deploys) whatever policies perform well according to some training objective. If progress continues, eventually machine learning will probably produce systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals.
One reason to be scared is that a wide variety of goals could lead to influence-seeking behavior, while the “intended” goal of a system is a narrower target, so we might expect influence-seeking behavior to be more common in the broader landscape of “possible cognitive policies.”
One reason to be reassured is that we perform this search by gradually modifying successful policies, so we might obtain policies that are roughly doing the right thing at an early enough stage that “influence-seeking behavior” wouldn’t actually be sophisticated enough to yield good training performance...
Overall it seems very plausible to me that we’d encounter influence-seeking behavior “by default,” and possible (though less likely) that we’d get it almost all of the time even if we made a really concerted effort to bias the search towards “straightforwardly do what we want.”
This is clearly Paul explaining why the background principle of instrumental convergence might very well in fact apply to the systems we are likely to develop, with various plausibility considerations pointing towards or against the AIs we build actually undergoing instrumental convergence. It therefore implicitly recognises the distinction between 3A and 3B: the abstract principle that instrumental convergence is possible isn’t enough by itself to justify a risk. At the same time, ‘a wide variety of goals could lead to influence-seeking behavior’ is a statement about instrumental convergence, but it is given its proper place as a background plausibility consideration (3A) that might lead to actual instrumental convergence in the systems we are likely to build. This fits with my earlier claim that the initial arguments aren’t wrong but were just taken too far.
The first reason for optimism [about AI alignment] is that there are strong economic incentives to develop AI systems that defer to humans and gradually align themselves to user preferences and intentions. Such systems will be highly desirable: the range of behaviours they can exhibit is simply far greater than that of machines with fixed, known objectives...
This is clearly Stuart recognising that there is a difference between the ‘process orthogonality thesis’ and the actual orthogonality thesis—Stuart thinks that it is quite likely that success in increasing AI capabilities and alignment are correlated quite closely, so recognises the difference between 2A/2B.
In my post on discontinuities I wrote this:
I claim that the Bostrom/Yudkowsky argument for an intelligence explosion establishes a sufficient condition for very rapid growth, and the current disagreement is about what happens between now and that point. This should raise our confidence that some basic issues related to AI timelines are resolved. However, the fact that this claim, if true, has not been recognized and that discussion of these issues is still as fragmented as it is should be a cause for concern more generally.
I think this conclusion applies to the rest of the old arguments for AI safety. If I am right that rapid capability gain, the orthogonality thesis and instrumental convergence are good reasons to think AI might pose an existential risk, but were misinterpreted as applying too literally to reality, and also right that the ‘new’ arguments make use of these old arguments, then that should raise our confidence that some basic issues have been correctly dealt with. Ben suggests something like this in the podcast episode, but the discussion never got further into exactly what the similarities might be:
Ben Garfinkel: And so I think if you find yourself in a position like that, with regard to mathematical proof, it is reasonable to be like, “Well, okay. So like this exact argument isn’t necessarily getting the job done when it’s taken at face value”. But maybe I still see some of the intuitions behind the proof. Maybe I still think that, “Oh okay, you can actually like remove this assumption”. Maybe you actually don’t need it. Maybe we can swap this one out with another one. Maybe this gap can actually be filled in.
Ben Garfinkel: So I definitely don’t think that it’d be right in the context to say like, “Oh, I have qualms. I think there are holes. I think there are assumptions to disagree with, therefore the conclusion is wrong”. I think the main thing it implies though, is that we’re not really in a state where at least if you accept the objections I’ve raised, or really have good, tight, rigorous arguments for the conclusion that AI presents this large existential risk from a safety perspective.
I think this post has identified those ‘intuitions behind the proof’.