Whenever something seems obvious in retrospect, I tend to wonder whether it had already been realised but just not explicitly written down until recently. Problems with assuming that recursive self-improvement (RSI) leads to an intelligence explosion in a singleton have been noted since at least 2018. Compare what Ben said in the podcast about how the speed of progress affects the level of risk with what Paul Christiano said in 2019:
If you expect progress to be quite gradual, if this is a real issue, people should notice that this is an issue well before the point where it’s catastrophic. We don’t have examples of this so far, but if it’s an issue, then it seems intuitively one should expect some indication of the interesting goal divergence or some indication of this interesting phenomenon of this new robustness of distribution shift failure before it’s at the point where things are totally out of hand. If that’s the case, then people presumably or hopefully won’t plough ahead creating systems that keep failing in this horrible, confusing way. We’ll also have plenty of warning you need, to work on solutions to it.
With fast enough takeoff, my expectations start to look more like the caricature—this post envisions reasonably broad deployment of AI, which becomes less and less likely as things get faster. I think the basic problems are still essentially the same though, just occurring within an AI lab rather than across the world.
The post I wrote about takeoff speeds summarised many other criticisms of the scenario going back to 2008. These all recognise that, to get a brain-in-a-box scenario, you need a discontinuity in addition to the assumption that progress will eventually be fast (1B), not just the assumption that progress will eventually be fast (1A).
Issues with directly applying the orthogonality thesis or instrumental convergence to the systems we are in fact likely to build were noticed by Paul Christiano in 2019 and Stuart Russell in Human Compatible (2019):
Modern ML instantiates massive numbers of cognitive policies, and then further refines (and ultimately deploys) whatever policies perform well according to some training objective. If progress continues, eventually machine learning will probably produce systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals.
One reason to be scared is that a wide variety of goals could lead to influence-seeking behavior, while the “intended” goal of a system is a narrower target, so we might expect influence-seeking behavior to be more common in the broader landscape of “possible cognitive policies.”
One reason to be reassured is that we perform this search by gradually modifying successful policies, so we might obtain policies that are roughly doing the right thing at an early enough stage that “influence-seeking behavior” wouldn’t actually be sophisticated enough to yield good training performance...
Overall it seems very plausible to me that we’d encounter influence-seeking behavior “by default,” and possible (though less likely) that we’d get it almost all of the time even if we made a really concerted effort to bias the search towards “straightforwardly do what we want.”
This is clearly Paul explaining why the background principle of instrumental convergence might well apply to the systems we are likely to develop, weighing plausibility considerations for and against the AIs we build actually undergoing instrumental convergence. It therefore implicitly recognises the distinction between 3A and 3B: the abstract principle that instrumental convergence is possible isn’t enough by itself to justify a risk. At the same time, ‘a wide variety of goals could lead to influence seeking behaviour’ is a statement about instrumental convergence, but it is given its proper place as a background plausibility consideration (3A) that might lead to actual instrumental convergence in the systems we are likely to build. This fits with my earlier claim that the initial arguments aren’t wrong but were just taken too far.
The first reason for optimism [about AI alignment] is that there are strong economic incentives to develop AI systems that defer to humans and gradually align themselves to user preferences and intentions. Such systems will be highly desirable: the range of behaviours they can exhibit is simply far greater than that of machines with fixed, known objectives...
This is clearly Stuart recognising that there is a difference between the ‘process orthogonality thesis’ and the actual orthogonality thesis. He thinks it quite likely that success in increasing AI capabilities and success in alignment are closely correlated, so he recognises the difference between 2A and 2B.
In my post on discontinuities I wrote this:
I claim that the Bostrom/Yudkowsky argument for an intelligence explosion establishes a sufficient condition for very rapid growth, and the current disagreement is about what happens between now and that point. This should raise our confidence that some basic issues related to AI timelines are resolved. However, the fact that this claim, if true, has not been recognized and that discussion of these issues is still as fragmented as it is should be a cause for concern more generally.
I think this conclusion applies to the rest of the old arguments for AI safety. If I am right that rapid capability gain, the orthogonality thesis and instrumental convergence are good reasons to think AI might pose an existential risk, but were just interpreted as applying to reality too literally, and also right that the ‘new’ arguments make use of these old arguments, then that should raise our confidence that some basic issues have been dealt with correctly. Ben suggests something like this in the podcast episode, but the discussion never got further into exactly what the similarities might be:
Ben Garfinkel: And so I think if you find yourself in a position like that, with regard to mathematical proof, it is reasonable to be like, “Well, okay. So like this exact argument isn’t necessarily getting the job done when it’s taken at face value”. But maybe I still see some of the intuitions behind the proof. Maybe I still think that, “Oh okay, you can actually like remove this assumption”. Maybe you actually don’t need it. Maybe we can swap this one out with another one. Maybe this gap can actually be filled in.
Ben Garfinkel: So I definitely don’t think that it’d be right in the context to say like, “Oh, I have qualms. I think there are holes. I think there are assumptions to disagree with, therefore the conclusion is wrong”. I think the main thing it implies though, is that we’re not really in a state where at least if you accept the objections I’ve raised, or really have good, tight, rigorous arguments for the conclusion that AI presents this large existential risk from a safety perspective.
I think this has identified those ‘intuitions behind the proof’.