Against most, but not all, AI risk analogies
I dislike most AI risk analogies that I’ve seen people use. While I think analogies can be helpful for explaining concepts to people and illustrating mental pictures, I think they are frequently misused, and often harmful. At the root of the problem is that analogies are consistently mistaken for, and often deliberately intended as, arguments for particular AI risk positions. And a large fraction of the time when analogies are used this way, I think they are misleading and imprecise, routinely conveying the false impression of a specific, credible model of AI, even when no such credible model exists.
Here is a random list of examples of analogies that I found in the context of AI risk (note that I’m not saying these are bad in every context):
Stuart Russell: “It’s not exactly like inviting a superior alien species to come and be our slaves forever, but it’s sort of like that.”
Rob Wiblin: “It’s a little bit like trying to understand how octopuses are going to think or how they’ll behave — except that octopuses don’t exist yet, and all we get to do is study their ancestors, the sea snail, and then we have to figure out from that what’s it like to be an octopus.”
Eliezer Yudkowsky: “The character this AI plays is not the AI. The AI is an unseen actress who, for now, is playing this character. This potentially backfires if the AI gets smarter.”
Nate Soares: “My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology [...] And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn’t make the resulting humans optimize mentally for IGF.”
Norbert Wiener: “when a machine constructed by us is capable of operating on its incoming data at a pace which we cannot keep, we may not know, until too late, when to turn it off. We all know the fable of the sorcerer’s apprentice...”
Geoffrey Hinton: “It’s like nuclear weapons. If there’s a nuclear war, we all lose. And it’s the same with these things taking over.”
Joe Carlsmith: “I think a better analogy for AI is something like an engineered virus, where, if it gets out, it gets harder and harder to contain, and it’s a bigger and bigger problem.”
Ajeya Cotra: “Corporations might be a better analogy in some sense than the economy as a whole: they’re made of these human parts, but end up pretty often pursuing things that aren’t actually something like an uncomplicated average of the goals and desires of the humans that make up this machine, which is the Coca-Cola Corporation or something.”
SKLUUG: “AI risk is like Terminator! AI might get real smart, and decide to kill us all! We need to do something about it!”
These analogies cover a wide scope, and many of them can indeed sometimes be useful in conveying meaningful information. My point is not that they are never useful, but rather that these analogies can be shallow and misleading. The analogies establish almost nothing of importance about the behavior and workings of real AIs, but nonetheless give the impression of a model for how we should think about AIs.
And notice how these analogies can give an impression of a coherent AI model even when the speaker is not directly asserting it to be a model. Regardless of the speaker’s intentions, I think the actual effect is frequently to plant a detailed-yet-false picture in the audience’s mind, giving rise to specious ideas about how real AIs will operate in the future. Because the similarities are so shallow, reasoning from these analogies will tend to be unreliable.
A central issue here is that these analogies are frequently chosen selectively, picked on the basis of evoking a particular favored image rather than on the basis of identifying the most natural point of comparison possible. Consider this example from Ajeya Cotra:
Rob Wiblin: I wanted to talk for a minute about different analogies and different mental pictures that people use in order to reason about all of these issues. [...] Are there any other mental models or analogies that you think are worth highlighting?
Ajeya Cotra: Another analogy that actually a podcast that I listen to made — it’s an art podcast, so did an episode on AI as AI art started to really take off — was that it’s like you’re raising a lion cub, or you have these people who raise baby chimpanzees, and you’re trying to steer it in the right directions. And maybe it’s very cute and charming, but fundamentally it’s alien from you. It doesn’t necessarily matter how well you’ve tried to raise it or guide it — it could just tear off your face when it’s an adult.
Is there any reason why Cotra chose “chimpanzee” as the point of comparison when “golden retriever” would have been equally valid? It’s hard to know, but plausibly, she didn’t choose golden retriever because that would have undermined her general thesis.
I agree that if her goal was to convey the logical possibility of misalignment, then the analogy to chimpanzees works well. But if her goal was to convey the plausibility of misalignment, or anything like a “mental model” of how we should think of AI, I see no strong reason to prefer the chimpanzee analogy over the golden retriever analogy. The mere fact that one analogy evokes a negative image and the other evokes a positive image seems, by itself, no basis for any preference in their usage.
Or consider the analogy to human evolution. If you are trying to convey the logical possibility of inner misalignment, the analogy to human evolution makes sense. But if you are attempting to convey the plausibility of inner misalignment, or a mental model of inner misalignment, why not choose instead to analogize the situation to within-lifetime learning among humans? Indeed, as Quintin Pope has explained, the evolution analogy seems to have some big flaws:
“human behavior in the ancestral environment” versus “human behavior in the modern environment” isn’t a valid example of behavioral differences between training and deployment environments. Humans weren’t “trained” in the ancestral environment, then “deployed” in the modern environment. Instead, humans are continuously “trained” throughout our lifetimes (via reward signals and sensory predictive error signals). Humans in the ancestral and modern environments are different “training runs”.
As a result, human evolution is not an example of:
We trained the system in environment A. Then, the trained system processed a different distribution of inputs from environment B, and now the system behaves differently.
It’s an example of:
We trained a system in environment A. Then, we trained a fresh version of the same system on a different distribution of inputs from environment B, and now the two different systems behave differently.
Many proponents of AI risk seem happy to critique analogies when they don’t support the desired conclusion, such as the anthropomorphic analogy. Sometimes, they will even critique the analogies they imagine other people use, such as “it’s like a toaster” or “it’s like Google Maps”. And of course, in these cases, they can easily identify the flaws:
Ajeya Cotra: I think the real disanalogy between Google Maps and all of this stuff and AI systems is that we are not producing these AI systems in the same way that we produced Google Maps: by some human sitting down, thinking about what it should look like, and then writing code that determines what it should look like.
To be clear, I agree Google Maps is a bad analogy. But is the chimp analogy really so much better? Shouldn’t we be applying the same degree of rigor against our own analogies too?
My point is not “use a different analogy”. My point is that we should stop relying on analogies in the first place. Use detailed object-level arguments instead!
ETA: To clarify, I’m not against using analogies in every case. I’m mostly just wary of having our arguments depend on analogies, rather than detailed models. See this footnote for more information about how I view the proper use of analogies.
While the purpose of analogies is to provide knowledge in place of ignorance — to explain an insight or a concept — I believe many AI risk analogies primarily misinform or confuse people rather than enlighten them; they can insert unnecessary false assumptions in place of real understanding. The basic concept they are intended to convey may be valuable to understand, but riding along with that concept is a giant heap of additional speculation.
Part of this is that I don’t share other people’s picture of what AIs will actually look like in the future. This is only a small part of my argument, because my main point is that we should rely much less on arguments by analogy, rather than switch to different analogies that convey different pictures. But this difference in how I view the future still plays a significant role in my frustration at the usage of AI risk analogies.
Maybe you think, for example, that the alien and animal analogies are great for reasons that I’m totally missing. But it’s still hard for me to see that. At least, let me compare my picture, and maybe you can see where I’m coming from.
Again: The next section is not an argument. It is a deliberately evocative picture, to help compare my expectations of the future against the analogies I cited above. My main point in this post is that we should move away from a dependence on analogies, but if you need a “picture” of what I expect from AI, to compare it to your own, here is mine.
The default picture, as I see it — the thing that seems to me like a straightforward extrapolation of current trends in 2024 into the medium-term future, as AIs match and begin to slightly exceed human intelligence — looks nothing like the caricatures evoked by most of the standard analogies. In contrast to the AIs-will-be-alien model, I expect AIs will be born directly into our society, deliberately shaped by us, for the purpose of filling largely human-shaped holes in our world. They will be socially integrated with us and will likely substantially share our concepts about the social and physical world, having been trained on our data and being fluent in our languages. They will be numerous and everywhere, interacting with us constantly, assisting us, working with us, and even providing friendship to hundreds of millions of people. AIs will be evaluated, inspected, and selected by us, and their behavior will be determined directly by our engineering.
I feel this picture is a relatively simple extension of existing trends, with LLMs already being trained to be kind and helpful to us, and collaborate with us, having first been shaped by our combined cultural output. I expect this trend of assimilation into our society will intensify in the foreseeable future, as there will be consumer demand for AIs that people can trust and want to interact with. Progress will likely be incremental rather than appearing suddenly with the arrival of a super-powerful agent. And perhaps most importantly, I expect oversight and regulation will increase dramatically over time as AIs begin having large-scale impacts.
It is not my intention to paint a picture of uniform optimism here. There are still plenty of things that can go wrong in the scenario I have presented. And much of it is underspecified because I simply do not know what the future will bring. But at the very least, perhaps you can now sympathize with my feeling that most existing AI risk analogies are deeply frustrating, given my perspective.
Again, I am not claiming analogies have no place in AI risk discussions. I’ve certainly used them a number of times myself. But I think they can be, and frequently are, used carelessly, and they seem to regularly slip various incorrect illustrations of how future AIs will behave into people’s mental models, even without any intent from the person making the analogy. In my opinion, it would be a lot better if, overall, we reduced our dependence on AI risk analogies and substituted specific object-level points in their place.
To be clear, I’m not against all analogies. I think that analogies can be good if they are used well in context. More specifically, analogies generally serve one of three purposes:
1. Explaining a novel concept to someone
2. Illustrating, or evoking a picture of a thing in someone’s head
3. Serving as an example in a reference class, to establish a base rate or otherwise form the basis of a model
I think that in cases (1) and (2), analogies are generally bad as arguments, even if they might be good for explaining something. They’re certainly not bad if you’re merely trying to tell a story, or convey how you feel about a problem, or convey how you personally view a particular thing in your own head.
In case (3), I think analogies are generally weak arguments, until they are made more rigorous. Moreover, when the analogy is used selectively, it is generally misleading. The rigorous way of setting up this type of argument is to deliberately try to search for all relevant examples in the reference class, without discriminating in favor of ones that merely evoke your preferred image, to determine the base rate.
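To make the difference between the selective and the rigorous approach concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the reference class, the entry names, and the outcome labels are invented and are not empirical claims about animals, AIs, or anything else. It only shows how a base rate computed over the whole reference class can differ from one computed over examples picked because they evoke a preferred image.

```python
from fractions import Fraction

# Hypothetical toy reference class. Each entry is (name, outcome_was_bad).
# The names and labels are invented for illustration only.
REFERENCE_CLASS = [
    ("example_1", True),
    ("example_2", True),
    ("example_3", False),
    ("example_4", False),
    ("example_5", False),
]


def base_rate(examples):
    """Return the fraction of examples whose outcome was bad."""
    bad = sum(1 for _, outcome_was_bad in examples if outcome_was_bad)
    return Fraction(bad, len(examples))


# Selective use: keep only the examples that support the preferred image.
cherry_picked = [entry for entry in REFERENCE_CLASS if entry[1]]

# Rigorous use: count every example in the reference class.
full_rate = base_rate(REFERENCE_CLASS)
selected_rate = base_rate(cherry_picked)

print(f"full reference class: {full_rate}")
print(f"cherry-picked subset: {selected_rate}")
```

With these made-up labels, the full reference class gives a base rate of 2/5, while the cherry-picked subset gives 1. The hard part in practice is not the arithmetic but deciding what belongs in the reference class in the first place, which this sketch simply assumes away.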