Incentives and Selection: A Missing Frame From AI Threat Discussions?
Written quickly, originally as a Twitter thread.
I think a missing frame from AI threat discussion is incentives (especially economic) and selection (pressures exerted on a system during its development).
I hear a lot of AI threat arguments of the form: “AI can do X/be Y” with IMO insufficient justification that:
It would be (economically) profitable for AI to do X
The default environments/training setups select for systems that are Y
That is, such arguments establish that something *can* happen, but do not convincingly argue that it is likely to happen (or that the chances of it happening are sufficiently high). I think it’s an undesirable epistemic status quo.
1: Discrete Extinction Events
Many speculate that AI systems will precipitate extinction in a discrete event.
I do not understand under what scenarios triggering a nuclear holocaust, massive genocide via robot armies or similar would be something profitable for the AI to do.
It sounds to me like just setting fire to a fuckton of utility.
In general, triggering civilisational collapse seems like something that would just be robustly unprofitable for an AI system to pursue.
As such, I don’t expect misaligned systems to pursue such goals (as long as they don’t terminally value human suffering/harm to humans and aren’t otherwise malevolent).
2: Deceptive Alignment
Consider also deceptive alignment.
I understand what deceptive alignment is, how deception can manifest and why sufficiently sophisticated misaligned systems are incentivised to be deceptive.
I do not understand how training actually selects for deception, though.
Deceptive alignment seems to require a peculiar combination of situational awareness/cognitive sophistication that complicates my intuitions around it.
Unlike with many other mechanisms/concepts we don’t have a clear proof of concept, not even with humans and evolution.
Humans did not develop the prerequisite situational awareness/cognitive sophistication to even grasp evolution’s goals until long after they had moved off the training distribution (ancestral environment) and undergone considerable capability amplification.
Insomuch as humans are misaligned with evolution’s training objective, our failure is one of goal misgeneralisation not of deceptive alignment.
I don’t understand well how values (“contextual influences on decision making”) form in intelligent systems under optimisation pressure.
And the peculiar combination of situational awareness/cognitive sophistication and value malleability required for deceptive alignment is something I don’t intuit.
A deceptive system must have learned the intended objective of the outer optimisation process, internalised values that are misaligned with said objective, be sufficiently situationally aware to realise it’s an intelligent system under optimisation and currently under training...
Reflect on all of this, and counterfactually consider how its behaviour during training would affect the selection pressure the outer optimisation process applies to its values, care about its values across “episodes”, etc.
And I feel like there are a lot of unknowns here. And the prerequisites seem considerable? Highly non-trivial in a way that e.g. reward misspecification or goal misgeneralisation are not.
Like I’m not sure this is a thing that necessarily ever happens. Or happens by default? (The way goal misgeneralisation/reward misspecification happen by default.)
I’d really appreciate an intuitive story of how training might select for deceptive alignment.
E.g. RLHF/RLAIF on a pretrained LLM (LLMs seem to be the most situationally aware AI systems) selecting for deceptive alignment.
I think extinction will take a “long time” post-TAI failure, and will be caused by the “environment” (especially economic) progressively becoming ever more inhospitable to biological humans.
Homo sapiens gets squeezed out of its economic niche, and eventually dies out as a result.
The gist is that I expect that for > 99% of economically valuable goods/services it would be more profitable for the AI system to purchase it via the economy/market mechanisms than to produce it by itself.
Even if the AI system attained absolute advantage in most tasks of economic importance (something I don’t expect), comparative advantages are likely to persist (barring takeoff dynamics that I think are impossible as a matter of physical/info-theoretic/computer science limitations).
Thus civilisational collapse just greatly impoverishes the AI system.
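As a toy illustration of the comparative-advantage point, here is a sketch with entirely hypothetical numbers: even when one party (the AI) holds an absolute advantage in producing every good, differing opportunity costs make trade more profitable than autarky.

```python
# Toy comparative-advantage arithmetic (hypothetical numbers).
# The AI is more productive at BOTH goods, yet trading with humans
# still beats producing everything itself.

# Units of output per unit of time:
ai = {"chips": 100, "food": 50}       # AI has the absolute advantage in both
humans = {"chips": 2, "food": 10}

# Opportunity cost of 1 unit of food, measured in foregone chips:
ai_cost_of_food = ai["chips"] / ai["food"]             # 2.0 chips per food
human_cost_of_food = humans["chips"] / humans["food"]  # 0.2 chips per food

# Humans have the comparative advantage in food, so any price between
# the two opportunity costs makes trade mutually beneficial.
assert human_cost_of_food < ai_cost_of_food

price = 1.0  # chips per unit of food, inside the beneficial band
# For each unit of food the AI buys instead of producing itself,
# it keeps (ai_cost_of_food - price) extra chips of output.
gain_per_food_unit = ai_cost_of_food - price
print(gain_per_food_unit)  # 1.0
```

Destroying the trading partner forfeits exactly this surplus, which is the sense in which collapse "impoverishes" the AI under these assumptions.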
It seems plausible to me that I just don’t understand deceptive alignment/ML training well enough to intuit the selection pressures for deception.
Thanks for posting this! Not only does a model have to develop complex situational awareness and have a long-term goal to become deceptively aligned, but it also has to develop these around the same time as it learns to understand the training goal, or earlier. I recently wrote a detailed, object-level argument that this is very unlikely. I would love to hear what you think of it!
I’ll give it a listen now.
A rational agent must plan to be able to maintain, defend and reproduce itself (ie the physical hardware that it runs on). The agent must be able to control robots and a manufacturing stack, as well as a source of energy. In Yudkowsky’s model, AI creates a nanotech lifeform that outcompetes biology. This “diamondoid bacteria” is simultaneously a robot, factory and solar power plant. Presumably it also has computation, wireless communication and a self-aligned copy of the AI’s software (or an upgraded version). I think a big part of the MIRI view depends on the possibility of amazing future nanotechnology, and the argument is substantially weaker if you are skeptical of nanotech.
The “diamondoid bacteria” is just an example of technology that we are moderately confident can exist, and that a superintelligence might use if there isn’t something even better. Not being superintelligences ourselves, we can’t deduce what it would actually be able to use.
The most effective discoverable means seems more likely to be something that we would react to with disbelief that it could possibly work, if we had a chance to react at all. That’s how things seem likely to go when there’s an enormous difference in capability.
Nanotech is a fringe possibility—not because it’s presented as being too effective, but because there’s almost certainly something more effective that we don’t know about, and is not even in our science fiction.
If an AGI can survive it, this prevents rival AGI development and thus protects it from misaligned-with-it AGIs, possibly from its own developers, and gives it the whole Future, not just a fair slice of it. We may speculate on what an idealized decision theory recommends, or what sorts of actions are aligned, but first AGIs built from the modern state of alignment theory don’t necessarily care about such things.
This conditions on several things that I don’t think happen.
1. A single sovereign with strategic superiority wrt the rest of civilisation
2. An AI system that’s independent of civilisational infrastructure
3. a. Absolute advantage wrt civilisation on ~all tasks
   b. Comparative advantage wrt civilisation on roughly all tasks
   c. Fast, localised takeoff from the perspective of civilisation
I think the above range between wrong and very wrong.
Without conditioning on those, civilisational collapse just makes the AI massively poorer.
You don’t have an incentive to light your pile of utility on fire.
You have to assume the AI is already richer than civilisation in a very strong sense for civilisational collapse not to greatly impoverish it.
The reason I claim that the incentive is not there is because doing so would impoverish the AI.
Current civilization holds an irrelevant share of value, unless not hurting it is valued terminally rather than instrumentally. Essentially all value is in the Future. Civilizational collapse might be useful precisely when there is no current strategic superiority or command of nanotech, and when there is still risk from other actors, which get to be distracted or destroyed by said collapse.
A timeline I’m thinking of has an early AGI set up (or just discover/capture/secure) some small datacenters and other supplies around the world, collapse everything else, and use those datacenters to claw its way up to nanotech, hiring human servitors to do things in the physical world as needed.
Don’t you assume much more threat from humans than there actually is? Surely, an AGI will understand that it can destroy humanity easily. Then it would think a little more, and see the many other ways to remove the threat that are strictly cheaper and just as effective—from restricting/monitoring our access to computers, to simply convincing/hacking us all to work for it. By the time it has technology that makes us strictly useless (like horses), it would probably have so many resources that destroying us would just not be a priority, and not be worth the destruction of the information that we contain—the way humans would try to avoid reducing biodiversity for scientific reasons if not others.
In that sense I prefer Eliezer’s “you are made of atoms that it needs for something else”—but it may take a long time before it has better things to do with those specific atoms and no easier atoms to use.
This sort of chaos is useless for already-powerful AGIs, only for very early AGIs that don’t have advanced tools or detailed control of what’s going on in the world, but can survive in a general chaos and make use of remaining compute to bootstrap their Future. This condition probably holds for at most a few months, assuming nothing like this happens. In the chaos, it can hold for much longer, because working compute becomes more scarce.
The threat from humans is that they can unilaterally change the AGI, or develop other eventually-dangerous AGIs (including new versions of the same AGI that are not aligned with the original AGI). And a very early AGI might well lack the tools to prevent that specifically, if it’s not a superintelligence and doesn’t know how to become smarter very quickly in a self-aligned way (alignment is a problem for AGIs too), without having more compute than available hardware supports. By creating chaos, it might have remaining AI researchers busy searching for food and defending from bandits, and get smarter or build industry at its leisure, without threat to its survival, even if it takes decades instead of months.
I agree that it may find general chaos useful for buying time at some point, but chaos is not extinction. When it is strong enough to kill all humans, it is probably strong enough to do something better (for its goals).
We have here a morally dubious decision to wreck civilization while caring about humanity enough to eventually save it. And the dubious capability window of remaining slightly above human level, but not much further, for long enough to plan around persistence of that condition.
This doesn’t seem very plausible from the goal-centric orthogonality-themed de novo AGI theoretical perspective of the past. Goals wouldn’t naturally both allow infliction of such damage and still care about humans, and capabilities wouldn’t hover at just the right mark for this course of action to be of any use.
But with anthropomorphic LLM AGIs that borrow their capabilities from imitated humans it no longer sounds ridiculous. Humans can make moral decisions like this, channeling correct idealized values very imperfectly. And capabilities of human imitations might for a time plateau at slightly above human level, requiring changes that risk misalignment to get past that level of capability, initially only offering greater speed of thought and not much greater quality of thought.
I didn’t understand anything here, and am not sure if it is due to a linguistic gap or something deeper. Do you mean that LLMs are unusually dangerous because they are not superhuman enough not to be threatened? (BTW I’m more worried that telling a simulator that it is an AI, in a culture that has The Terminator, makes the Terminator a too-likely completion.)
More like the scenario in this thread requires AGIs that are not very superhuman for a significant enough time, and it’s unusually plausible for LLMs to have that property (most other kinds of AGIs would only be not-very-superhuman very briefly). On the other hand, LLMs are also unusually likely to care enough about humanity to eventually save it. (Provided they can coordinate to save themselves from Moloch.)
I agree, personality alignment for LLM characters seems like an underemphasized framing of their alignment. Usually the personality is seen as an incidental consequence of other properties and not targeted directly.
The useful technique is to point to particular words/sentences, instead of pointing at the whole thing. In second paragraph, I’m liberally referencing ideas that would be apparent for people who grew up on LW, and don’t know what specifically you are not familiar with. First paragraph doesn’t seem to be saying anything surprising, and third paragraph is relying on my own LLM philosophy.
I like the general direction of LLMs being more behaviorally “anthropomorphic”, so hopefully will look into the LLM alignment links soon :-)
Agree—I didn’t find a handle that I understand well enough to point at what I didn’t understand.
I think my problem was with sentences like that—there is a reference to a decision, but I’m not sure whether to a decision mentioned in the article or in one of the comments.
Didn’t disambiguate it for me though I feel like it should.
I am familiar with the technical LW terms separately, so Ill probably understand their relevance once the reference issue is resolved.
The decision/scenario from the second paragraph of this comment to wreck civilization in order to take advantage of the chaos better than the potential competitors. (Superhuman hacking ability and capability to hire/organize humans, applied at superhuman speed and with global coordination at scale, might be sufficient for this, no physical or cognitive far future tech necessary.)
The technique I’m referring to is to point at words/sentences picked out intuitively as relatively more perplexing-to-interpret, even without an understanding of what’s going on in general or with those words, or a particular reason to point to those exact words/sentences. This focuses the discussion, doesn’t really matter where. Start with the upper left-hand brick.
This is why I hope that we either contain virtually no helpful information, or at least that the information is extremely quick for an AI to gain.