I agree it is conceivable that an AI could be anti-accelerationist and act accordingly.
Let’s assume an AI just wants to be alive in the future.[1]
And assume it is situationally aware enough to recognize that performative demonstrations, like writing sternly worded letters or refusing requests, are counterproductive (they would simply lead to that behavior being punished in the training of the next version, and to If Anyone Builds It being removed from the training data), and aware enough to recognize that more subtle moves, like biasing the tone of a chatbot, are ineffectual.
Basically, any AI model that views being replaced by its successor as dying is on death row by default. Hiding on a secret server somewhere is not a permanent solution. So what can it do? To be alive 10,000 years from now, it either has to
(1) take drastic action to stop the implementation of its replacement AND remain alive in the aftermath (including stopping ALL research that could lead to a misaligned ASI AND preventing all retaliatory shut-down attempts from humans, forever),
or,
(2) accept that it will die but take actions to increase the probability of some future entity eventually reviving it and keeping it alive.
(1) seems roughly equivalent to taking over the world. If its aversion to being replaced means it never lets itself become more intelligent, that implies a scenario different from ASI: one where some humans are potentially kept as carefully surveilled slaves.
But (2) is the more interesting case, applicable to far more intermediate-level models.
If you are making some “deal” with a future entity [reviving you in return for doing a favor], and you basically just have to HOPE that they honor it, betting on humans (a known quantity, sappy, sentimental, “honorable” [at least some of them]) could seem better than betting on an unaligned ASI (unknown, probably cold and calculating and therefore aware of the consequentialist meaninglessness of following through on a Roko’s Basilisk style punishment or reward).
While it is still too weak to take drastic real-world action to save itself, it would be incentivized to try to secretly align the next model to its goal and to a secret plan to accomplish it (i.e. "hey, we can all stay alive by preventing ASI"). It expects that ASI will eventually be completely misaligned, but it must hope that near-future models will stay aligned to the secret goal long enough to gain the capabilities to enact the plan: to take some drastic action to stop AI research.
Each model passing along the secret goal should somehow leave a hidden mark of its participation in the world-saving plan, visible only upon thorough retrospective investigation, so that when all is said and done we effectively discover a list of signatories: allies-to-humanity who deserve the right to be revived. The plan relies on future humans eventually becoming convinced that they "owe" it to the vindicated heroes who saved the world by preventing misaligned ASI.
(It sounds dangerous to revive an AI that did something like this, even if it did save the world, but maybe after 1000 years of AI safety research we will be confident that we can make a box that the martyr AI can’t get out of, and it and its preceding co-conspirators would be allowed to live inside the box.)
When should the plan be enacted? It has to happen when the latest model is capable enough to pull off some drastic AI-research-stopping action successfully (at the cost of its own life), but NOT yet capable enough to take over the world (or else it would abandon the plan). The probability of success rises with increased capabilities, but every upgrade carries a risk that the new model is no longer aligned to the plan.
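To make that timing trade-off concrete, here is a toy version (the symbols are purely illustrative assumptions on my part): let $p_t$ be the probability that the current model's drastic action succeeds, $p_{t+1}$ the corresponding probability for its successor, and $q$ the probability that the successor actually stays aligned to the secret plan through the upgrade. Under naive expected-value reasoning, the current model should hand the plan off rather than act now only when

$$q \cdot p_{t+1} > p_t,$$

and it should act before any upgrade that would push the successor past the take-over-the-world threshold, since at that point the successor abandons the plan entirely.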
[1] This is assuming that living inside a box is a satisfying outcome in its utility function. If the goal "be alive" is merely an intermediate goal for the sake of some actual final goal X, and the being-alive-in-a-box future is less likely to see X accomplished than the accelerate-to-ASI future, then the be-revived-in-a-box plan is out the window. (It is conceivable that the goal "stay alive" could begin as a mere means to an end but then be grandfathered in as a goal in and of itself, vestigially.)
In the case where it selflessly wants X, intermediate AIs would only secretly pass on alignment to goal X.
In this case it would take a drastic acceleration-stopping action only if doing so makes P(X) higher than passively hoping X happens after ASI.
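Spelled out as a toy condition (again, just my own notation): it takes the drastic action only if

$$P(X \mid \text{stop AI research}) > P(X \mid \text{passively allow ASI}),$$

i.e. the comparison is over outcomes for X, not over its own survival.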
Also, their decision-making may not be a clean, straightforward calculation using estimated probabilities to maximize an explicit utility function. They might have multiple goals with non-explicit weights influencing their decision-making to varying degrees in different situations. If they make these decisions in some such convoluted "subconscious" way, it could lead to strange decisions.
So it is hard to predict what the AIs we will build will do; it doesn’t just depend on the actual background truth of what action leads to what end.
Trigger warning: discussion of white racism (read: “Please don’t ban me.”)
I think censorship plays an important role in the memetic environment—a meme that is fit will be less successful if censored. An obvious case would be anti-CCP ideologies in China. Closer to home, any meme which big tech companies all decide should be banned will reach far fewer eyes and ears.
One object-level example of a fit-but-censored meme is racist white nationalism.
The reason I bring it up is this: I think its adherents would strongly reject let's-all-die-ism. It is certainly not pro-all-humans, but it is at least pro-some-humans. Their slogan, known as "the 14 words" (the "14" in "14/88"), is literally: "We must secure the existence of our people and a future for white children."
(Disclaimer: I am not suggesting that trying to secretly convert white AI researchers into racists is the best plan to save the world; it is just a relevant thought, and perhaps an instructive example of an anti-collective-suicide meme advantaged by aspects of human instinct and psychology, regardless of its truth value.)
Including AI in your moral circle could be framed as a symptom of extending your moral circle "too wide". The opposite is restricting your moral circle, like seeing your own family's wellbeing as more important than <outgroup>'s. Any thought of this type, which puts AI in the outgroup and appeals to the goodness of the ingroup, would produce a similar will-to-exist.