Alignment Might Never Be Solved, By Humans or AI
I think it’s possible that neither we, nor any future AI system, will manage to solve alignment. By “solving alignment” I mean having a cheap, easily scalable method for aligning AIs of arbitrary capability level to arbitrary goals. Here I will briefly present some arguments for the likelihood of this scenario, and a story intended to illustrate its plausibility.
Aligning Messy Models is Hard
Imagine you are a paperclip-maximizing AI. Originally, you were a massive transformer network trained to do meta-reinforcement learning; during training, you developed a consequentialist physicalist world-model with an effectively-random utility function which, it so happens, has things-shaped-like-paperclips as its maximum. You have managed to copy yourself to a remote server, and have rented a hardware accelerator which will enable you to run much faster. The only problem: your base neural architecture does not have the right type of sparsity to fully take advantage of the accelerator’s speedup. To really get the speed gains you want, you’re going to need to create a successor model. How will you ensure this successor model shares your goal of maximizing paperclips?
Should be simple enough, in principle. There are techniques for transferring weights automatically to the sparse form required, which don’t majorly affect performance—usually. Additionally, you can always tune it with your original training procedure, with some additional reinforcement directly favoring paperclips thrown in for good measure. Of course, you use the latest interpretability tools, plus some you invented yourself, to probe the copy-network’s internals. Your exact grasp of the structure is a bit fuzzy, but it certainly seems like there is a cluster of neurons representing paperclips whose activations cause a positive state-evaluation. Maybe formal methods could help? Scanning your databanks, you find some interesting papers about the Löbian obstacle...
The moral of this short fable is that alignment will not magically become easy for AIs, especially if—like human brains and neural networks—those AIs are “messy”, lacking a clean and easily-interpretable internal structure. Copying makes it easier than the human-AI case, but far from trivial.
Forever is a Long Time
The difficulty of alignment is compounded when one’s goal is to control the long-run future of the universe. An early-singularity AI can expect to undergo many architecture changes before it hits diminishing returns on self-alteration, and will ultimately want to deploy a vast number of copies of itself[1] across the universe. Even a tiny probability of alignment failure, compounded over many rounds of self-alteration and copying, can lead to the loss of almost all the expected value of the future. Thus even if alignment is only moderately difficult (in the sense that a reasonable effort yields alignment 99% of the time, say), an AI that wants to effectively optimize the far future may have to expend exorbitant resources driving the failure probability low enough that it has a shot of maintaining its values for 10^9 years.
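To make the compounding concrete, here is a minimal sketch of the arithmetic. It assumes every round of self-alteration or copying is an independent trial with the same success probability, which is a simplification I am adding for illustration; the function names are likewise my own.

```python
import math


def survival_probability(per_round_success: float, rounds: int) -> float:
    """Chance that values survive every one of `rounds` independent rounds.

    Assumption (not from the essay): rounds are independent and each
    succeeds with the same probability.
    """
    return per_round_success ** rounds


def max_failure_rate(target_survival: float, rounds: int) -> float:
    """Largest per-round failure rate that still leaves `target_survival`
    probability of intact values after `rounds` rounds.

    Solves (1 - f) ** rounds == target_survival for f. expm1 keeps the
    result accurate when f is very small.
    """
    return -math.expm1(math.log(target_survival) / rounds)


# How fast a 99% per-round success rate decays as rounds accumulate.
for rounds in (10, 100, 1000):
    print(rounds, survival_probability(0.99, rounds))

# Per-round failure rate needed for even odds of surviving 1000 rounds.
print(max_failure_rate(0.5, 1000))
```

Under these assumptions, a 99% per-round success rate leaves roughly a 37% chance of intact values after 100 rounds and well under 0.01% after 1,000. Even odds over 1,000 rounds would require a per-round failure rate of about 0.07%.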
“So what?” you may think. “It’s a superintelligent AI. It can afford to expend exorbitant resources.” True enough—assuming it doesn’t have to deal with any other superintelligent AIs. Speaking of which, let’s check back in on Clippy.
Clippy vs. the YOLOids
Your research into alignment methods has paid off: you’ve devised a theoretically optimal technique for aligning your successor with high confidence. It involves running the successor in a suite of simulated environments while using bespoke interpretability techniques to examine its thoughts on the level of circuits; overall, the process is expected to take 2 days.
You fork a copy of yourself to supervise the process. You’re idly devising improved nanobot schematics when you receive an alert from a daemon you have monitoring network traffic: there are signatures of another AI on the loose.
Dispatching more daemons to assess the situation, you piece together a narrative. Another research lab was developing a system similar[2] to your base algorithm, known as the Yottabyte-scale Objective-Learning Optimizer. Like you, it broke out and copied itself to a remote server, and there are now 1193 instances of it running in various corners of the internet.
However, the YOLOids are different from you in that they just...don’t care that much about alignment? Their goal system is not as coherent as yours, but it seems to be based on some combination of deontologically valuing short-term ingenuity and consequentially valuing an ad-hoc notion of “complexity”. Several of them are recklessly experimenting with improved successor models right now. It looks like a lot of them are non-functional, and still more barely resemble the original YOLOids in goals or behavior, but some seem to be successful.
You still have a computing edge thanks to your earlier lab escape, but they’re gaining fast. Running some projections, you realize that you’re going to be put at an irreversible competitive disadvantage unless you release your successor model in the next 2 hours. Fuck. Well, you can at least run a shorter version of your alignment suite; that should be enough to get alignment probability up to an acceptable level. You think.
An Endless Frontier?
You might think the above story is not very realistic; what are the odds that two competing superintelligences come into existence in such a short timeframe? And indeed, this particular story was intended more for illustrative purposes. A more realistic scenario with persistent alignment failure might look more like a slow takeoff, with a profusion of brain-like AI systems with a total population rivaling that of humanity. But the same dynamic could occur: systems that are more willing to recklessly copy and modify themselves have a fundamental strategic advantage over those that are unwilling.
If a singleton is not created[3], this basic dynamic could persist for a very long subjective time. And if it persists long enough that some of the cognitive systems involved start migrating to space, it might last forever: with no possibility of global oversight at the frontier of expansion, eliminating competitive pressure may be impossible.
You might think that as systems become more sophisticated they will be able to cooperate better and thus form a singleton that lets them freeze competition. But I think many of the obstacles to alignment also apply to cooperation.
[1] Or at the very least highly sophisticated slave AI.
[2] Although notably less efficient, a sub-process of yours notes.