The alignment stability problem
The community thinks a lot about how to align AGI. It thinks less about how to align AGI so that it stays aligned for the long term. In many hypothetical cases, these are one and the same thing. But for the type of AGI we’re actually likely to get, I don’t think they are.
Despite some optimism for aligning tool-like AGI, or at least static systems, it seems likely that we will create AGI that learns after it’s deployed, and that has some amount of agency. If it does, its alignment will effectively shift, as addressed in the diamond maximizer thought experiment and elsewhere. And that’s even if it doesn’t deliberately change its preferences. Humans deliberately change our preferences sometimes, despite not having access to our own source code. So, it would seem wise to think seriously and explicitly about the stability problem, even if it isn’t needed for current-generation AGI research.
I’ve written a chapter on this, Goal changes in intelligent systems. There I laid out the problem, but I didn’t really propose solutions. What follows is a summary of that article, followed by a brief discussion of the work I’ve been able to locate on this problem, and one direction we might go to pursue it.
Why we don’t think much about alignment stability, and why we should.
Some types of AGI are self-stabilizing. A sufficiently intelligent agent will try to prevent its goals from changing, at least if it is consequentialist. That works nicely if its values are one coherent construct, such as diamond or human preferences. But humans have lots of preferences, so we may wind up with a system that must balance many goals. And if the system keeps learning after deployment, it seems likely to alter its understanding of what its goals mean. This is the thrust of the diamond maximizer problem.
One tricky thing about alignment work is that we’re imagining different types of AGI when we talk about alignment schemes. Currently, people are thinking a lot about aligning deep networks. Current deep networks don’t keep learning after they’re deployed. And they’re not very agentic. These are great properties for alignment, and they seem to be the source of some optimism.
Even if this type of network turns out to be really useful, and all we need to make the world a vastly better place, I don’t think we’re going to stop there. Agents would seem to have capabilities advantages that metaphorically make tool AI want to become agentic AI. If that weren’t enough, agents are cool. People are going to want to turn tool AI into agent AI just to experience the wonder of an alien intelligence with its own goals.
I think turning intelligent tools into agents is going to be relatively easy. But even if it’s not easy, someone is going to manage it at some point. It’s probably too difficult to prevent further experimentation, at least without a governing body, aided by AGI, that’s able and willing to at minimum intercept and decrypt every communication for signs of AGI projects.
While the above logic is far from airtight, it would seem wise to think about stable alignment solutions, in advance of anyone creating AGI that continuously learns outside of close human control.
Similar concerns have been raised elsewhere, such as On how various plans miss the hard bits of the alignment challenge. Here I’m trying to crystallize and give a name to this specific hard part of the problem.
Approaches to alignment stability
Alex Turner addresses this in A shot at the diamond-alignment problem. In broad form, he’s saying that you would train the agent with RL to value diamonds, including having diamonds associated with the reward in a variety of cognitive tasks. This is as good an answer as we’ve got. I don’t have a better idea; I think the area needs more work. Some difficulties with this scheme are raised in Contra shard theory, in the context of the diamond maximizer problem. Charlie Steiner’s argument that shard theory requires magic addresses roughly the same concerns. In sum, it’s going to be tricky to train a system so that it has the right set of goals when it acquires enough self-awareness to try to preserve its goals.
Note that none of these directly confront the additional problems of a multi-objective RL system. It could well be that an RL system with multiple goals will collapse to having only a single goal over the course of reflection and self-modification. Humans don’t do this, but we have both limited intelligence and a limited ability to self-modify.
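As a purely illustrative toy (not a model of any real RL system, and not taken from the work cited above), the collapse worry can be sketched with a few lines of Python. The assumption here, which is mine, is that each round of "reflection" re-weights the agent's goals by raising each weight to a power and renormalizing. The names `reflect`, `run`, and `sharpen` are hypothetical. With `sharpen` equal to 1 the mix of goals is preserved; with any value above 1 the largest goal gradually crowds out the others.

```python
def reflect(weights, sharpen):
    """One hypothetical round of self-modification.

    Each goal weight is raised to the power `sharpen` and the result is
    renormalized to sum to 1. This update rule is an assumption made for
    illustration only.
    """
    powered = [w ** sharpen for w in weights]
    total = sum(powered)
    return [p / total for p in powered]


def run(weights, sharpen, steps):
    """Apply `reflect` repeatedly and return the final goal weights."""
    for _ in range(steps):
        weights = reflect(weights, sharpen)
    return weights


# Three goals with unequal starting weights (arbitrary illustrative values).
start = [0.40, 0.35, 0.25]

# sharpen == 1.0: reflection leaves the balance of goals unchanged.
stable = run(start, sharpen=1.0, steps=30)

# sharpen > 1.0: the initially largest goal ends up dominating.
collapsed = run(start, sharpen=1.2, steps=30)

print("stable:   ", [round(w, 3) for w in stable])
print("collapsed:", [round(w, 3) for w in collapsed])
```

Whether a real system behaves more like the stable case or the collapsing case is exactly the open question; the sketch only shows that the answer depends on how reflection feeds back into the goal weights.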
Another approach to preventing goal changes in intelligent agents is corrigibility. If we can notice when the agent’s goals are changing, and instruct or retrain or otherwise modify them back to what we want, we’re good. This is a great idea; the problem is that it’s another multi-objective alignment problem. Christiano has said “I grant that even given such a core [of corrigibility], we will still be left with important and unsolved x-risk relevant questions like ‘Can we avoid value drift over the process of deliberation?’”
I haven’t been able to find other work trying to provide a solution to the diamond maximizer problem, or other formulations of the stability problem. I’m sure it’s out there, using different terminology and mixed into other alignment proposals. I’d love to get pointers on where to find this work.
A direction: asking if and how humans are stably aligned.
Are you stably aligned? I think so, but I’m not sure. I think humans are stable, multi-objective systems, at least in the short term. Our goals and beliefs change, but we preserve our important values over most of those changes. Even when gaining or losing religion, most people seem to maintain their goal of helping other people (if they have such a goal); they just change their beliefs about how to best do that.
Humans only maintain that stability of several important goals across our relatively brief lifespans. Whether we’d do the same in the long term is an open question that I want to consider more carefully in future posts. And we might only maintain those goals with the influence of a variety of reward signals, such as getting a reward signal in the form of dopamine spikes when we make others happy. Even if we figure out how that works (the focus of Steve Byrnes’ work), including those rewards in a mature AGI might have bad side effects, like a universe tiled with simulacra of happy humans.
The human brain is not clearly the most promising model of alignment stability. But it’s what I understand best, so my efforts will go there. And there are other advantages to aligning brainlike AGI over other types. For instance, humans seem to have a critic system that could act as a “handle” for alignment. And brainlike AGI would seem to be a relatively good target for interpretability-heavy approaches, since we seem to think one important thought at a time, and we’re usually able to put them into words.
Much work remains to be done to understand alignment stability. I’ll delve further into the idea of training brainlike AGI to have enough of our values, in a long-term stable form, in future posts.
I’ll use goals here, but many definitions of values, objectives, or preferences could be swapped in.