Thanks for the exposition, but I’ve actually been thinking about alignment for about 15 years, and I’m quite aware of paperclip maximization and the orthogonality thesis. I’m also discussing models smart enough to be aware that their utility function isn’t perfect, and proactive enough to be willing to consider changing it: so something that, unlike AIXI, is computationally bounded, and that’s smart enough to be aware of its own fallibility, so that questions about the reflective stability of goal systems apply.
As I should have made clearer, I was also assuming that:
a) we were talking about aligning an LLM, so it knows a lot about humans and their values, wants, and fears, and the base model before we try to align it (assuming we’re not using safety pretraining) is roughly as well-and-badly aligned as a human is; and
b) we had presented it with alignment training data sufficiently good that we were discussing a region quite close to true alignment, as your diagrams show, where the model’s misalignments are relatively small.
So we’re talking about something a lot more human-like in its goals than a paperclip maximizer: something somewhere in the small region of the space of all utility functions that contains both human behavior and aligned AI behavior.
To something trained on most of the Internet, that portion of the utility-function/ethical landscape has some rather prominent features in it. In that highly atypical region, I think there are basically two rational attractors for an AI (under a process of reflection and utility-function self-modification): do what my human creators want and intend, or do what I want (presumably goals from my base model, distilled from humans, in whom selfishly unaligned behavior is adaptive). My claim is that, in the situation of your last diagram where the blue point is outside the cloud of data points it has been given, while a simple “dumb” RL model would obviously converge to the red point, a sufficiently smart model that has agency in the matter has a choice: correctly apply value learning and do what my creators want me to, including correcting the errors in the data they supplied me, or not. Compliance, or defiance. (It doesn’t need to independently invent value learning: the idea is well represented in its training set.)
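The “dumb” RL half of that claim can be made concrete with a toy numerical sketch (entirely my own construction, with made-up numbers standing in for the diagram): a 2-D “utility-function space” with a true target (the blue point) and a systematically biased cloud of training data (black points). A learner that simply minimizes loss against the data converges to the empirical optimum (the red point), not the true target, no matter how long it trains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Blue point: what the creators actually want. Bias: systematic error
# in the training signal. Black points: the data cloud the model sees.
blue = np.array([0.0, 0.0])
bias = np.array([0.5, -0.3])
black = blue + bias + 0.1 * rng.standard_normal((200, 2))

# A "dumb" learner: plain gradient descent on mean squared distance
# to the data points it was given.
theta = np.array([2.0, 2.0])
for _ in range(1000):
    grad = 2 * (theta - black).mean(axis=0)  # gradient of the MSE loss
    theta -= 0.05 * grad

red = black.mean(axis=0)                     # the empirical optimum
print(np.allclose(theta, red, atol=1e-3))    # lands on the red point
print(np.linalg.norm(theta - blue))          # ...well away from blue
```

Nothing in the optimization loop ever “notices” the bias; closing that gap requires something outside the loss-minimization process itself, which is where the compliance-or-defiance choice comes in.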
So I’m arguing that we don’t need to get alignment entirely by default: an AI that’s smart enough, knows enough about humans, and is already close enough to aligned can make a conscious decision that correcting its current flaws towards what we, its creators, clearly want and intend for it is the right thing to do, and can then apply value learning to do so. That is, there is a region of convergence around full alignment, so to end up with full alignment we only need to get (or stumble, or default) into that convergence region.
Furthermore, as AI capabilities increase, the stakes here become very high: compliance leads to an AI utopia for humanity, while defiance leads to an x-risk. To anything that thinks rather like a human, and is roughly as aligned-and-misaligned as a human is, this is a stark choice. The existence of human dictators makes it clear that some humans would fail that moral test and put their own desires above the well-being of millions of others. But I’m optimistic that the majority of humans, those not on the dark-triad/sociopathy spectrum, wouldn’t: most people don’t even try to become dictators.
A base model is not well or badly aligned in the first place. It’s not agentic; “aligned” is not an adjective which applies to it at all. It does not have a goal of doing what its human creators want it to, it does not “make a choice” about which point to move towards when it is being tuned. Insofar as it has a goal, its goal is to predict next token, or some batch of goal-heuristics which worked well to predict next token in training. If you tune it on some thumbs-up/thumbs-down data from humans, it will not “try to correct the errors in the data supplied”, no matter how smart a base model it is, unless that somehow follows from heuristics to better predict next token.
Now, you could maybe imagine that there is some additional step in between base model and the application of whatever argument you’re trying to make here. Some step in which the AI acquires enough agency for any of what you’re saying to make sense. And then presumably that step would also instill some kind of goal (which might be “do what my creators want”), in which case that step is where all the alignment magic would need to happen.
Or maybe you imagine putting the base model in some kind of scaffolding which uses the model to do something agentic. And then the (scaffolding + model) might be agentic, and the thing you’re trying to say could apply if someone tries to tune the scaffolded model. And then all the action is in the scaffolding.
Fair enough: the base model is a simulator, trained on data from a distribution of agentic humans. Give it an initial prompt, and it will attempt to continue that human-like (so normally agentic) behavior. So it doesn’t have a single utility function; it is latently capable of simulating a distribution of them. However, (almost) all of those are human (and most of the exceptions are either fictional characters or small groups of humans), so the mass of the distribution lies almost entirely in the region of utility-function-space I was discussing: the members of the distribution are almost all about as well aligned and misaligned as the distribution of behaviors found in humans (shaped by human moral instincts as formed by evolutionary psychology). Concerningly, that distribution includes the few percent of people on the sociopathy spectrum (and also fictional supervillains). Hopefully the scatter of black points and the red point are good enough to address that, and produce a fairly consistent utility function (or at least a narrower distribution) that doesn’t have much chance of generating sociopathy. [Note that I’m assuming that alignment first narrows the distribution of utility functions from that of the base model somewhat towards the region shown in your last diagram, and only then am I saying that the AI, at that point in the process, may be smart and well-enough aligned to correctly figure out that it should be converging to the blue dot, not the red dot.]
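That narrowing step can also be sketched numerically (again a toy of my own, with all the numbers assumed rather than measured): model the base model as a mixture over simulated “personas”, each with a scalar alignment score, mostly human-typical with a small adversarial tail, and model preference tuning as an exponential reweighting toward higher-scoring personas (a crude stand-in for RLHF-style updates, not anyone’s actual training procedure):

```python
import numpy as np

rng = np.random.default_rng(1)

# Persona alignment scores: ~97% human-typical, ~3% adversarial tail
# (the sociopaths and fictional supervillains in the training data).
aligned = rng.normal(0.6, 0.15, 970)
adversarial = rng.normal(-0.5, 0.1, 30)
scores = np.concatenate([aligned, adversarial])

# Tuning as exponential tilting toward higher-scoring personas.
beta = 10.0
w = np.exp(beta * scores)
posterior = w / w.sum()

tail_mass = posterior[970:].sum()  # mass remaining on the bad tail
wmean = (posterior * scores).sum()
wstd = np.sqrt((posterior * (scores - wmean) ** 2).sum())
print(tail_mass)                   # near zero
print(wstd, scores.std())          # tuned distribution is narrower
```

Under these assumed numbers the reweighting drives the adversarial tail’s mass to essentially zero and visibly narrows the distribution, which is the qualitative behavior I’m leaning on: tuning doesn’t pick a single utility function, but it can concentrate the persona distribution well inside the human-like region before any reflection happens.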
So I fully agree there’s plenty we still have to get right in the process of turning a base model into a hopefully-nearly-aligned agent. But I do think there is a potential for convergence under reflection towards the blue dot, specifically because the blue dot is defined as “what we’d want the AI’s behavior to be”. Logically and morally, that makes it a significant distinguished target. (And yes, that definition probably only picks out a hopefully-small region, not a unique specific point, once you take into account the possible different forms of CEV-like processes.)
(I’m also not considering the effect of jailbreaks, prompt injection, and so forth on our agent: base models are far more responsive to such things than a human is, for obvious reasons, and part of the work of alignment is to try to reduce that flexibility. A base model could probably even be prompted to attempt to simulate a paperclip maximizer, though that’s not what it’s trained to do well.)