A base model is not well or badly aligned in the first place. It's not agentic; "aligned" isn't an adjective that applies to it at all. It does not have a goal of doing what its human creators want, and it does not "make a choice" about which point to move towards when it is being tuned. Insofar as it has a goal, its goal is to predict the next token, or some batch of goal-heuristics that worked well for predicting the next token in training. If you tune it on some thumbs-up/thumbs-down data from humans, it will not "try to correct the errors in the data supplied", no matter how smart a base model it is, unless that somehow follows from heuristics for better next-token prediction.
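To make that mechanical point concrete, here is a minimal sketch (my illustration, not anything from the original discussion) of what the base model's "goal" actually is: a cross-entropy loss on the next token, minimized by gradient descent. The tiny embedding-plus-linear model is a toy stand-in for a full transformer; tuning on thumbs-up/thumbs-down data just swaps in a different loss, and nothing in the update step "chooses" a target.

```python
# Toy sketch: the "goal" of a base model is just this loss being pushed downhill.
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
head = torch.nn.Linear(d_model, vocab_size)      # stand-in for a full transformer
opt = torch.optim.SGD(list(embed.parameters()) + list(head.parameters()), lr=0.1)

tokens = torch.randint(0, vocab_size, (1, 16))   # a toy training sequence
logits = head(embed(tokens[:, :-1]))             # predict each next token from the prefix
loss = F.cross_entropy(logits.reshape(-1, vocab_size),   # next-token prediction loss
                       tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()   # the parameters move wherever the gradient points; that is all "tuning" is
```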
Now, you could maybe imagine that there is some additional step between the base model and the application of whatever argument you're trying to make here: some step in which the AI acquires enough agency for any of what you're saying to make sense. And presumably that step would also instill some kind of goal (which might be "do what my creators want"), in which case that step is where all the alignment magic would need to happen.
Or maybe you imagine putting the base model in some kind of scaffolding that uses it to do something agentic. Then the (scaffolding + model) combination might be agentic, and the thing you're trying to say could apply if someone tries to tune the scaffolded model; in that case, all the action is in the scaffolding.
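By "scaffolding" I mean something like the following hedged sketch: a plain loop that turns a non-agentic next-token predictor into something agentic by repeatedly prompting it, parsing an action, executing it, and feeding the result back. The `complete` function and the tool set are hypothetical stand-ins for illustration, not any real API.

```python
# Minimal agent scaffold around a text-completion model (all names hypothetical).
def run_agent(complete, tools, goal, max_steps=10):
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # The model only ever continues text; the loop around it supplies the agency.
        reply = complete(transcript + "Next action:")
        action, _, argument = reply.partition(" ")
        if action == "finish":
            return argument
        result = tools.get(action, lambda arg: f"unknown tool {action!r}")(argument)
        transcript += f"Next action: {reply}\nObservation: {result}\n"
    return None
```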
Fair enough: the base model is a simulator, trained on data from a distribution of agentic humans. Give it an initial prompt, and it will attempt to continue that human-like (so normally agentic) behavior. So it doesn't have a single utility function; rather, it is latently capable of simulating a distribution of them. However, (almost) all of those are human (and most of the exceptions are either fictional characters or small groups of humans), so nearly all of the distribution's mass lies in the region of utility-function-space I was discussing: the members of the distribution are almost all about as well aligned, and as misaligned, as the distribution of behaviors found in humans (shaped by human moral instincts as formed by evolutionary psychology). Concerningly, that distribution includes the few percent of people on the sociopathy spectrum (and also fictional supervillains). Hopefully the scatter of black points and the red point are good enough to address that, and to produce a fairly consistent utility function (or at least a narrower distribution) that doesn't have much chance of generating sociopathy. [Note that I'm assuming that alignment first narrows the distribution of utility functions from that of the base model somewhat towards the region shown in your last diagram, and only then am I saying that the AI, at that point in the process, may be smart and well-enough aligned to correctly figure out that it should be converging to the blue dot rather than the red dot.]
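As a toy illustration of that "narrowing" (entirely my own made-up numbers and persona labels, not anything from the post or its diagrams): treat the simulator as a mixture over personas, and tuning as reweighting that mixture by how well each persona fits the tuning feedback, Bayes-style.

```python
# Toy mixture-over-personas picture of a simulator; priors and likelihoods are invented.
personas = {
    "typical human":       {"prior": 0.95, "fits_tuning_data": 0.90},
    "sociopathy spectrum": {"prior": 0.03, "fits_tuning_data": 0.10},
    "fictional villain":   {"prior": 0.02, "fits_tuning_data": 0.05},
}

# Posterior weight after conditioning on the tuning feedback (Bayes' rule on a 3-way mixture).
unnorm = {name: p["prior"] * p["fits_tuning_data"] for name, p in personas.items()}
total = sum(unnorm.values())
posterior = {name: w / total for name, w in unnorm.items()}

for name, w in posterior.items():
    print(f"{name}: {w:.3f}")
# The mass on the sociopathic/villainous personas shrinks: the distribution of
# simulated utility functions narrows towards the intended region.
```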
So I fully agree there's plenty we still have to get right in the process of turning a base model into a hopefully-nearly-aligned agent. But I do think there is a potential for convergence under reflection towards the blue dot, specifically because the blue dot is defined as "what we'd want the AI's behavior to be". Logically and morally, that makes it a distinguished target. (And yes, that definition probably only picks out a hopefully-small region, not a unique specific point, once you take into account the possible different forms of CEV-like processes.)
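The kind of convergence I have in mind can be pictured as a fixed-point iteration. The sketch below is purely illustrative (the coordinates and step sizes are invented): an agent starts near, but not at, the target, repeatedly forms a noisy estimate of "what its creators would want" that improves as it reflects, and moves part of the way towards that estimate each time.

```python
# Toy picture of "convergence under reflection" towards a distinguished target point.
import numpy as np

rng = np.random.default_rng(0)

blue_dot = np.array([0.0, 0.0])   # hypothetical "what we'd want the AI's behavior to be"
values = np.array([1.0, -0.8])    # initial tuned values: near the target, but off it

for step in range(20):
    # The agent's estimate of the blue dot is noisy, improving as it reflects
    # (a stand-in for being smart and well-enough aligned to locate the target).
    estimate = blue_dot + rng.normal(scale=0.3 / (step + 1), size=2)
    values = values + 0.5 * (estimate - values)   # move partway towards the estimate

print(values)   # ends up close to blue_dot, within the residual estimation noise
```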
(I’m also not considering the effect of jailbreaks, prompt injection, and so forth on our agent: base models are far more responsive to things like this than a human is, for obvious reasons, and part of the work of alignment is to try to reduce that flexibility. A base model could probably even be prompted to attempt to simulate a paper-clip-maximizer, but that’s not what it’s trained to do well.)