Fair enough: the base model is a simulator, trained on data from a distribution of agentic humans. Give it an initial prompt and it will attempt to continue that human-like (so normally agentic) behavior. So it doesn’t have a single utility function; rather, it is latently capable of simulating a distribution of them. However, (almost) all of those are human (and most of the exceptions are either fictional characters or small groups of humans), so the mass of the distribution lies nearly entirely in the region of utility-function-space I was discussing: the members of the distribution are almost all about as well (and as badly) aligned as the distribution of behaviors found in humans, shaped by human moral instincts as formed by evolutionary psychology. Concerningly, that distribution includes the few percent of people on the sociopathy spectrum (and also fictional supervillains). Hopefully the scatter of black points and the red point are good enough to address that, and to produce a fairly consistent utility function (or at least a narrower distribution) that doesn’t have much chance of generating sociopathy. [Note that I’m assuming alignment first narrows the distribution of utility functions from that of the base model somewhat towards the region shown in your last diagram, and only then am I saying that the AI, at that point in the process, may be smart and well-enough aligned to correctly figure out that it should be converging to the blue dot, not the red dot.]
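To make that “narrowing” step a bit more concrete, here is a minimal toy sketch of what I mean (my own illustration, not anything from your diagrams): utility functions are modeled as points in a 2D plane, the base model’s prior is a broad human-like cluster plus a small sociopathic cluster, and alignment training is crudely represented as reweighting toward the red point. All of the clusters, coordinates, and weights are made-up assumptions purely for illustration.

```python
# Toy illustration only (assumptions: 2D "utility-function space", made-up clusters and weights).
import numpy as np

rng = np.random.default_rng(0)

# Base-model prior over simulated characters: ~97% broadly human-like, ~3% "sociopathy spectrum".
n = 10_000
human_like = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(int(n * 0.97), 2))
sociopathic = rng.normal(loc=[4.0, -4.0], scale=0.5, size=(n - human_like.shape[0], 2))
samples = np.vstack([human_like, sociopathic])

# "Alignment training" sketched as importance-reweighting toward a target (the red point):
# characters far from the target get exponentially down-weighted.
red_point = np.array([0.5, 0.5])           # the imperfect training target
dist = np.linalg.norm(samples - red_point, axis=1)
weights = np.exp(-2.0 * dist)              # sharper narrowing -> larger coefficient
weights /= weights.sum()

# Probability mass on the sociopathic cluster before and after narrowing.
is_sociopathic = np.arange(n) >= human_like.shape[0]
print("mass on sociopathic cluster, base model:", is_sociopathic.mean())
print("mass on sociopathic cluster, after narrowing:", weights[is_sociopathic].sum())
```

The only point of the sketch is the last two numbers: the prior puts a few percent of its mass on the bad cluster, and even a crude narrowing step drives that mass close to zero, which is the precondition for the later “converge under reflection toward the blue dot” claim below.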
So I fully agree there’s plenty we still have to get right in the process of turning a base model into a hopefully-nearly-aligned agent. But I do think there is a potential for convergence under reflection towards the blue dot, specifically because the blue dot is defined as “what we’d want the AI’s behavior to be”. Logically and morally, that makes it a distinguished target. (And yes, that definition probably defines only a hopefully-small region, not a unique specific point, once you take into account the possible different forms of CEV-like processes.)
(I’m also not considering the effect of jailbreaks, prompt injection, and so forth on our agent: base models are far more responsive to things like these than a human is, for obvious reasons, and part of the work of alignment is to try to reduce that flexibility. A base model could probably even be prompted to attempt to simulate a paper-clip maximizer, but that’s not what it’s trained to do well.)