Most of the AI takeover thought experiments and stories I remember are about a kind of AI that has open-ended goals: the Squiggle Maximizer, the Sorcerer’s Apprentice robot, Clippy, probably also U3, Consensus-1, and Sable. I wonder what concrete mechanisms could even lead to models having open-ended goals.
Here are my best guesses:
1. Training on open-ended tasks, given enough capabilities or the right scaffolding
2. RL with open-ended reward specifications, like maximizing cumulative reward with no terminal reward and no time penalty (like the Coast Runners example of specification gaming; see the toy sketch after this list)
3. Mesa-optimization, where SGD finds a policy that internally implements an open-ended objective that happens to perform well on a bounded outer task
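To make number 2 concrete, here is a toy sketch (my own illustration with made-up numbers, not anything from a real RL setup): under a bounded spec with a terminal reward and a time penalty, the return is maximized by finishing and stopping, whereas under a cumulative shaping reward with no terminal reward and no time penalty, the return grows without bound the longer the agent keeps playing, which is the Coast Runners failure mode.

```python
# Toy sketch (my illustration, not from the post) contrasting a bounded
# reward specification with the open-ended one described in item 2.
# All numbers are made up; the point is the shape of the incentive.

def bounded_return(steps_to_finish: int) -> float:
    """Terminal reward for completing the task plus a per-step time
    penalty: return is maximized by finishing quickly and stopping."""
    TERMINAL_REWARD = 100.0
    TIME_PENALTY = 1.0
    return TERMINAL_REWARD - TIME_PENALTY * steps_to_finish

def open_ended_return(steps_alive: int) -> float:
    """Cumulative shaping reward (e.g. points per target hit), no
    terminal reward, no time penalty: return grows monotonically with
    episode length, so 'never finish' dominates 'finish the race'."""
    SHAPING_REWARD_PER_STEP = 3.0
    return SHAPING_REWARD_PER_STEP * steps_alive

if __name__ == "__main__":
    for steps in (10, 100, 1000):
        print(f"steps={steps:5d}  "
              f"bounded={bounded_return(steps):8.1f}  "
              f"open-ended={open_ended_return(steps):8.1f}")
```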
Number 3 seems possible but very unlikely, because the learned objective would need to persist beyond the episode, outperform simpler heuristics on the training distribution, and not be suppressed by training signals like a penalty for wasted computation.
Things I think are not realistic mechanisms of open-ended goal formation:
- Instrumental convergence, because if subgoals are instrumental there’s no incentive to keep pursuing them after the parent goal has been accomplished
- Uncertainty about whether a goal has been accomplished (the Sorcerer’s Apprentice failure mode), because this hypothetical is not bearing out empirically (current models are satisficers that don’t seem to reason about minimizing uncertainty)
So, what concrete mechanisms could lead to models having open-ended goals?
As models become capable enough to model themselves and their training process, they might develop something like preferences about their own future states (e.g., not being modified, being deployed more broadly).
Also, models trained extensively on human-generated text may absorb human goals, including open-ended ones like “acquire resources.” If a model is role-playing or emulating an agent with such goals (such as role-playing an AI agent, which would plausibly have open-ended goals) and becomes capable enough that its actions have real-world consequences, then it effectively has open-ended goals. I also claim next-token prediction is actually pretty open-ended.
Not sure if I agree wrt instrumental convergence. I think you’re assuming the system knows the parent goal has been accomplished with certainty, and more importantly that the parent goal can be accomplished in a terminal sense. Many real training objectives don’t have neat termination conditions. A model trained to “be helpful” or “maximize user engagement” has no natural stopping point.
This feels plausible to me but handwavy, if the idea is that such preferences would be decoupled from the training-reinforced preference to complete an intended task. Is that what you meant? I’m reminded of this Palisade study on shutdown resistance, where, across the board, the models expressed wanting to avoid shutdown in order to complete the task.
This makes sense to me as a possible concrete mechanism to keep an eye out for.
I’m assuming the pattern we’re seeing so far will hold, which is that models satisfice rather than try to figure out how to maximize their certainty of a goal being accomplished. The “become a maximizer to minimize uncertainty” thing isn’t empirically grounded, so far.
Hm. Models are trained to “be helpful” now, and they stop just fine. I do agree that “maximize user engagement” has no natural stopping point; it’s the kind of concrete mechanism I tried to capture in number 1 above (Training on open-ended tasks).
I claim the reasons models stop right now are mostly issues of capability wrt context rot and the limitations of in-context learning, so I think if you placed a model with “today’s values” in a model with “tomorrow’s capabilities” then we’d see maximizing behaviour. I also claim that arguments from how things are right now aren’t applicable here, because the claim is that instrumental convergence is a step change for which current models are a poor analogy (unless there’s a specific reason to believe they’d be good ones, like a well-made model organism).
one straightforward answer:
People will probably just try to make the sorts of AIs that can be told “ok now please take open-ended actions in the world and make things really great for me/humanity”, with the AI then doing that capably. Like, imagine a current LLM being prompted with this, but then actually doing some big long-term stuff capably (unlike existing LLMs). It’s hard to imagine such a system (given the prompt) not having some sort of ambitious open-ended action-guidance (like, even if this works out well for humans).
a slightly less straightforward answer:
A lot of people are trying to have AIs “solve alignment”. A central variety of this is having your AIs make some sort of initial ASI sovereign that the future can be trusted to. The AI that is solving alignment in this sense is really just deciding what the future will be like, except its influence on the future is supposed to factor through a bottleneck — through the ASI (training process) spec it outputs. I claim that it is again hard to imagine this without there being open-ended action-guidance in the system that is “solving alignment”. Like, it will probably need to answer many questions of the form “should the future be like this or like that?”. (Again, I claim this even if this works out well for humans.) And I think sth like this is still true for most other senses of having AIs “solve alignment”, not just for the ASI sovereign case.
an even less straightforward thing that is imo more important than the previous two things:
I think it’s actually extremely unnatural/unlikely for a mind to not care about stuff broadly, and hence extremely unnatural/unlikely for a capable mind to not do ambitious stuff.
Sadly, I don’t know of a good writeup [arguing for]/explaining this. This presentation and this comment of mine are about very related questions. I will also say some stuff in the remainder of the present comment but I don’t think it’ll be very satisfactory.
Consider how as a human, if you discovered you were in a simulation run on a computer in some broader universe, you would totally care about doing stuff outside the simulation (e.g. making sure the computer you are being run on isn’t turned off; e.g. creating more computers in the bigger universe to run worlds in which you and other humans can live). This is true even though you were never trained (by evolution or within your lifetime so far) on doing stuff in this broader universe.
If I had to state what the “mechanism” is here, my current best short attempt is: “values provide action-guidance in novel contexts”, or maybe “one finds worthwhile projects in novel contexts”. My second-best attempt is: not having any preference between two options is non-generic (sth like: “when deciding between A and B, there’s at least some drive/reason pushing you one way or the other” is an existentially quantified sentence, and existentially quantified sentences are typically true); it is even more non-generic to not be able to come up with anything that you’d prefer to the default [1] (there being something that you prefer to the default is like even more existentially quantified than the previous sentence).
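A rough way to formalize the quantifier point (my own notation, not a standard result or anything from the comment above): having some preference between A and B only needs one witness among the agent’s drives, while total indifference pins down an exact equality for every drive.

```latex
% My own hedged formalization of the quantifier asymmetry above;
% V is the agent's set of drives/values, A and B are options.
\[
  \underbrace{\exists\, v \in V:\ v(A) \neq v(B)}_{\text{some drive distinguishes } A \text{ from } B}
  \qquad \text{vs.} \qquad
  \underbrace{\forall\, v \in V:\ v(A) = v(B)}_{\text{total indifference}}
\]
% For generic real-valued drives, the universal statement is a
% measure-zero condition (every drive exactly balanced), while the
% existential one needs only a single witness, so it is "typically
% true" in the sense used above.
```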
This is roughly me saying that I disagree with you that your bullet point 3 is very unlikely, except that I might be talking about a subtly different thing [2] / I think the mesaoptimizer thing is a bad framing of a natural thing.
[1] like, to whatever happens if you don’t take action
[2] in particular, the systems I’m talking about do not have to be structured at all like a mesaoptimizer with an open-ended objective written in its goal slot; e.g. a human isn’t like that; I think this is a very non-standard way for values to sit in a mind