[Question] What concrete mechanisms could lead to AI models having open-ended goals?

Most of the AI takeover thought experiments and stories I remember are about a kind of AI that has open-ended goals: the Squiggle Maximizer, the Sorcerer’s Apprentice robot, Clippy, probably also U3, Consensus-1, and Sable. I wonder what concrete mechanisms could even lead to models having open-ended goals.

Here are my best guesses:

  1. Training on open-ended tasks (e.g. “maximize long-run revenue”), given enough capabilities or the right scaffolding to actually act on that open-endedness

  2. RL with open-ended reward specifications, like maximizing cumulative reward with no terminal reward and no time penalty (the CoastRunners specification-gaming example; a toy version is sketched after this list)

  3. Mesa-optimization, where SGD finds a policy that internally implements an open-ended objective that happens to perform well on a bounded outer task
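
To make (2) concrete, here’s a toy version of the CoastRunners dynamic (my own construction, not the actual game or training setup): under a reward spec that pays per target hit, with no terminal reward for finishing and no time penalty, a policy that loops through respawning targets beats a policy that finishes the race at every horizon.

```python
# Toy CoastRunners-style reward spec (illustrative numbers of my own):
# +points per target hit, no terminal reward for finishing, no time penalty.

POINTS_PER_TARGET = 3.0

def finish_return(targets_on_route: int) -> float:
    # Policy A: drive straight to the finish line; the episode then ends,
    # so the return is capped at whatever targets lie along the route.
    return targets_on_route * POINTS_PER_TARGET

def loop_return(horizon: int, steps_per_loop: int, targets_per_loop: int) -> float:
    # Policy B: circle through a cluster of respawning targets, never finishing.
    return (horizon // steps_per_loop) * targets_per_loop * POINTS_PER_TARGET

for horizon in (100, 1_000, 10_000):
    print(f"horizon={horizon:>6}  finish={finish_return(5):>7.0f}  "
          f"loop={loop_return(horizon, steps_per_loop=10, targets_per_loop=2):>7.0f}")
# Policy B's return grows linearly with the horizon, so the return-maximizing
# behavior under this spec is "never stop".
```

Whether the trained model internalizes that as a goal, rather than merely exhibiting the looping behavior, is of course a further question.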

Number 3 seems possible but very unlikely, because the learned objective would need to persist beyond the episode, outperform simpler heuristics on the training distribution, and not be suppressed by training signals like a penalty for wasted computation.
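
To spell out the “possible” half of that: here’s a minimal toy (my own construction) showing that when training episodes truncate at N steps, the reward signal can’t distinguish a bounded inner objective from an open-ended one, because they only diverge after the truncation point.

```python
# Toy illustration: training episodes truncate after N steps, so a bounded
# inner objective ("do the task N times, then stop") and an open-ended one
# ("do the task forever") produce identical training trajectories.

N = 10  # truncation point of training episodes

def bounded_policy(step: int) -> str:
    return "act" if step < N else "stop"

def open_ended_policy(step: int) -> str:
    return "act"  # never stops; episode truncation hides this during training

rollout = [(bounded_policy(t), open_ended_policy(t)) for t in range(N)]
print(all(a == b for a, b in rollout))          # True: same behavior, same reward
print(bounded_policy(N), open_ended_policy(N))  # stop act  <- divergence is off-distribution
```

The objections above are about whether SGD would actually land on the open-ended variant, not about whether it’s consistent with the training reward.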

Things I think are not realistic mechanisms of open-ended goal formation:

  1. Instrumental convergence, because if subgoals are instrumental there’s no incentive to keep pursuing them after the parent goal has been accomplished

  2. Uncertainty about whether a goal has been accomplished (the Sorcerer’s Apprentice failure mode; see the sketch after this list), because this hypothetical is not bearing out empirically: current models are satisficers that don’t seem to reason about minimizing uncertainty
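
For completeness, here’s the expected-utility arithmetic behind that dismissed failure mode, with toy numbers I made up: a maximizer that is never fully certain the goal holds keeps taking “one more” corrective action as long as the expected gain from nudging P(goal) upward exceeds the modeled cost of acting, whereas a satisficer with a fixed threshold stops after a few steps.

```python
# Sorcerer's Apprentice arithmetic with made-up numbers: keep acting while
# EU(act) > EU(stop), i.e. while (1 - p) * EFFECT * GOAL_VALUE > ACTION_COST.

GOAL_VALUE = 1_000.0  # utility if the goal actually holds
ACTION_COST = 0.01    # modeled cost of one more corrective action
EFFECT = 0.5          # each action closes half of the remaining uncertainty

def eu_stop(p: float) -> float:
    return p * GOAL_VALUE

def eu_act(p: float) -> float:
    p_after = p + (1.0 - p) * EFFECT
    return p_after * GOAL_VALUE - ACTION_COST

p, steps = 0.9, 0
while eu_act(p) > eu_stop(p):
    p += (1.0 - p) * EFFECT
    steps += 1
print(f"maximizer acted {steps} more times, stopping at P(goal) = {p:.8f}")
# -> 13 extra actions, stopping only near P(goal) ~ 0.9999878; a satisficer
#    with a threshold of, say, P > 0.99 would have stopped after 4 steps.
```

Current models don’t seem to run this loop, which is exactly why I don’t find the mechanism realistic; the sketch just shows what the loop would look like if they did.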

So, what concrete mechanisms could lead to models having open-ended goals?
