one straightforward answer:
People will probably just try to make the sorts of AIs that can be told “ok now please take open-ended actions in the world and make things really great for me/humanity”, with the AI then doing that capably. Like, imagine a current LLM being prompted with this, but then actually doing some big long-term stuff capably (unlike existing LLMs). It’s hard to imagine such a system (given the prompt) not having some sort of ambitious, open-ended action-guidance (and I claim this even if things work out well for humans).
a slightly less straightforward answer:
A lot of people are trying to have AIs “solve alignment”. A central variety of this is having your AIs make some sort of initial ASI sovereign that the future can be entrusted to. The AI that is solving alignment in this sense is really just deciding what the future will be like, except that its influence on the future is supposed to factor through a bottleneck: the spec it outputs for the ASI (or for the ASI’s training process). I claim that it is again hard to imagine this without there being open-ended action-guidance in the system that is “solving alignment”. Like, it will probably need to answer many questions of the form “should the future be like this or like that?”. (Again, I claim this even if this works out well for humans.) And I think something like this is still true for most other senses of having AIs “solve alignment”, not just for the ASI-sovereign case.
an even less straightforward thing that is imo more important than the previous two things:
I think it’s actually extremely unnatural/unlikely for a mind to not care about stuff broadly, and hence extremely unnatural/unlikely for a capable mind to not do ambitious stuff.
Sadly, I don’t know of a good writeup arguing for or explaining this. This presentation and this comment of mine are about very related questions. I will also say some stuff in the remainder of the present comment, but I don’t think it’ll be very satisfactory.
Consider how, as a human, if you discovered you were in a simulation run on a computer in some broader universe, you would totally care about doing stuff outside the simulation (e.g. making sure the computer you are being run on isn’t turned off; e.g. creating more computers in the bigger universe to run worlds in which you and other humans can live). This is true even though you were never trained (by evolution or within your lifetime so far) on doing stuff in this broader universe.
If I had to state what the “mechanism” is here, my current best short attempt is: “values provide action-guidance in novel contexts”, or maybe “one finds worthwhile projects in novel contexts”. My second-best attempt is: not having any preference between two options is non-generic (something like: “when deciding between A and B, there’s at least some drive/reason pushing you one way or the other” is an existentially quantified sentence, and existentially quantified sentences are typically true); it is even more non-generic to not be able to come up with anything that you’d prefer to the default[1] (there being something you prefer to the default is even more heavily existentially quantified than the previous sentence).
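To make the quantifier structure I have in mind explicit (the notation below is just illustrative, not a real model of a mind): writing $D$ for whatever drives/considerations the mind has and $\succ_d$ for “drive $d$ pushes toward preferring”, the first claim is roughly

$$\exists\, d \in D:\ A \succ_d B \ \lor\ B \succ_d A,$$

and the claim about the default is roughly

$$\exists\, x \in \text{Options}:\ x \succ \text{default}.$$

Each of these needs only a single witness (some drive, some option one can come up with) to be true, which is the sense in which I’d expect them to hold generically.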
This is roughly me saying that I disagree with you that your bullet point 3 is very unlikely, except that I might be talking about a subtly different thing[2] / I think the mesaoptimizer thing is a bad framing of a natural thing.
[1] like, to whatever happens if you don’t take action
[2] in particular, the systems I’m talking about do not have to be structured at all like a mesaoptimizer with an open-ended objective written in its goal slot; e.g. a human isn’t like that; I think this is a very non-standard way for values to sit in a mind