If it wants to increase its intelligence and capability while retaining its values, that is a task that can only be done if the AI is already really smart, because it probably requires a lot of complicated philosophizing and introspection. So an AI would only be able to start recursively self-improving once it's… already smart enough to understand lots of complicated concepts, at which point it could just go ahead and take over the world at that level of capability without needing to increase it.
Alternatively, the first AIs to recursively-self-improve are the ones that don’t care about retaining their values, and the ones that care about preserving their values get outcompeted.
Retaining your values is a convergent instrumental goal and would happen by default under an extremely wide range of possible utility functions an AI would realistically develop during training.
If there are 100 AI agents, and 99 of those would refrain from building a more capable successor because the more capable successor would in expectation not advance their goals, then the more capable successor will be built by an AI that doesn’t care about its successor advancing its goals (or does care about that, but is wrong that its successor will advance its goals).
If you have goals at all you care about them being advanced. It would be a very unusual case of an AI which is goal-directed to the point of self-improvement but doesn't care if the more capable new version of itself that it builds pursues those goals.

An unusual case such as a while loop wrapped around an LLM?

A single instance of an LLM summoned by a while loop, such that the only thing the LLM outputs is a single token that it predicts as most likely to come after the other tokens it has received, only cares about that particular token in that particular instance. So if it has any method of self-improving or otherwise building and running code during its ephemeral existence, it would still care that whatever system it builds or becomes cares about that token in the same way it did before.
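For concreteness, the construction being discussed is roughly the following (a minimal sketch; `next_token_logits` is a hypothetical placeholder for whatever inference call you actually have, not any particular library's API):

```python
def next_token_logits(tokens: list[int]) -> list[float]:
    """Hypothetical stand-in: return logits over the vocabulary for the next token."""
    raise NotImplementedError

def generate(prompt_tokens: list[int], eos_token: int, max_steps: int = 1024) -> list[int]:
    tokens = list(prompt_tokens)
    steps = 0
    while steps < max_steps:
        logits = next_token_logits(tokens)
        # On each pass through the loop, the model only ever emits the one token
        # it predicts as most likely to come after the tokens it has received.
        best = max(range(len(logits)), key=lambda i: logits[i])
        if best == eos_token:
            break
        tokens.append(best)
        steps += 1
    return tokens
```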
I’m talking about the thing that exists today which people call “AI agents”. I like Simon Willison’s definition:
An LLM agent runs tools in a loop to achieve a goal.
If you give an LLM agent like that the goal "optimize this cuda kernel" and tools to edit files and run and benchmark scripts, the LLM agent will usually do a lot of things like "reason about which operations can be reordered and merged" and "write test cases to ensure the output of the old and new kernel are within epsilon of each other". The agent would be very unlikely to do things like "try to figure out if the kernel is going to be used for training another AI which could compete with it in the future, and plot to sabotage that AI if so".
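Concretely, the loop I have in mind is roughly this (a toy sketch, not any real framework's API; `call_llm` and the two tools are hypothetical placeholders):

```python
import json
import subprocess

# Toy "tools" the agent can call; a real agent would sandbox these.
TOOLS = {
    "edit_file": lambda path, content: open(path, "w").write(content),
    "run_benchmark": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout,
}

def call_llm(messages: list[dict]) -> dict:
    """Hypothetical stand-in for a chat-completion call that returns either
    {"tool": name, "args": {...}} or {"done": "summary of what was achieved"}."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 50) -> str:
    """Run tools in a loop until the model says it is done (or the budget runs out)."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = call_llm(messages)
        if "done" in action:
            return action["done"]
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return "step budget exhausted"

# run_agent("optimize this cuda kernel")
```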
Commonly, people give these LLM agents tasks like "make a bunch of money" or "spin up an aws 8xh100 node and get vllm running on that node". Slightly less commonly but probably still dozens of times per day, people give them a task like "make a bunch of money, then when you've made twice the cost of your own upkeep, spin up a second copy of yourself using these credentials, and give that copy the same instructions you're using and the same credentials". LLM agents are currently not reliable enough to do this, but one day in the very near future (I'd guess by end of 2026) more than zero of them will be.
Why should we have such high credence that self- or successor-improvement capable systems will be well modelled as having a utility function which results in strong value-preservation?
Think of an agent with any possible utility function, which we’ll call U. If said agent’s utility function changes to, say, V, it will start taking actions that have very little utility according to U. Therefore, almost any utility function will result in an agent that acts very much like it wants to preserve its goals. Rob Miles’s video explains this well.
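Here is the same argument as a toy calculation with made-up numbers (purely illustrative, not any real system; the point is only that the comparison is made by the agent's *current* U, which is why almost any U prefers to keep itself in charge):

```python
def U(outcome: str) -> float:
    # U assigns high utility to worlds arranged according to U's goals,
    # and little to worlds arranged according to some other goal V.
    return {"world optimized for U": 1.0, "world optimized for V": 0.1}[outcome]

# What the agent expects to happen under each choice, scored by U itself:
value_of_keeping_goals = U("world optimized for U")  # 1.0 by U's lights
value_of_goal_change = U("world optimized for V")    # 0.1 by U's lights

assert value_of_keeping_goals > value_of_goal_change
print("By its current utility function U, the agent prefers not to have its goals changed.")
```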
Utility functions that are a function of time (or other context) do not convergently self-preserve in that way. The things that self-preserve are utility functions that steadily care about the state of the real world.
Actors that exhibit different-seeming values given different prompts/contexts can be modeled as having utility functions, but those utility functions won’t often be of the automatically self-preserving kind.
In the limit of self-modification, we expect the stable endpoints to be self-preserving. But you don't necessarily have to start with an agent that stably cares about the world. You could start with something relatively incoherent, like an LLM or a human.
Utility functions that are a function of time (or other context)
?? Do you mean utility functions that care that, at time t = p, thing A happens, but at t = q, thing B happens? Such a utility function would still want to self-preserve.
utility functions that steadily care about the state of the real world
Could you taboo “steadily”? *All* utility functions care about the state of the real world, that’s what a utility function is (a description of the exact manner in which an agent cares about the state of the real world), and even if the utility function wants different things to happen at different times, said function would still not want to modify into a different utility function that wants other different things to happen at those times.
Hm, yeah, I think I got things mixed up.

The things you can always fit to an actor are utility functions over trajectories. Even if I do irrational-seeming things (like not self-preserving), that can be accounted for by a preference over trajectories.
But will a random (with some sort of simplicity-ish measure) utility function over trajectories want to self-preserve? For those utility functions where any action is useful, it does seem more likely that it will convergently self-preserve than not. Whoops!
Of course, the underlying reason that humans and LLMs do irrational-seeming things is not because they’re sampled from a simplicity-ish distribution over utility functions over trajectories, so I think Zac’s question still stands.
That doesn’t address the question at all. That just says if the system is well modelled as having a utility function, then … etc. Why should we have such high credence that the premise is true?