Utility functions that are a function of time (or other context) do not convergently self-preserve in that way. The things that self-preserve are utility functions that steadily care about the state of the real world.
Actors that exhibit different-seeming values given different prompts/contexts can be modeled as having utility functions, but those utility functions won’t often be of the automatically self-preserving kind.
In the limit of self-modification, we expect the stable endpoints to be self-preserving. But you don’t necessarily have to start with an agent that stably cares about the world. You could start with something relatively incoherent, like an LLM or a human.
> Utility functions that are a function of time (or other context)
?? Do you mean utility functions that care that, at time t = p, thing A happens, but at t = q, thing B happens? Such a utility function would still want to self-preserve.
> utility functions that steadily care about the state of the real world
Could you taboo “steadily”? *All* utility functions care about the state of the real world; that’s what a utility function is: a description of the exact manner in which an agent cares about the state of the real world. And even if a utility function wants different things to happen at different times, it would still not want to be modified into a different utility function that wants other things to happen at those times.
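Here’s a minimal sketch (notation mine, not from the thread) of the kind of time-indexed utility function at issue, and why an expected-utility maximizer of it would still resist having it rewritten:

```latex
% Illustrative only (notation mine): a utility over trajectories \tau = (s_0, s_1, \ldots)
% that wants thing A to hold at time p and thing B to hold at time q.
\[
  U(\tau) \;=\; \mathbf{1}\!\left[A \text{ holds in } s_p\right]
          \;+\; \mathbf{1}\!\left[B \text{ holds in } s_q\right]
\]
% An agent choosing its policy \pi to maximize expected U,
%   \pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim \pi}\,[U(\tau)],
% scores the option "let U be overwritten by some U'" using U itself, not U'.
% A U'-maximizing successor will generally steer toward trajectories with lower U,
% so the U-maximizer prefers to keep U, even though U wants different things
% to happen at different times.
```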
Hm, yeah, I think I got things mixed up.
The things you can always fit to an actor are utility functions over trajectories. Even if I do irrational-seeming things (like not self-preserving), that can be accounted for by a preference over trajectories.
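For concreteness, a sketch (again, notation mine) of the standard construction: any policy is trivially optimal for an indicator-style utility function over whole trajectories.

```latex
% Sketch of the standard "any behavior maximizes some utility over trajectories"
% construction (my notation, not from the thread).
% Let \pi be the actor's actual policy, mapping each history h_t to the action it takes.
% For a trajectory \tau = (s_0, a_0, s_1, a_1, \ldots), define
\[
  U_{\pi}(\tau) \;=\;
  \begin{cases}
    1 & \text{if } a_t = \pi(h_t) \text{ at every step } t \text{ of } \tau, \\
    0 & \text{otherwise.}
  \end{cases}
\]
% Whatever the actor does (including "irrational-seeming" things like declining
% to self-preserve), its behavior is trivially optimal for U_{\pi}.
```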
But will a random utility function over trajectories (random with respect to some sort of simplicity-ish measure) want to self-preserve? For those utility functions where taking actions is useful at all, it does seem more likely than not that they will convergently self-preserve. Whoops!
Of course, the underlying reason that humans and LLMs do irrational-seeming things is not because they’re sampled from a simplicity-ish distribution over utility functions over trajectories, so I think Zac’s question still stands.