> Utility functions that are a function of time (or other context)
?? Do you mean utility functions that care that, at time t = p, thing A happens, but at t = q, thing B happens? Such a utility function would still want to self-preserve.
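To spell out the kind of thing I mean, a toy formalization (my own notation, nothing more): over a trajectory $\tau = (s_0, s_1, \dots, s_T)$ with $p < q \le T$, take

$$U(\tau) \;=\; \mathbb{1}[\,s_p = A\,] \;+\; \mathbb{1}[\,s_q = B\,].$$

An agent maximizing expected $U$ still loses expected utility if it is shut down before time $q$, since it can no longer steer $s_q$ toward $B$; caring about different things at different times doesn’t remove the instrumental pressure to stay around.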
> utility functions that steadily care about the state of the real world
Could you taboo “steadily”? *All* utility functions care about the state of the real world; that’s what a utility function is (a description of the exact manner in which an agent cares about the state of the real world). And even if a utility function wants different things to happen at different times, an agent with that function would still not want to be modified into one with a different utility function that wants other things to happen at those times.
Hm, yeah, I think I got things mixed up.
The things you can always fit to an actor are utility functions over trajectories. Even if I do irrational-seeming things (like not self-preserving), that can be accounted for by a preference over trajectories.
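Concretely, the construction I have in mind (standard, in my own notation): let $\pi$ be whatever policy I actually follow, and define

$$U(\tau) \;=\; \begin{cases} 1 & \text{if every action in } \tau \text{ is what } \pi \text{ would do at that point,} \\ 0 & \text{otherwise.} \end{cases}$$

Following $\pi$ achieves $U = 1$ with certainty, so my actual behavior, self-preserving or not, is trivially optimal for this $U$.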
But will a random utility function over trajectories (random under some sort of simplicity-ish measure) want to self-preserve? For those utility functions under which the agent’s actions matter at all, it does seem more likely than not that the optimal behavior convergently self-preserves. Whoops!
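For what it’s worth, the kind of toy check I have in mind looks like this. It is entirely my own construction: a tiny environment where the agent picks A, B, or an irreversible OFF at each of four steps, with i.i.d. random values per trajectory standing in for “random with a simplicity-ish measure” (which they are not, so treat the numbers as illustration only).

```python
import itertools
import random
from collections import Counter

# Toy check: sample many "random utility functions over trajectories"
# (an i.i.d. value per distinct trajectory) and see when, if ever, the
# utility-maximizing trajectory turns the agent off.
# Environment: each step the agent picks A, B, or OFF; OFF is irreversible.

HORIZON = 4
LIVE_ACTIONS = ["A", "B"]


def all_trajectories():
    """All distinct observation sequences of length HORIZON."""
    trajs = set()
    for seq in itertools.product(LIVE_ACTIONS + ["OFF"], repeat=HORIZON):
        out, off = [], False
        for action in seq:
            off = off or action == "OFF"
            out.append("off" if off else action)
        trajs.add(tuple(out))
    return sorted(trajs)


def shutdown_step(traj):
    """First step at which the agent is off, or 'never'."""
    return traj.index("off") if "off" in traj else "never"


def main(samples=20_000, seed=0):
    rng = random.Random(seed)
    trajs = all_trajectories()
    tally = Counter()
    for _ in range(samples):
        utility = {t: rng.random() for t in trajs}  # one random utility function
        best = max(trajs, key=utility.get)          # its optimal trajectory
        tally[shutdown_step(best)] += 1
    print(f"{len(trajs)} distinct trajectories; the optimal one shuts down at:")
    for step in ["never", 3, 2, 1, 0]:
        print(f"  {step}: {tally[step] / samples:.1%}")


if __name__ == "__main__":
    main()
```

With i.i.d. values this is just counting: 16 of the 31 distinct trajectories never shut down, and only one shuts down immediately, so the optimal trajectory almost never turns the agent off early; being off forecloses most of the trajectories the utility function might happen to like. A genuinely simplicity-weighted measure could shift these numbers, which is part of why I only say “seems more likely than not.”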
Of course, the underlying reason that humans and LLMs do irrational-seeming things is not because they’re sampled from a simplicity-ish distribution over utility functions over trajectories, so I think Zac’s question still stands.