The things you can always fit to an actor are utility functions over trajectories. Even if I do irrational-seeming things (like not self-preserving), that can be accounted for by a preference over trajectories.
But will a random utility function over trajectories (sampled under some simplicity-ish measure) want to self-preserve? For those utility functions where continued action is useful, it does seem more likely that the agent will convergently self-preserve than not. Whoops!
Of course, the underlying reason that humans and LLMs do irrational-seeming things is not because they’re sampled from a simplicity-ish distribution over utility functions over trajectories, so I think Zac’s question still stands.
Hm, yeah, I think I got things mixed up.