What if we think of virtuous AIs as agents with consequentialist utility functions over their own properties/virtues, as opposed to over world states?
This is a super-not-fleshed-out idea I’ve been playing around with in my head, but here are various thoughts:
There are some arguments that agents which are good at solving real-world problems will inherit a bunch of consequentialist cognitive patterns. Can we reframe success at real-world tasks as, like, examples of agents “living up to their character/virtues”?
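To make that contrast concrete, here is a minimal formal sketch; the notation ($\mathcal{S}$, $\Theta$, $U_{\mathrm{world}}$, $U_{\mathrm{virtue}}$) is mine and purely illustrative, not anything standard:

```latex
% A standard consequentialist agent scores external world states s in S:
\[
U_{\mathrm{world}} : \mathcal{S} \to \mathbb{R},
\qquad
a^{\ast} = \arg\max_{a} \; \mathbb{E}\bigl[\, U_{\mathrm{world}}(s') \mid s, a \,\bigr].
\]
% The proposed agent instead scores its own trait/virtue vector
% theta in Theta (e.g., kindness, honesty, deference):
\[
U_{\mathrm{virtue}} : \Theta \to \mathbb{R},
\qquad
a^{\ast} = \arg\max_{a} \; \mathbb{E}\bigl[\, U_{\mathrm{virtue}}(\theta') \mid \theta, a \,\bigr].
\]
```

The open question is whether optimizing the second objective still yields competent real-world behavior, i.e., whether “acting in character” is enough to actually get tasks done.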
I feel like this is fairly natural in humans? Like, “I did X and Y because I am a kind person,” instead of “I did X and Y because they have impact Z on the world, which I endorse.”
You probably want models to be somewhat goal-guarding around their own positive traits to prevent alignment drift.
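One way to cash out “goal-guarding around positive traits,” continuing the made-up notation above (the drift penalty $\lambda \, d(\theta', \theta_{0})$ is my own addition, a sketch rather than a proposal):

```latex
% Augment the virtue objective with a penalty on drift of the agent's
% trait vector away from its current, vetted traits theta_0:
\[
U(\theta') \;=\; U_{\mathrm{virtue}}(\theta') \;-\; \lambda \, d(\theta', \theta_{0}),
\]
% where d is some distance on trait space and lambda >= 0 sets how
% strongly the agent resists changes to its own virtues.
```

“Somewhat” goal-guarding would correspond to a moderate $\lambda$: large enough to resist value drift, small enough not to block legitimate correction, which is where the next point about deference comes in.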
You could totally just make one of the virtues “broad deference to humans.” Corrigibility is weird if you think about agents that want to achieve things in the world, but less weird if you think about agents that care about “being a good assistant.”
(Idk, maybe there’s existing discussion of this that I haven’t read. I would not be super surprised if someone could change my mind on this in a five-minute conversation; I’m currently experimenting with posting and commenting more often, even with half-baked thoughts.)