It seems you’re very optimistic that there are only four utility functions.
If that were true, you could do something like a value handshake and optimize the merged result (toy sketch below).
Or you could find a stable sequence of changes that is endorsed by all four utility functions at every step and ends with all four utility functions being equivalent.
Or you could create separate enclaves that allow each group to maximize its own values internally; over time the group that cares about other groups adopting its values would win out, while the other groups still get their values met in the meantime.
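To make the value-handshake option concrete, here is a minimal toy sketch, not anyone's actual proposal: the world-states, the four utility functions, and the status-quo payoffs are all made up for illustration. The merged agent maximizes a weighted sum of the four utilities, and the handshake only counts as a deal for weightings whose optimum leaves every party at least as well off as the status quo.

```python
# Toy "value handshake": four agents with different utility functions agree to
# jointly optimize a weighted combination of their utilities, keeping only
# weightings that every agent prefers to the no-deal status quo.
# All world-states, utilities, and numbers below are hypothetical.

from itertools import product

worlds = ["A", "B", "C", "D"]
utilities = {
    "u1": {"A": 1.0, "B": 0.2, "C": 0.5, "D": 0.0},
    "u2": {"A": 0.1, "B": 1.0, "C": 0.6, "D": 0.0},
    "u3": {"A": 0.3, "B": 0.4, "C": 0.9, "D": 0.0},
    "u4": {"A": 0.2, "B": 0.3, "C": 0.8, "D": 1.0},
}
status_quo = {"u1": 0.4, "u2": 0.4, "u3": 0.4, "u4": 0.4}  # payoff with no handshake

def best_world(weights):
    """World-state that maximizes the weighted sum of all four utilities."""
    return max(worlds, key=lambda w: sum(weights[u] * utilities[u][w] for u in utilities))

# Search a coarse grid of weightings; keep only deals every agent would endorse.
grid = [i / 4 for i in range(5)]
deals = []
for ws in product(grid, repeat=4):
    if sum(ws) == 0:
        continue
    weights = dict(zip(utilities, ws))
    w = best_world(weights)
    if all(utilities[u][w] >= status_quo[u] for u in utilities):
        deals.append((weights, w))

print(deals[:3])  # a few weightings whose optimum every agent prefers to the status quo
```

With only four fixed, known utility functions, something like this search is cheap; the hard part the comment is pointing at is whether those four functions exist at all.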
But that’s assuming there are only four utility functions. Other options:
Humans don’t have consistent preferences, so their values can’t be expressed as a utility function.
We do have terminal preferences that are consistent, but they are shaped by our experiences, such that there are as many utility functions as there are humans.
Evolution gave us all only one terminal preference, and the seeming differences are just different strategies for reaching that goal. In that case alignment is very easy.
My point is that I think there are a lot of hidden assumptions in this post, both about the problem space and the solution space.