(My own view, not Convergence’s, as with most/all of my comments)
I think that’s quite an interesting perspective (and thanks for sharing it!). I think I’m new enough to this topic, and it’s complicated enough, that I personally should sort-of remain agnostic for now on whether it’s better to a) try to find a “consistent version” of a person’s actual, intransitive preferences, or b) just accept those intransitive preferences and try to fulfil them as best we can.
As an example of my agnosticism on a very similar question, in another post I published the day after this one, I wrote:
Value conflict (VC) is when some or all of the values a person (or group) actually has are in conflict with each other. It’s like the person has multiple, competing utility functions, or different “parts of themselves” pushing them in different directions.
[...] It seems unclear whether VC is a “problem”, as opposed to an acceptable result of the fragility and complexity of our value systems. It thus also seems unclear whether and how one should try to “solve” it. That said, it seems like three of the most obvious options for “solving” it are to:
[...]
* Embrace moral pluralism
** E.g., decide to keep as values each of the conflicting values, and just give them a certain amount of “say” or “weight” in your decision-making.
Also, somewhat separately from the broader question of “fixing” vs “accepting” intransitivity, I find your idea for a “humble morality” or “democratic referee” singleton quite interesting. At first glance, it seems plausible as a way to sort-of satisfice in an acceptable way: guaranteeing a fairly high chance of an acceptably good outcome, even if perhaps at the cost of not getting the very best outcomes and letting some value be lost. I’d be interested in seeing a top-level post fleshing that idea out further (if there isn’t one already?), and/or some comments seeing if they can tear the idea apart :D
Parts of your comment also reminded me of Rohin Shah’s post on AI safety without goal-directed behaviour (and the preceding posts), in which he says, e.g.:
At a high level, I think that the main implication of this view is that we should be considering other models for future AI systems besides optimizing over the long term for a single goal or for a particular utility or reward function.
One small nit-pick/question, though: you write “Reducing morality to one dimension makes it boring at best; and, if Goodhart has anything to say about it, ultimately even immoral.” I don’t see why creating a “consistent version” of someone’s preferences and extracting a utility function from that would be reducing morality to one dimension. The utility function could still reflect a lot of the complexity and fragility of values (caring about many different things, caring about interactions between things, having diminishing returns or “too much of a good thing”, etc.), even if it does shave off some/a lot of that complexity for the sake of consistency.
I guess we could say that the whole utility function is the “one dimension”, but that’d seem misleading to me. That’s because it seems like the usual reason given for worrying about something like a “one-dimensional morality” is that it won’t reflect that we value multiple things in complicated ways, whereas the whole utility function can reflect that (at least in part, even if it doesn’t fully capture our original, intransitive values).
Am I misunderstanding what you meant there?
Yes, the utility function is the “one dimension”. Of course, it can be as complicated as you’d like, taking into account multiple aspects of reality. But ultimately, it has to give a weight to those aspects; “this 5-year-old’s life is worth exactly XX.XXX times more/less than this 80-year-old’s life” or whatever. It is a map from some complicated (effectively infinite-dimensional) Omega to a simple one-dimensional utility.
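To make the “one dimension” point a bit more concrete, here is a rough Python sketch. It is only an illustration under assumptions of my own: the attribute names (`wellbeing`, `fairness`), the weights, and both function names are hypothetical, and a weighted sum is just the simplest possible utility function, not a claim about what a realistic one looks like. The first function shows how any such function ends up fixing an exchange rate between the things it cares about; the second brute-forces whether a finite set of strict preferences can be represented by any real-valued utility at all, which fails as soon as the preferences are cyclic (intransitive).

```python
from itertools import permutations


def weighted_utility(outcome, weights):
    """Collapse a multi-attribute outcome into a single real number.

    Both arguments are dicts keyed by (hypothetical) attribute names.
    """
    return sum(weights[name] * amount for name, amount in outcome.items())


def has_utility_representation(options, prefers):
    """Check whether some ranking of `options` respects every strict preference.

    `prefers` is a set of (better, worse) pairs. For a finite set of strict
    preferences, a real-valued utility exists exactly when at least one
    total order is consistent with all the pairs, so we simply try them all.
    """
    for order in permutations(options):
        rank = {option: position for position, option in enumerate(order)}
        # A lower position means "more preferred" in this candidate order.
        if all(rank[better] < rank[worse] for better, worse in prefers):
            return True
    return False


# Hypothetical weights: one unit of wellbeing counts for two units of fairness.
weights = {"wellbeing": 2, "fairness": 1}
print(weighted_utility({"wellbeing": 3, "fairness": 4}, weights))  # 10

transitive = {("A", "B"), ("B", "C")}
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
print(has_utility_representation("ABC", transitive))  # True
print(has_utility_representation("ABC", cyclic))      # False
```

The weights can be made as elaborate as you like (interactions, diminishing returns, and so on), but the output is still one number per outcome, so every pair of outcomes gets ranked. The cyclic case is the situation where no such number can be assigned without first changing the preferences.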