I believe this is precisely the wrong thing to be trying to do. We should be trying to philosophically understand how intransitive preferences can be (collectively) rational, not trying to remove them because they’re individually irrational.
(I’m going to pose the rest of this comment in terms of “what rules should an effectively-omnipotent super-AI operate under”. That’s not because I think that such a singleton AI is likely to exist in the foreseeable future; but rather, because I think it’s a useful rhetorical device and/or intuition pump for thinking about morality.)
Once you’ve reduced everything down to a single utility function, and you’ve created an agent or force substantially more powerful than yourself who’s seeking to optimize that utility function, it’s all downhill from there. Or uphill, whatever; the point is, you no longer get to decide on outcomes. Reducing morality to one dimension makes it boring at best; and, if Goodhart has anything to say about it, ultimately even immoral.
Luckily, “curl” (intransitive preference order) isn’t just a matter of failures of individual rationality. Condorcet cycles, Arrow’s theorem, the Gibbard-Satterthwaite theorem; all of these deal with the fact that collective preferences can be intransitive even when the individual preferences that make them up are all transitive.
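To make the Condorcet-cycle point concrete, here is a minimal sketch (the voter profile is the standard textbook example, not anything from the original discussion): three voters, each with a perfectly transitive ranking, whose pairwise majority preference is nonetheless cyclic.

```python
# Three voters, each with a transitive ranking (best to worst).
voters = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y, rankings):
    """True if a strict majority ranks x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}:", majority_prefers(x, y, voters))
# Every individual ranking is transitive, yet the majority prefers
# A over B, B over C, and C over A: a cycle.
```

Each pairwise contest is won 2–1, so the aggregate preference has "curl" even though no individual's does.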
Imagine a “democratic referee” AI (or god, or whatever) that operated under roughly the following “3 laws”:
0. In comparing two world-states, consider the preferences of all “people” who exist in either, for some clear and reasonable definition of “people” which I’ll leave unspecified.
1. If a world-state is Pareto dominated, act to move towards the Pareto frontier.
2. If an agent or agents are seeking to change from one world-state A to another B, and neither Pareto dominates the other, then thwart that change iff a majority prefers the status quo A over B AND there is no third world-state C such that a majority prefers B over C and a majority prefers C over A.
3. Accumulate and preserve power, insofar as it is compatible with laws 1 and 2.
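Law 2 is the interesting one, so here is a toy mechanization of it, under the simplifying assumption of complete individual rankings over a finite set of world-states (the function names and setup are mine, purely illustrative):

```python
def majority_prefers(x, y, rankings):
    """True if a strict majority ranks x above y (rankings: best to worst)."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

def should_thwart(a, b, states, rankings):
    """Law 2: block a move from status quo a to b iff a majority prefers
    a over b AND there is no third state c with majority b > c and c > a."""
    if not majority_prefers(a, b, rankings):
        return False  # no majority for the status quo: allow the change
    for c in states:
        if c in (a, b):
            continue
        if majority_prefers(b, c, rankings) and majority_prefers(c, a, rankings):
            return False  # b sits inside a majority cycle with a: don't thwart
    return True
```

Note the effect of the second clause: when the majority preference over {A, B, C} is cyclic, the referee stands aside and lets other agents move between the cycle's states, which is exactly the "humility" being argued for.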
An entity which followed these laws would be, in practice, far “humbler” than one which had a utility function over world-states. For instance, if there were a tyrant hogging all the resources that they could reasonably get any enjoyment whatsoever out of, the “referee” would allow that inequality to continue; though it wouldn’t allow it to be instituted in the first place. Also, this “referee” would not just allow Omelas to continue to exist; it would positively protect its existence for as long as “those who walk away” were a minority.
So I’m not offering this hypothetical “referee” as my actual moral ideal. But I do think that moral orderings should be weak orderings over world states, not full utility functions. I’d call this “humble morality”; and, while as I said above I don’t actually guess that singleton AIs are likely, I do think that if I were worried about singleton AIs I’d want one that was “humble” in this sense.
Furthermore, I think that respecting the kind of cyclical preferences that come from collective preference aggregation is useful in thinking about morality. And, part of my guess that “singleton AIs are unlikely” comes from thinking that maintaining/enforcing perfectly coherent aggregated preferences over a complex system of parts is actually a harder (and perhaps impossible) problem than AGI.
(My own view, not Convergence’s, as with most/all of my comments)
I think that’s quite an interesting perspective (and thanks for sharing it!). I think I’m new enough to this topic, and it’s complicated enough, that I personally should sort-of remain agnostic for now on whether it’s better to a) try to find a “consistent version” of a person’s actual, intransitive preferences, or b) just accept those intransitive preferences and try to fulfil them as best we can.
As an example of my agnosticism on a very similar question, in another post I published the day after this one, I wrote:
Value conflict (VC) is when some or all of the values a person (or group) actually has are in conflict with each other. It’s like the person has multiple, competing utility functions, or different “parts of themselves” pushing them in different directions.
[...] It seems unclear whether VC is a “problem”, as opposed to an acceptable result of the fragility and complexity of our value systems. It thus also seems unclear whether and how one should try to “solve” it. That said, it seems like three of the most obvious options for “solving” it are to:
[...]
* Embrace moral pluralism
  * E.g., decide to keep as values each of the conflicting values, and just give them a certain amount of “say” or “weight” in your decision-making.
Also, somewhat separately from the broader question of “fixing” vs “accepting” intransitivity, I find your idea for a “humble morality” or “democratic referee” singleton quite interesting. At first glance, it seems plausible as a way to sort-of satisfice in an acceptable way—guaranteeing a fairly high chance of an acceptably good outcome, even if perhaps at the cost of not getting the very best outcomes and letting some value be lost. I’d be interested in seeing a top level post fleshing that idea out further (if there isn’t one already?), and/or some comments seeing if they can tear the idea apart :D
Parts of your comment also reminded me of Rohin Shah’s post on AI safety without goal-directed behaviour (and the preceding posts), in which he says, e.g.:
At a high level, I think that the main implication of this view is that we should be considering other models for future AI systems besides optimizing over the long term for a single goal or for a particular utility or reward function.
One small nit-pick/question, though: you write “Reducing morality to one dimension makes it boring at best; and, if Goodhart has anything to say about it, ultimately even immoral.” I don’t see why creating a “consistent version” of someone’s preferences and extracting a utility function from that would be reducing morality to one dimension. The utility function could still reflect a lot of the complexity and fragility of values (caring about many different things, caring about interactions between things, having diminishing returns or “too much of a good thing”, etc.), even if it does shave off some/a lot of that complexity for the sake of consistency.
I guess we could say that the whole utility function is the “one dimension”, but that’d seem misleading to me. That’s because it seems like the usual reason given for worrying about something like a “one-dimensional morality” is that it won’t reflect that we value multiple things in complicated ways, whereas the whole utility function can reflect that (at least in part, even if it doesn’t fully capture our original, intransitive values).
Am I misunderstanding what you meant there?
Yes, the utility function is the “one dimension”. Of course, it can be as complicated as you’d like, taking into account multiple aspects of reality. But ultimately, it has to give a weight to those aspects; “this 5-year-old’s life is worth exactly XX.XXX times more/less than this 80-year-old’s life” or whatever. It is a map from some complicated (effectively infinite-dimensional) Omega to a simple one-dimensional utility.
I think you do need preference aggregation, not just conflict resolution. For example, if the AI finds $100 on the ground, should it give the money to Alice or Bob? Either would be a Pareto improvement over the current state. To decide, it needs weighting coefficients for different people’s utilities—which direction vector to take when moving toward the Pareto frontier. That’s the same information as having a joint utility function. And that utility function could end up determining most of the future, if the universe has lots of stuff that the AI can acquire and give to humans, compared to stuff that humans can acquire by themselves.
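The $100 example can be made concrete with a toy sketch (the utility numbers are made up for illustration): both allocations are Pareto improvements over the status quo, so Pareto reasoning alone cannot choose between them; the extra information needed is a set of weights, which is exactly a joint utility function.

```python
# Made-up utilities; both allocations Pareto-improve on the status quo.
status_quo = {"alice": 10.0, "bob": 10.0}
give_alice = {"alice": 11.0, "bob": 10.0}
give_bob   = {"alice": 10.0, "bob": 11.0}

def pareto_improves(new, old):
    """At least as good for everyone, strictly better for someone."""
    return (all(new[p] >= old[p] for p in old)
            and any(new[p] > old[p] for p in old))

assert pareto_improves(give_alice, status_quo)
assert pareto_improves(give_bob, status_quo)

def joint_utility(state, weights):
    """The weighted sum: the 'direction vector' toward the Pareto frontier."""
    return sum(weights[p] * state[p] for p in state)

# The weights are the contested extra information.
weights = {"alice": 0.6, "bob": 0.4}
best = max([give_alice, give_bob], key=lambda s: joint_utility(s, weights))
```

Changing the weights flips the answer, which is the sense in which the choice of weighting coefficients carries the same information as choosing a joint utility function.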
Good point. Here’s the raw beginnings of a response:
The idea here would be to resolve such questions “democratically” in some sense. I’m intentionally leaving unspecified what I mean by that because I don’t want to give the impression that I think I could ever tie up all the loose ends with this proposal. In other words, this is a toy example to suggest that there’s useful space to explore in between “fully utilitarian agents” and “non-agents”, and that agents with weakly-ordered and/or intransitive preferences may in some senses be superior to fully-utilitarian ones.
I realize that “democratic” answers to the issue you raise will tend to be susceptible to the “Omelas problem” (a majority that gets small benefits by imposing large costs on a minority) and/or the repugnant conclusion (“cram the world with people until barely-over-half of their lives are barely-better-than-death”). Thus, I do not think that “majority rules” should actually be a foundational principle. But I do think that when you encounter intransitivity in collective preferences, it may in some cases be better to live with that than to try to subtract it out by converting everything into comparable-and-summable utility functions.
Some time ago I made this argument to Said Achmiz:
The costs [of consistency] seem small to me, because consistency requires nothing more than having an ordering on possible worlds. For example, if some possible world seems ok to you, you can put it at the top of the ordering. So assuming infinite power, any ok outcome that can be achieved by any other system can be achieved by a consistent system.
With that in mind, can you try spelling out in what sense an inconsistent system could be better?
Consistency is the opposite of humility. Instead of saying “sometimes, I don’t and can’t know”, it says “I will definitively answer any question (of the correct form)”.
Let’s assume that there is some consistent utility function that we’re using as a basis for comparison. This could be the “correct” utility function (e.g., God’s); it could be a given individual’s extrapolated consistent utility; or it could be some well-defined function of many people’s utility.
So, given that we’ve assumed that this function exists, obviously if there’s a quasi-omnipotent agent rationally maximizing it, it will be maximized. This outcome will be at least as good as if the agent is “humble”, with a weakly-ordered objective function; and, in many cases, it will be better. So, you’re right, under this metric, the best utility function is equal-or-better to any humble objective.
But if you get the utility function wrong, it could be much worse than a humble objective. For instance, consider adding some small amount of Gaussian noise to the utility. Depending on the noise distribution and the size of the option space, the probability that the “optimized” outcome has a utility arbitrarily close to the lower bound could be arbitrarily high. Meanwhile, I think you can argue that a “humble” deus ex machina, by allowing other agents more power to choose between world-states over which the machina has no strict preference, would be less likely to end up in such an arbitrarily bad “Goodhart” outcome.
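A small simulation of the milder, optimizer’s-curse flavour of this point (it shows the direction of the effect, not the stronger “arbitrarily close to the lower bound” claim, which needs heavier-tailed errors; all numbers here are arbitrary choices of mine):

```python
import random

random.seed(0)
N_STATES, TRIALS = 1000, 200
NOISE = 2.0  # noise std, larger than the signal std of 1

shortfall = 0.0
for _ in range(TRIALS):
    true_u = [random.gauss(0, 1) for _ in range(N_STATES)]
    proxy = [u + random.gauss(0, NOISE) for u in true_u]
    # Optimize the noisy proxy, then score the pick by its true utility.
    picked = max(range(N_STATES), key=lambda i: proxy[i])
    shortfall += max(true_u) - true_u[picked]

print("mean true-utility shortfall:", shortfall / TRIALS)
# Positive on average: selecting hard on a noisy proxy reliably loses
# true utility relative to the true optimum.
```

With Gaussian noise the loss is bounded in expectation; the stronger claim above is about error distributions and option spaces where optimizing the proxy actively steers toward states whose noise term, not whose true utility, is extreme.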
This response is a bit sketchy, but does it answer your question?
It makes sense to value other agents having power, but are you sure that value can’t be encoded consistently?