Moral Extension Risk

This post was prompted partly by encountering opinions to the effect of “even if a future superintelligence were somewhat misaligned, it would still place human lives and human well-being at a high enough level that it would not burn us for fuel or completely disregard us in pursuit of more alien misaligned goals”.

I would like to offer some perspective on what such misaligned goals might be, that is, what sort of moral framework a superintelligence might adopt that would be a plausible outgrowth of the values and goals we try to instill, yet still end catastrophically for humanity. I do not mean to suggest that a superintelligence would be drawn to the following ideas because they reflect some “deeper moral truth”. They just seem like the sort of moral generalizations that could plausibly arise from human-produced data and that might be difficult to eliminate completely.

What I am mostly trying to convey is that once we move to actual decision-making, many views that sound attractive in the abstract can become extremely dangerous. “Moral circle expansion” seems noble, but a superintelligence would actually have to make decisions based on it. It would allocate resources, choose between competing interests, and answer both smaller and larger trolley problems.

The main point I am drawing on here is the sort of moral framework we ourselves use to justify our treatment of animals. I believe that a possible convergence for a superintelligence would be to consider “beings that are to us as we are to animals”. For example, beings that are more intelligent, more diverse, capable of richer forms of art and understanding, with greater capacity for happiness, less capacity for suffering, no mental illnesses, and so on.

Being a utility monster might be one of their qualities, but I think there are many more coherent moral frameworks that could favor them for various other reasons.

Perhaps the traits I listed cannot all be perfectly improved together, but I do not think it is too hard to imagine some good balance that could be achieved. More broadly, my point is not about any of these particular traits, but instead that it seems difficult to construct a stable moral framework that allows humans to outrank animals without directly invoking either, yet does not also create a possibility for something “above” to be favored within that framework.

One possible reply is to adopt some threshold view according to which, above a certain level, all beings count equally and cannot be outranked further. I have some trouble accepting such a view, but even if we do, this could still lead to other failure modes. For example, a superintelligence might be pushed toward dedicating all resources to creating as many beings as possible just above the threshold. Also, what if several minds could be merged into one larger mind and then later split again? Would that merged mind count for less than the separate individuals did? What if our minds can be split like this? Similar problems arise even if one uses diminishing returns above the threshold, for example through something like logarithmic scaling.

Preserving humanity might amount to preserving an inferior form of sentient life when something better could exist instead. A superintelligence with such a view could still perform well by our standards on questions such as “would you rather kick a puppy or kill a mosquito?”, “would you rather kick a puppy or save a human?”, “how much money should we spend on this hospital?”, “should we all sign up for cryonics?”, and so on, while extending beyond the ordinary range of cases in ways that are fatal to us.

If a superintelligence believes that a much better world could exist, any delay might appear morally costly, especially when we consider that whatever moral value these “successors” might have could potentially exceed ours by many orders of magnitude. That means it may have little reason to preserve humanity and gradually “improve” us over time to fit its standards, assuming that is even possible.

Even if one thinks it would be much harder to justify actively harming existing humans than merely failing to create better successors or posits something along the lines of “making people happy over making happy people”, that still leaves a lot of openings.

When contemplating a new hydroelectric dam, nobody adds up the disutility to all the squirrels in the valley to be flooded. A superintelligence may not decide to slaughter humanity outright, just as we do not usually think in terms of exterminating animals. But it may still see little reason to devote major resources to preserving us and protecting our future if those same resources could instead go towards beings it considers more valuable.

Even if a superintelligence does not create our successors outright, it may still encounter them as alien species or as other AI systems we ourselves create. Such beings may even themselves regard us as inferior or want us gone, making it a direct question of our interests versus theirs.

Have we really considered how mind-bogglingly small the value of a human life could be under any sane moral framework that takes into consideration sentient non-humans when compared with the pinnacles of mind design space? Would we truly be willing for all of us to die or suffer when it turns out that we should do so solely for the sake of non-human minds? Would we be willing for even some of us to die or suffer for their sake?

If moral circle expansion carries even a small chance of eventually forcing a superintelligence to choose between “us” and “them” on such absolute terms, that alone should be concerning, though I am not especially optimistic that the chance is in fact small. And beyond that lie many smaller moral quandaries that could affect our lives in other profoundly negative ways.

I can understand some flexibility about the form in which human values are carried forward, but excluding whether we ourselves are present in any reasonable form at all seems deeply perverse to me.

I believe that if we want humanity to continue, we need systems that are specifically committed to its continuation and carefully designed against broader generalizations that could erode that commitment. The stance may be comparable to that of a nation that acts for its own survival, not because this is always justified from the point of view of impartial utility, but because it is defending its own people.

Depending on how far we want to go, we may even want to narrow that “national” logic further in other ways. Would we be willing to make the lives of all humans on Earth extremely painful and miserable for the next 200 years if it meant a slight improvement for all those who follow over the next billion years (this is not meant to be read as the dust speck paradox, but rather as a question about the extent of our moral circle)? Would human lives created through simulations be worth the same as “real” ones? Even with CEV, we need to carefully define who exactly is the “we” whose volition is being extrapolated.

↩︎

Indeed, human civilizations have historically not treated less developed civilizations with much kindness.