Value Stability and Aggregation

One of the central problems of Friendly Artificial Intelligence is goal system stability. Given a goal system—whether it’s a utility function, a computer program, or a couple kilograms of neural tissue—we want to determine whether it’s stable: is there something that might plausibly happen to it that would radically alter its behavior in a direction we don’t like? As a first step in solving this problem, let’s consider a classic example of a goal system that is not stable.

Suppose you are a true Bentham-Mill Utilitarian, which means you hold that the right thing to do is that which maximizes the amount of happiness minus the amount of pain in the world, summed up moment by moment. Call this HapMax for short. You determine this by assigning each person a happiness-minus-pain score at each moment, based on a complex neurological definition, and adding up the scores of each person-moment. One day, your work as an antidepressant research chemist is interrupted by a commotion outside. Rushing out to investigate, you find a hundred-foot-tall monster rampaging through the streets of Tokyo, which says:

“I am a Utility Monster. Robert Nozick grew me in his underwater base, and now I desire nothing more than to eat people. This makes me very happy, and because I am so very tall and the volume of my brain’s reward center grows with the cube of my height, it makes me *so* happy that it will outweigh the momentary suffering and shortened lifespan of anyone I eat.”

As a true HapMaxer (not to be confused with a human, who might claim to be a HapMaxer but can’t actually be one), you find this very convincing: the right thing to do is to maximize the number of people the monster can eat, so you heroically stand in front of the line of tanks now rolling down Main Street to buy it time. HapMax seemed like a good idea at first, but this example shows that it is very wrong. What lessons should we learn before trying to build another utility function? HapMax starts by dividing the world into pieces, one per agent, and the trouble begins when one of those agents doesn’t behave as expected.
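For concreteness, here is a minimal sketch of the HapMax calculation itself, assuming the per-person, per-moment happiness and pain scores have already been produced somehow (the complex neurological definition is the hard part, and is waved away here):

```python
def hapmax(person_moments):
    """HapMax: happiness minus pain, summed over every person-moment."""
    return sum(happiness - pain for happiness, pain in person_moments)

# Two people over three moments each; HapMax just adds everything up.
world = [(5, 1), (4, 0), (6, 2),   # person A's (happiness, pain) moments
         (3, 1), (2, 2), (7, 0)]   # person B's moments
print(hapmax(world))  # 21
```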

Dividing and Recombining Utility

Human values are too complex to specify in one go, so like other complex things, we manage the complexity by subdividing the problem, solving the pieces, then recombining them back into a whole solution. Let’s call these sub-problems value fragments, and the recombination procedure utility aggregation. If all of the fragments are evaluated correctly and the aggregation procedure is also correct, then this yields a correct solution.

There are plenty of different ways of slicing up utility functions, and we can choose as many of them as desired. You can slice up a utility function by preference type—go through a list of desirable things like “amount of knowledge” and “minus-amount of poverty”, assign a score to each representing the degree to which that preference is fulfilled, and assign a weighting to each representing its importance and degree of overlap. You can slice it up by branch—go through all the possible outcomes, assigning a score to each outcome representing how nice a world it is and a weighting for its probability. You can slice it up by agent—go through all the people you know about, and assign a score for how good things are for them. And you can slice it up by moment—go through a predicted future step by step, and assign a score for how good the things in the world at that moment are. Any of these slices yields value fragments; a fragment is any reference class that describes a portion of the utility function.
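As a rough illustration with made-up numbers, the same linear utility total can be computed under two different slicings; the fragments change, but the whole does not:

```python
# Hypothetical scores: scores[agent][moment] for a single predicted branch.
scores = {
    "alice": {"t0": 3, "t1": 5},
    "bob":   {"t0": 2, "t1": 1},
}

# Sliced by agent: one fragment per person.
by_agent = sum(sum(moments.values()) for moments in scores.values())

# Sliced by moment: one fragment per time step.
by_moment = sum(sum(scores[a][t] for a in scores) for t in ("t0", "t1"))

assert by_agent == by_moment == 11  # same total, different fragments
```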

Meta-ethics, then, consists of three parts. First, we choose an overall structure, most popularly a predictor and utility function, and subdivide it into fragments, such as by preference, branch, agent, and moment. Then we specify the subdivided parts—either with a detailed preference-extraction procedure like the one Coherent Extrapolated Volition calls for but doesn’t quite specify, or something vague like “preferences”. Finally, we add an aggregation procedure.
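Sketched as code, with every name below a placeholder rather than a proposal, the three parts fit together roughly like this (note that the probability-weighted sum over branches already bakes in one particular aggregation choice):

```python
from typing import Callable, Iterable

# 1. Overall structure: a predictor mapping an action to possible outcome
#    worlds, each with a probability.
Predictor = Callable[[object], Iterable[tuple[float, object]]]

# 2. Value fragments: each one scores a single slice of a world.
Fragment = Callable[[object], float]

# 3. Aggregation procedure: combines fragment scores into one number.
Aggregator = Callable[[list[float]], float]

def evaluate(action, predict: Predictor, fragments: list[Fragment],
             aggregate: Aggregator) -> float:
    """Score an action: predict outcomes, score every fragment, aggregate."""
    return sum(prob * aggregate([f(world) for f in fragments])
               for prob, world in predict(action))
```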

The aggregation procedure is what determines how stable a utility function is in the face of localized errors. It was a poor choice of aggregation function that made HapMax fail so catastrophically. HapMax aggregates by simple addition, and its utility function is divided by agent. That makes an awful lot of dissimilar fragments. What happens if some of them don’t behave as expected? Nozick’s Utility Monster problem is exactly that: one of the agents produces utilities that diverge to extremely large values, overpowering the others and breaking the whole utility function.
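Here is that failure in miniature, with entirely invented numbers: under simple addition, one agent whose reported utility is large enough outweighs any harm to everyone else.

```python
def hapmax_aggregate(per_agent_scores):
    """HapMax aggregation: divide by agent, then simply add the scores."""
    return sum(per_agent_scores)

ordinary_people = [1.0] * 1_000_000   # a million mildly happy people
monster = [10_000_000.0]              # one utility monster, mid-meal

before = hapmax_aggregate(ordinary_people)
after = hapmax_aggregate(ordinary_people[:-100] + monster)  # 100 people eaten

print(before, after)  # 1000000.0 10999900.0 -- the monster's term dominates
```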

Aggregation and Error

If human values are as complex as we think, then it is extremely unlikely that we will ever manage to correctly specify every value fragment in every corner case. Therefore, to produce a stable system of ethics and avoid falling for any other sorts of utility monsters, we need to model the sorts of bugs that fragments of our utility function might have, and choose an aggregation function that makes the utility function resilient—that is, we’d like it to keep working and still represent something close to our values even if some of the pieces don’t behave as expected. Ideally, every value would be specified multiple times from different angles, and the aggregation function would ensure that no one bug anywhere could cause a catastrophe.
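As a toy picture of what such resilience could look like (an illustration of the redundancy idea, not a worked-out proposal): if the same value is specified several times from different angles, combining the copies with a bug-tolerant statistic such as the median means no single faulty copy controls the result.

```python
import statistics

def redundant_value(independent_estimates):
    """Combine several attempts to specify the same value.

    The median ignores a single wildly wrong copy; a plain mean would not.
    """
    return statistics.median(independent_estimates)

# Three attempts to score the same situation; one has a divergence bug.
print(redundant_value([4.2, 3.9, 1e12]))  # 4.2
```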

We saw how linear aggregation can fail badly when aggregating over agents—one agent with a very steep utility function gradient can overpower every other concern. However, this is not just a problem for aggregating agents; it’s also a problem for aggregating preferences, branches, and moments. Aggregation between branches breaks down in Pascal’s Mugging, which features a branch with divergent utility, and in anthropic problems, where the number of branches is not as expected. Aggregation between moments breaks down when considering Astronomical Waste, which features a time range with divergent utility. The effect of linearly aggregating distinct preference types is a little harder to predict, since it depends on just what the inputs are and what bugs they have, but the failure modes are mostly as bad as tiling the universe with molecular smiley faces, and Goodhart’s Law suggests that closing every loophole is impossible.
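The branch-slicing version of the same failure, again with invented numbers, looks like this: one branch with a divergent payoff dominates the probability-weighted sum no matter how implausible it is.

```python
def expected_utility(branches):
    """Linear aggregation over branches: a probability-weighted sum."""
    return sum(prob * utility for prob, utility in branches)

mundane = [(0.5, 10.0), (0.5, -10.0)]   # ordinary, balanced outcomes
mugging = [(1e-20, 1e30)]               # a wildly implausible, enormous promise

print(expected_utility(mundane))            # 0.0
print(expected_utility(mundane + mugging))  # 1e+10 -- the absurd branch wins
```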

If linear aggregation is so unstable, then how did it become so popular in the first place? It’s not that no other possibilities were considered. For example, there’s John Rawls’ Maximin Principle, which says that we should arrange society so as to maximize how well off the worst-off person is. Now, the Maximin Principle is extremely terrible—it implies that if we find the one person who’s been tortured the most, and we can’t stop them from being tortured but can make them feel better about it by torturing everyone else, then we should do so. But there are some aggregation strategies that fail less badly, and aren’t obviously insane. For example, we can aggregate different moral rules by giving each rule a veto over predicted worlds and over possible actions. When this fails—if, for example, every course of action is vetoed—it shuts down, effectively reverting to a mostly-safe default. Unfortunately, aggregation by veto doesn’t quite work because it can’t handle trolley problems, where every course of action is somehow objectionable and there is no time to shut down and punt the decision to a human.

The space of possible aggregation strategies, however, is largely unexplored. There is one advantage which has been proven unique to linear aggregation: Dutch Book resistance. However, this may be less important than mitigating the damage bugs can do, and it may be partially recoverable by having utility be linear within a narrow range, and switching to something else (or calling on humans to clarify) in cases outside that range.
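A minimal sketch of those two alternatives, with made-up rules, thresholds, and names: aggregation by veto, which shuts down when everything is vetoed, and a linear aggregator that only trusts itself within a narrow range and escalates to a human outside it.

```python
class NeedsHumanClarification(Exception):
    """Raised when a score falls outside the range where we trust linearity."""

def veto_aggregate(options):
    """Each moral rule may veto each option; if everything is vetoed, shut down.

    `options` maps an action to the list of verdicts (True = permitted) that
    each rule gave it. Returning None stands in for the mostly-safe default.
    """
    permitted = [action for action, verdicts in options.items() if all(verdicts)]
    return permitted or None

def clamped_linear(fragment_scores, low=-100.0, high=100.0):
    """Add fragment scores, but only while every score stays in a narrow range."""
    if any(not (low <= s <= high) for s in fragment_scores):
        raise NeedsHumanClarification(fragment_scores)
    return sum(fragment_scores)
```

On this sketch’s assumptions, the clamped version keeps Dutch Book resistance only inside the range where it stays linear, trading some coherence for a chance to notice divergent fragments before they dominate.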

Classifying Types of Errors

I believe the next step in tackling the value system stability problem is to explore the space of possible aggregation strategies, evaluating each according to how it behaves when the values it aggregates fail in certain ways. So here, then, is a classification of possible value-fragment errors. Each of these can apply to any reference class:

  • Deletion: The agent forgets about a fragment. A branch is overlooked; a preference is forgotten, incorrectly deemed inapplicable, or its fulfillment can’t be predicted.

  • Insertion: A random extra preference is added; a branch that’s actually impossible is predicted as an outcome; an agent that doesn’t exist or isn’t morally significant is posited.

  • Divergence: A value fragment or its gradient has a value with a much larger magnitude than expected, possibly infinite or as large as an arbitrary value chosen by some agent.

  • Noise: Each fragment’s estimated utility has an error term added, drawn from a Gaussian, log-normal, or other distribution.

  • Scaling: The agent encounters or envisions a scenario in which the number of times a value is tested for is qualitatively different than expected.

A good utility function, if it contains subdivisions, must be able to survive errors in any one or even several of those divisions while still representing something close to our values. What sort of function might achieve that purpose?
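One concrete way to begin that exploration, sketched here with invented fragments and crude error models, is a fault-injection harness: take a candidate aggregation function, corrupt one fragment at a time using the error classes above, and measure how far the aggregate moves.

```python
import random
import statistics

ERROR_KINDS = ["deletion", "insertion", "divergence", "noise", "scaling"]

def inject_error(fragments, kind, rng):
    """Return a copy of `fragments` with one error of the given kind injected."""
    out = list(fragments)
    i = rng.randrange(len(out))
    if kind == "deletion":
        del out[i]
    elif kind == "insertion":
        out.insert(i, rng.uniform(-10, 10))
    elif kind == "divergence":
        out[i] = 1e12
    elif kind == "noise":
        out[i] += rng.gauss(0, 1)
    elif kind == "scaling":
        out += [out[i]] * 1000  # the value is counted far more often than expected
    return out

def fragility(aggregate, fragments, trials=1000, seed=0):
    """Worst-case shift in the aggregate under single-fragment errors."""
    rng = random.Random(seed)
    baseline = aggregate(fragments)
    return max(abs(aggregate(inject_error(fragments, rng.choice(ERROR_KINDS), rng))
                   - baseline)
               for _ in range(trials))

fragments = [1.0, 2.0, 0.5, 1.5, 3.0]
print(fragility(sum, fragments))                # huge: divergence and scaling dominate
print(fragility(statistics.median, fragments))  # small: no single error moves it far
```

None of this says which aggregator is right; it only gives a way to compare candidates against the error classes above before trusting any of them with real values.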