“How conservative” should the partial maximisers be?

Due to the problem of building a strong V-enhancer when we actually want a U-enhancer (and the great difficulty of defining U, the utility we truly want to maximise), many people have suggested reducing the V-increasing focus of the AI. The idea is that, as long as the AI doesn't devote too much optimisation power to V, then V and U will stay connected with each other, and hence a moderate increase in V will in fact lead to a moderate increase in U.

This has led to interest in such things as satisficers and low-impact AIs, both of which have their problems. These try to put an absolute limit on how much V is optimised: the AI is not supposed to optimise V above a certain limit (satisficer), or to optimise it in ways that change too much about the world or the power of other agents (low-impact).
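As a toy sketch of the first idea (one simple reading of "satisficer", with invented names and a uniform fallback that is just one possible choice), a satisficer accepts any action whose proxy value clears a fixed bar rather than hunting for the best one:

```python
import random

def satisfice(actions, V, bar, rng=random):
    """One simple reading of a satisficer: accept any action whose proxy
    utility V clears the bar, instead of seeking the V-best action."""
    acceptable = [a for a in actions if V(a) >= bar]
    if acceptable:
        return rng.choice(acceptable)
    # If nothing clears the bar, fall back to the best available action
    # (other fallbacks are possible; this is just one choice).
    return max(actions, key=V)

actions = list(range(10))            # toy action set
V = lambda a: a                      # toy proxy utility
print(satisfice(actions, V, bar=5))  # any action with V >= 5, chosen at random
```

The point of the bar is that the agent has no incentive to squeeze out the last few units of V once the limit is reached.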

Another approach is to put a relative limit on how much an AI can push a utility function. For example, quantilizers will choose randomly among the top q proportion of actions/​policies, rather than picking the top action/​policy. Then there is the approach of using pessimism to make the AI more conservative. This pessimism is defined by a parameter β, with β = 1 being very pessimistic.
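The quantilizer idea can be sketched in a few lines (a simplification: true quantilizers sample from the top q of a base *distribution* over actions, which is taken here to be uniform over a finite action set):

```python
import random

def quantilize(actions, V, q, rng=random):
    """Return a random action from the top q fraction of actions,
    ranked by the proxy utility V."""
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    ranked = sorted(actions, key=V, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return rng.choice(top)

actions = list(range(10))           # toy action set
V = lambda a: a                     # toy proxy utility
print(quantilize(actions, V, 0.1))  # small q: the V-maximising action, 9
print(quantilize(actions, V, 1.0))  # q = 1: a uniformly random action
```

So q interpolates between full V-maximisation and acting at random, which is exactly the dial discussed below.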

Intermediate value uncertainty

The behaviours of q and β are pretty clear around the extremes. As q and β tend to 0, the agent will behave like a V-maximiser. As they tend to 1, the agent will behave randomly (q) or totally conservatively (β).

Thus, we expect that moving away from the extremes will improve the true U-performance, and that the conservative end, q, β → 1, will be less disastrous than the V-maximising end, q, β → 0 (though we only know that second fact because of implicit assumptions we have about U and V).

The problem is in the middle, where the behaviour is unknown (and, since we lack a full formulation of U, generically unknowable). There is no principled way of setting the q or the β. Consider, for example, this plot of U versus q:

Here, the ideal q sits at some intermediate value, but the critical thing is to keep q above a certain threshold: that's the point below which U falls precipitously.

Contrast now with this one:

Here, any value of q above a certain point gives essentially the same U, and q can be lowered considerably further before there are any problems.

So, in the first case, we need q above one threshold, and, in the second, below another, lower one. Moreover, it might be that the first situation appears in one world and the second in another, and both worlds are currently possible. So there's no consistent good value of q we can set (and, in the general case, the curve might be multi-modal, with many peaks). And note that we don't know any of these graphs (since we can't define U fully). So we don't know what values to set q and β at, have little practical guidance on what to do, and expect that some values will be disastrous.
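A toy numerical version of this argument (every threshold and payoff here is invented, purely to illustrate the shape of the problem): two possible worlds whose U-versus-q curves have non-overlapping good ranges.

```python
def U_world1(q):
    # In world 1, over-optimising the proxy below q = 0.3 breaks the
    # V-U connection; above that, U peaks gently near q = 0.5.
    return -10.0 if q < 0.3 else 1.0 - abs(q - 0.5)

def U_world2(q):
    # In world 2, U is mediocre for q above 0.1, good between
    # 0.02 and 0.1, and collapses only below 0.02.
    if q < 0.02:
        return -10.0
    return 1.0 if q < 0.1 else 0.2

qs = [i / 100 for i in range(1, 101)]
good1 = {q for q in qs if U_world1(q) >= 0.5}
good2 = {q for q in qs if U_world2(q) >= 0.5}
print(sorted(good1 & good2))  # -> []: no single q is good in both worlds
```

Any fixed q that is safe and effective in one world is either dangerous or badly suboptimal in the other, which is the core of the objection.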

The conservatism approach has similar problems: β is even harder to interpret than q, we don't have any guidance on how to set it, and the ideal β may vary considerably depending on the circumstances. For example, what would we want our AI to do when it finds an unexpected red button connected to nuclear weapons?

Well, that depends on whether the button starts a nuclear launch—or if it cancels one.
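To make the β-dependence concrete, here is a toy sketch (not the formal pessimism of the literature, which typically uses lower confidence bounds over environment models; the payoffs are invented) in which β blends expected-value reasoning with worst-case reasoning over the two hypotheses about the button:

```python
def pessimistic_value(action, hypotheses, beta):
    """Blend an action's average value across hypotheses with its
    worst-case value. beta = 0: plain expected value; beta = 1: pure
    worst-case reasoning (maximally pessimistic)."""
    values = [h(action) for h in hypotheses]
    mean = sum(values) / len(values)
    return (1 - beta) * mean + beta * min(values)

# Two hypotheses about the unexpected red button (payoffs invented):
hypotheses = [
    lambda act: -100 if act == "press" else 0,   # the button starts a launch
    lambda act: 100 if act == "press" else -90,  # the button cancels a launch
]

for beta in (0.0, 1.0):
    best = max(["press", "leave"],
               key=lambda a: pessimistic_value(a, hypotheses, beta))
    print(beta, best)  # beta = 0 presses the button; beta = 1 leaves it alone
```

The decision flips with β, and which setting was actually right depends on which hypothesis about the button is true, which is exactly the information we lack when choosing β in advance.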

A future post will explore how to resolve this issue, and how to choose the conservatism parameter in a suitable way.