One of the questions I’ve heard asked is “how do you design a utility function that would make the AI do nothing?” That is, how could we put a pause button on an AI so that we could pause it if we wanted to? I had an idea about how one could do this, and am sure it has already been thought of, so I’m curious to know why this doesn’t end up working.
Why can’t we just write something like:
If (paused) and not (taking actions): utility=PredictedUtilityIfNotPaused
If not (paused): utility=NormalUtilityFunction
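As a minimal sketch, the proposal above could be written out in Python as follows. The function name `proposed_utility` and the two callables are hypothetical stand-ins for `NormalUtilityFunction` and `PredictedUtilityIfNotPaused`. The post does not say what happens when the agent is paused but still taking actions, so the sketch returns `None` for that case rather than guessing.

```python
from typing import Callable, Optional


def proposed_utility(
    paused: bool,
    taking_actions: bool,
    normal_utility: Callable[[], float],
    predicted_utility_if_not_paused: Callable[[], float],
) -> Optional[float]:
    """Sketch of the two-branch utility function from the post."""
    if not paused:
        # Not paused: score the world with the ordinary utility function.
        return normal_utility()
    if not taking_actions:
        # Paused and idle: receive the utility it predicts it would
        # have received had it not been paused.
        return predicted_utility_if_not_paused()
    # Paused but still acting: left unspecified by the post (assumption:
    # we flag the gap instead of filling it in).
    return None
```

Written this way, the open questions in the replies are easier to point at: everything depends on how `predicted_utility_if_not_paused` is computed and on whether the agent can influence it.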
If NormalUtilityFunction assigns a higher score to futures where its agent avoids being paused, then that has the usual instrumental-convergence problem of the agent avoiding shutdown or utility function modification.
I’ll try modifying your example to instead use UtilityFunction2, which assigns, to worlds where its agent ends up paused, a score equal to the average score assigned to {worlds where the agent will not end up paused} in the agent’s prior as it was just before updating on observing being paused. This doesn’t mean it wouldn’t still instrumentally converge and start disassembling things[1], just that if you tried to pause it, it wouldn’t have a direct incentive to specifically try to stop you, per se, because if you succeed it still assigns a score equal to that of the average world in its prior where you didn’t succeed. It has some interesting indirect incentives to avoid being paused (or to pause itself), though, which I get into below.
Consider if this utility function were to score the worlds in which the agent will take one action or the other: action (1), which has a 99% chance of leading to being paused, and results in an average score of 10 if not paused; and action (2), which has a 5% chance of leading to being paused, and results in an average score of 5 if not paused. In this case, UtilityFunction2 prefers the world where the agent takes action (1). (This alone does not lead to resisting pauses or self-pausing; this is just an observation for now.)
Next, consider how “the average score of {worlds where the agent will not end up paused} in the agent’s prior as it was just before updating on observing being paused” does depend on the agent’s own action policy. In short, the agent can commit to pausing itself in all but the highest-scoring world in its prior to create an extreme version of action (1). (Longer, original explanation in footnote[2])
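The comparison between action (1) and action (2) can be checked with a few lines of arithmetic. This is only a sketch under a simplifying assumption: the score given to paused worlds is taken to be exactly the average score of the unpaused worlds that follow the same action. The function name is hypothetical.

```python
def utility_function_2_expected(p_paused: float, avg_score_if_not_paused: float) -> float:
    """Expected score of an action under the simplified UtilityFunction2."""
    # Paused worlds are assigned the average score of the unpaused worlds.
    score_if_paused = avg_score_if_not_paused
    return (
        p_paused * score_if_paused
        + (1.0 - p_paused) * avg_score_if_not_paused
    )


# Numbers taken from the example above.
action_1 = utility_function_2_expected(p_paused=0.99, avg_score_if_not_paused=10.0)
action_2 = utility_function_2_expected(p_paused=0.05, avg_score_if_not_paused=5.0)

# The probability of being paused drops out, so only the unpaused
# average matters: roughly 10 for action (1) versus roughly 5 for action (2).
preferred = "action (1)" if action_1 > action_2 else "action (2)"
```

Because the pause probability cancels, an action that is almost certain to get the agent paused is not penalized for that at all, which is what makes the self-pausing commitment attractive.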
Separately, I also just noticed that, as worded (“as [the prior] was just before updating on observing being paused”), there are yet more unexpected implications. For example: at time 0, the set of non-paused futures is generally not made up of low-probability exceptional worlds; but one hour later, the agent will soon be paused, absent its action; and at the moment just before being paused, it is all but certain that it will be paused. So, in the prior at that final point, the set of worlds where it does not end up paused consists of exceptional ones, like simulation hypotheses where there’s an intervention at the last moment, which of course may themselves have higher or lower scores than remaining unpaused (through resistance). One could try to avoid this by specifying “some number of updates before”, e.g. “the prior as it was 10000 updates before updating on being paused”, but this would create new complications (e.g., if that agent learns information which decreases the score it assigns to the world, it will pause itself if no outweighing positive update occurs in the next 9999 updates).
If this utility function is otherwise a classical physical-thing-maximizer
“Specifically, {the score of the average world where the agent does not end up paused} could be higher or lower conditional on whether the agent tries to resist being paused, (as resisting would add more worlds to the set of non-paused worlds, whose scores are averaged to return the score of any paused world). The same applies to pausing itself in case it observes a negative or not-as-positive worldstate, to have such worldstates instead be assigned the score of the remaining unpaused ones. (This does lead to self-pausing in all possible futures except the highest-scoring one)”
One could try to avoid this by specifying “some number of updates before”, e.g. “the prior as it was 10000 updates before updating on being paused”, but this would create new complications (e.g., if that agent were to learn information which decreased the score assigned to the world, it would pause itself if no outweighing positive update occurred in the next 9999 updates).
Thank you for this answer - I really like it! I’m trying to wrap my head around the last 2 paragraphs.
2nd to last paragraph:
Ok, so you’re saying that it could choose to self-pause unless it was in the highest-scoring world? I’m conceptualizing a possible world as an (action,result) pair, from which it could calculate (action, E[result]) pairs and then would choose the action with the highest E[result], while being paused would also provide max(E[result]). So are you saying it would limit the possible actions it would take? That seems like it wouldn’t change anything since it is always going to just take the one best action anyway. Or that by setting a self-pausing policy it could alter E[result]? That sounds possible to me but I don’t have a concrete example of how that would work. Like, would it go play the lottery (assuming money gives +utility for some reason) and pre-commit to pausing if it doesn’t win? Or do you have something else in mind?
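One way to make the lottery question concrete is the sketch below. It assumes the simplified reading of UtilityFunction2 in which every paused world is scored at the probability-weighted average of the unpaused worlds under the agent's own policy. The world names, probabilities, and scores are made-up illustrative numbers, and `expected_score` is a hypothetical helper.

```python
def expected_score(worlds, pause_in):
    """Expected score when the agent self-pauses in the worlds named in `pause_in`.

    worlds: list of (name, probability, score) tuples.
    pause_in: set of world names in which the agent commits to pausing.
    """
    unpaused = [(p, s) for name, p, s in worlds if name not in pause_in]
    total_p = sum(p for p, _ in unpaused)
    if total_p == 0:
        raise ValueError("the policy must leave at least one world unpaused")
    # Probability-weighted average score over the unpaused worlds.
    avg_unpaused = sum(p * s for p, s in unpaused) / total_p
    # Paused worlds inherit that average; unpaused worlds keep their own score.
    return sum(
        p * (avg_unpaused if name in pause_in else s)
        for name, p, s in worlds
    )


# Illustrative numbers only: a one-in-a-million win worth 1000,
# and a loss that costs the ticket price of 1.
lottery = [("win", 1e-6, 1000.0), ("lose", 1.0 - 1e-6, -1.0)]

no_commitment = expected_score(lottery, pause_in=set())
commit_to_pause_on_loss = expected_score(lottery, pause_in={"lose"})
```

Under these assumptions the commitment policy scores close to 1000, while playing without it scores slightly below zero, so a self-pausing policy does change the expected result even though the set of available actions is the same.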
Last paragraph:
If just prior to being paused, there exists 1 scenario where it won’t be paused, then it could be an average, low, or high utility scenario. Obviously, average is fine. And if it’s really high, then it will get a lot of utility from being paused and certainly we’re not worried about it self-pausing when surrounded by agents trying to pause it. So, if it’s a really low utility scenario where it won’t end up being paused, then sure, it won’t get much utility being paused, but since it won’t get much utility if it doesn’t end up being paused, why should it have a preference? And, we could say—well, but it could fight back and then create a high-utility scenario—but then that would be the utility it would get if it doesn’t end up paused, so it would get the high utility paused and again be indifferent.
It sounds like understanding functional decision theory might help you understand the parts you’re confused about?
Yes, it would try to do whatever the highest-possible-score thing is, regardless of how unlikely it is
By setting a self-pausing policy at the earliest point in time it can, yes. (Though I’m not sure if I’m responding to what you actually meant, or to some other thing that my mind also thinks can match to these words, because the intended meaning isn’t super clear to me)
(To be clear, I’m conceptualizing the agent as having Bayesian uncertainty about what world it’s in, and this is what I meant when writing about “worlds in the agent’s prior”)
An agent (aside from edge cases where it is programmed to be inconsistent in this way) would not have priors about what it will do which mismatch its policy for choosing what to actually do. Any change to the latter logically corresponds to the agent having a different prior about itself, so an attempt to follow this logic would recurse infinitely (each time picking a new action in response to the prior’s change, which in turn logically changes the prior, and so on). This seems like a case of ‘subjunctive dependence’ to me, even though it’s a bit of an edge case of that, where the two logically corresponding things (what action an agent will choose, and the agent’s prior about what action it will choose) are both localized in the same agent. That is why functional decision theory seems relevant.
I think there must be some confusion here, but I’m having trouble understanding exactly what you mean.
Short answer: the scenario, or set of scenarios, where it is not paused, are dependent on what choice it makes, not locked in and independent of it; and it can choose what choice it makes, so it can pick whatever choice corresponds to the set of unpaused futures which score higher.
Longer original answer: When you write that there is one possible future in its prior where it does not get paused, and that this one future can be of lower than average, average, or higher than average utility, then because there is only one such future (by construction), this must mean lower/equal/higher in comparison to what the average score would be if the agent’s policy were to resist being paused in such a situation. If so, then in the case where, conditional on its inaction, the score of that one possible future where it does not become paused is lower than what the average score across possible unpaused futures would be conditional on its action, it would choose action.
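The same point can be put in symbols. This is a sketch, not something stated in the thread: the notation is assumed for illustration, and it ignores the timing question of exactly which prior is used. Write $\pi$ for the agent's policy, $P_\pi$ for its prior over worlds given that policy, $S(w)$ for the score the normal utility function gives world $w$, and $U$ for the set of worlds in which the agent does not end up paused.

```latex
U_2(w) =
\begin{cases}
  S(w) & \text{if } w \in U, \\
  \mathbb{E}_{P_\pi}[\,S \mid U\,] & \text{if } w \notin U,
\end{cases}
\qquad\Longrightarrow\qquad
\mathbb{E}_{P_\pi}[U_2]
  = P_\pi(U)\,\mathbb{E}_{P_\pi}[\,S \mid U\,]
  + \bigl(1 - P_\pi(U)\bigr)\,\mathbb{E}_{P_\pi}[\,S \mid U\,]
  = \mathbb{E}_{P_\pi}[\,S \mid U\,].

\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{P_\pi}[\,S \mid U\,]
```

Since both $U$ and $P_\pi$ depend on $\pi$, the agent is in effect choosing which set of unpaused futures gets averaged, rather than facing a fixed one.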
(meta: Hmm, I am starting to understand why logical/mathematical syntax may be often used for this sort of thing, I can see why the above paragraph could be hard to read in natural language)
Your idea seems to break when the AI is being unpaused: as it has not taken any beneficial actions, its utility would suddenly drop from “simulated” to “normal”, meaning that the AI will likely resist being woken up.
Also, it assumes there is a separate module for making predictions, which cannot be manipulated by the agent. This assumption is not very probable in my view.
If the AI is resisting being turned on, then it would have to be already on, by which point the updates (to the AI’s prior, and score assigned to it) would have already happened.
Isn’t this a blocker for any discussion of particular utility functions?