Updated Deference is not a strong argument against the utility uncertainty approach to alignment

Thesis: The problem of fully updated deference is not a strong argument against the viability of the assistance games / utility uncertainty approach to AI (outer) alignment.

Background: A proposed high-level approach to AI alignment is to have the AI maintain a probability distribution over possible human utility functions instead of optimizing for any particular fixed utility function. Variants of this approach were advocated by Stuart Russell in Human Compatible and by Hadfield-Menell et al. in the CIRL paper. Adding utility uncertainty intuitively seems to provide a number of safety benefits relative to having a fixed objective, including:

Utility uncertainty gives the AI an incentive to adjust in response to a human operator’s corrective actions.
Utility uncertainty weakens the AI’s incentive to harm its human operators, since this might result in a permanent loss of utility-relevant information.
Utility uncertainty incentivizes the AI to avoid irreversible changes to the state of the world, since those might lead to permanently low utility.

Despite the high profile and intuitive appeal of utility uncertainty, almost none of the alignment researchers I know consider it a promising approach to AI alignment. The most commonly cited reason seems to be the problem of fully updated deference (e.g. Richard Ngo’s alignment research exercises point to this as the reason why CIRL doesn’t solve the alignment problem).

In this post I will argue that fully updated deference should not be seen as a strong argument against utility uncertainty as an approach to AI alignment. This is not meant as an argument in favor of the uncertainty approach; it may have other irresolvable difficulties, which I discuss briefly in the conclusion.

Outline: The Arbital post that seems to be the canonical reference for updated deference contains many heuristic arguments and one concrete, worked-out example in the section Moral uncertainty and its relation to corrigibility. I will mostly engage with the example, and argue that

It conflates the problem of updated deference with the independent problem of prior mis-specification.
If we remove prior mis-specification, there is no problem in the limit of increasing AI capability.

The Problem of Updated Deference

The example in the post has an AI that is uncertain between three utility functions $U_{1}$ , $U_{2}$ , $U_{3}$ whereas the human’s true utility function is $V$ . The AI believes that the utility that will be attained in each of the three possible worlds is

$u_{i}$ with AI assistance
$v_{i}$ if the human optimizes $V$ without the AI’s assistance (e.g. because the humans shut the AI down)

If the AI is much more powerful than humans, the argument goes, then $u_{i} \gg v_{i}$ in any of the three worlds, so the AI will not let itself be shut down. The uncertainty doesn’t help because the AI can choose to keep gathering information until it has fully updated. Since it’s more powerful than the humans, it can gather that information more efficiently when it’s not shut down, and therefore ignores the shutdown signal.
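To make the structure of the argument concrete, here is a minimal sketch in Python. The numbers and the function names (`expected_utility`, `resists_shutdown`) are illustrative assumptions of mine, not taken from the Arbital post. The point it shows is that when $u_{i} > v_{i}$ in every candidate world, the comparison comes out the same way under every posterior, so no amount of further updating changes the decision.

```python
from typing import Sequence


def expected_utility(posterior: Sequence[float], utilities: Sequence[float]) -> float:
    """Expected utility under a posterior over the candidate utility functions."""
    return sum(p * u for p, u in zip(posterior, utilities))


def resists_shutdown(
    posterior: Sequence[float],
    u_with_ai: Sequence[float],
    v_without_ai: Sequence[float],
) -> bool:
    """True if the AI expects more utility from continuing than from being shut down."""
    return expected_utility(posterior, u_with_ai) > expected_utility(posterior, v_without_ai)


# Hypothetical payoffs: u_i with AI assistance, v_i without it, for worlds U_1..U_3.
u_with_ai = [100.0, 90.0, 80.0]
v_without_ai = [10.0, 9.0, 8.0]

# A few posteriors the AI might hold, from maximally uncertain to fully updated.
posteriors = [
    [1 / 3, 1 / 3, 1 / 3],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

decisions = [resists_shutdown(p, u_with_ai, v_without_ai) for p in posteriors]
print(decisions)
```

Under these assumed payoffs the AI resists shutdown for every posterior in the list, which is the behavior the updated deference argument predicts.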

Factoring out prior mis-specification

The original example has the AI system assign probability 0 to the true human utility function $V$ , presumably because its prior probability was 0. I think any advocate of the utility uncertainty approach would agree that assigning a nonzero prior to the true human utility function is critical for the approach to work. Describing such a prior abstractly is easy (just take the Solomonoff prior over programs). Implementing a real CIRL agent that reasons with such a prior could be intractably difficult, but this is clearly a separate problem from “fully updated deference”. So from now on we’ll assume that the human $V$ is one of $U_{1}$ , $U_{2}$ , $U_{3}$ , and see if we still have a problem.
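The reason a zero prior is fatal can be seen in a toy Bayesian update. The sketch below uses a finite hypothesis set and made-up likelihoods (both are my assumptions, and the finite set is only a stand-in for something like a Solomonoff prior). A hypothesis with prior probability 0 stays at 0 no matter how strongly the evidence favors it, while a small nonzero prior is enough for the posterior to concentrate on it.

```python
from typing import List, Sequence


def bayes_update(prior: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """One Bayesian update: posterior is proportional to prior times likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the prior")
    return [x / total for x in unnormalized]


def update_n(prior: Sequence[float], likelihoods: Sequence[float], n: int) -> List[float]:
    """Apply the same update n times (n independent, identical observations)."""
    posterior = list(prior)
    for _ in range(n):
        posterior = bayes_update(posterior, likelihoods)
    return posterior


# Hypotheses, in order: U_1, U_2, U_3, V (the true human utility function).
prior_misspecified = [1 / 3, 1 / 3, 1 / 3, 0.0]  # V is ruled out a priori
prior_with_support = [0.3, 0.3, 0.3, 0.1]        # V gets a small nonzero prior

# Hypothetical likelihood of each observed human action under each hypothesis.
# The evidence strongly favors V.
likelihoods = [0.1, 0.1, 0.1, 0.9]

post_misspecified = update_n(prior_misspecified, likelihoods, 20)
post_with_support = update_n(prior_with_support, likelihoods, 20)

print(post_misspecified[3])  # probability of V under the mis-specified prior
print(post_with_support[3])  # probability of V when the prior gives it support
```

This is only meant to separate the two failure modes: the first agent fails because of its prior, not because of anything specific to deference.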

Dissolving Updated Deference

The AI refuses to be shut down because it believes that, in each of the three possible worlds, it can achieve higher utility by continuing to operate than the humans would achieve after shutting it down. The key question to consider is: is the AI right that shutdown is bad for the true human utility function?

1. Insofar as “shut down is bad” is a mistaken belief, we expect the problem of updated deference to dissolve as AI capabilities grow, since more capable AIs will make fewer mistakes. Note that in the original example, the plausibility of the AI’s belief relies on the AI system being better at optimizing than unassisted humans, but “unassisted humans” is not likely to be the real-world counterfactual. If the humans were able to deploy an AI system this powerful, they could also deploy another AI system equally powerful and (plausibly) more aligned. In other words, $u_{i} \leq v_{i}$ with very high probability, contrary to assumption. So the AI will shut down unless it expects the humans to do something irreversibly bad after shutting it down, which brings us to:
2. Insofar as “shut down is bad” is a correct belief, there is no problem: even a fully aligned superintelligence should be expected to resist shutdown if it believed this would lead to a permanent and irreversible loss of utility to humans. This could happen e.g. if the AI was confident that the humans would deploy a catastrophically unaligned AI on the next iteration, or if it believed humans would permanently curtail their technological potential. In other words, in this very unusual scenario where humans are about to make a catastrophic mistake, hard corrigibility and alignment are at odds. I don’t think this scenario will happen, but if it does I think it’s clear we should choose alignment over corrigibility.
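The two cases can be compared side by side with a small sketch. All payoffs and names here are hypothetical values I chose for illustration, and having the AI defer on ties is also my assumption. What changes between the scenarios is only the AI's estimate of $v_{i}$, the utility attained after shutdown.

```python
from typing import Sequence


def expected_utility(posterior: Sequence[float], utilities: Sequence[float]) -> float:
    """Expected utility under a posterior over the candidate utility functions."""
    return sum(p * u for p, u in zip(posterior, utilities))


def accepts_shutdown(
    posterior: Sequence[float],
    u_continue: Sequence[float],
    v_after_shutdown: Sequence[float],
) -> bool:
    """True if shutdown is at least as good as continuing (ties defer to the humans)."""
    return expected_utility(posterior, v_after_shutdown) >= expected_utility(posterior, u_continue)


posterior = [0.5, 0.3, 0.2]       # belief over U_1, U_2, U_3 (V is one of them)
u_continue = [100.0, 90.0, 80.0]  # utility if this AI keeps operating

# Three hypothetical models of what happens after shutdown.
v_unassisted = [10.0, 9.0, 8.0]   # humans carry on with no AI at all
v_redeploy = [100.0, 95.0, 90.0]  # humans deploy an equally capable, better-aligned AI
v_catastrophe = [0.0, 0.0, 0.0]   # humans do something irreversibly bad

for name, v in [
    ("unassisted humans", v_unassisted),
    ("redeploy a better AI", v_redeploy),
    ("irreversible catastrophe", v_catastrophe),
]:
    print(name, accepts_shutdown(posterior, u_continue, v))
```

With these numbers the AI accepts shutdown only in the redeployment scenario, where $u_{i} \leq v_{i}$ in every world. It resists in the other two, and resisting in the catastrophe scenario is the behavior that point 2 argues we would actually want.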

A counter-argument to point 1 would be that it is very possible for an AI system to be extremely capable but still have mistaken beliefs. This could be because there is an error in its source code, but this objection applies to almost any alignment approach. A more serious objection to the utility uncertainty agenda is that truth-seeking is anti-competitive and we will by default select models more for their ability to take impactful actions than for their ability to have true beliefs about the world. In fact, the core argument of Human Compatible is that we should work on differentially improving our models’ ability to reason about uncertainty relative to their ability to optimize over actions. It may be that this is a good strategy in theory but too hard in practice (it imposes too much of an alignment tax), but that argument should be made explicitly, and it has little to do with updated deference.

Conclusion

So what does this tell us about whether utility uncertainty is a promising approach to AI alignment? Not much. I do think the “problem of updated deference” is better understood as a combination of prior mis-specification and competitiveness penalties from maintaining well-calibrated true beliefs. But I basically agree with Rohin that utility uncertainty smuggles all the complexity of alignment into

1. creating a “reasonable” prior over reward functions
2. creating a “reasonable” human model
3. finding a computationally tractable way to do Bayesian inference with (1) and (2)

and it’s not obvious whether this is actual conceptual progress, or a wrong-way reduction of a hard problem into an impossible problem.