Do what we mean vs. do what we say

Written quickly after a CHAI meeting on the topic; I haven’t thought through it in depth.

If we write down an explicit utility function and have an AI optimize that, we expect that a superintelligent AI would end up doing something catastrophic, not because it misunderstands what humans want, but because it doesn’t care—it is trying to optimize the function that was written down. It is doing what we said instead of doing what we meant.

An approach like Inverse Reward Design instead says that we should take the human’s written-down utility function as an observation about the true reward function, and infer a distribution over true reward functions. This agent is “doing what we mean” instead of doing what we say.
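To make the contrast concrete, here is a minimal toy sketch in Python. It is not the Inverse Reward Design algorithm itself, only a simplified illustration in its spirit, and every name and number in it (the features, the options, `BETA`, the candidate weights, the `min_mass` threshold) is an assumption made up for this example. The written-down proxy reward is treated as evidence: the designer is assumed to pick proxies that produce good behavior in the training environment, and the agent then plans cautiously over all true rewards that remain plausible.

```python
import math
from itertools import product

# Feature order: (gold, dirt, lava). All values are illustrative assumptions.
TRAIN_OPTIONS = {"gold_path": (1, 0, 0), "dirt_path": (0, 1, 0)}
TEST_OPTIONS = {"shortcut_through_lava": (2, 0, 1), "long_way_around": (1, 0, 0)}

# Hypothesis space for the true reward weights. Lava never appears in
# training, so the designer's proxy carries no information about it.
CANDIDATES = [(g, 0, l) for g, l in product((-1, 1), (-10, 0, 10))]
PROXY = (1, 0, 0)  # the reward the designer actually wrote down
BETA = 5.0         # assumed designer rationality


def score(w, phi):
    """Linear reward: dot product of weights and features."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def best_option(w, options):
    """Name of the option that maximizes reward under weights w."""
    return max(options, key=lambda name: score(w, options[name]))


def likelihood(proxy, true_w):
    """P(designer writes proxy | true_w), normalized over candidate proxies.

    A proxy is likely if the behavior it induces in training scores well
    under the true reward.
    """
    def fit(p):
        chosen = best_option(p, TRAIN_OPTIONS)
        return math.exp(BETA * score(true_w, TRAIN_OPTIONS[chosen]))

    return fit(proxy) / sum(fit(p) for p in CANDIDATES)


def posterior(proxy):
    """Posterior over true reward weights, starting from a uniform prior."""
    unnorm = {w: likelihood(proxy, w) for w in CANDIDATES}
    total = sum(unnorm.values())
    return {w: v / total for w, v in unnorm.items()}


def do_what_we_say(options):
    """Optimize the written-down proxy literally."""
    return best_option(PROXY, options)


def do_what_we_mean(options, min_mass=0.01):
    """Maximize worst-case reward over the plausible true rewards."""
    post = posterior(PROXY)
    plausible = [w for w, p in post.items() if p >= min_mass]
    return max(
        options,
        key=lambda name: min(score(w, options[name]) for w in plausible),
    )
```

In this toy setup, `do_what_we_say(TEST_OPTIONS)` takes the lava shortcut because the proxy assigns lava a weight of zero, while `do_what_we_mean(TEST_OPTIONS)` goes the long way around because the posterior still contains hypotheses under which lava is very bad. The worst-case planning rule is one possible choice of risk aversion, not the only one.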

This suggests a potential definition: in a “do what we mean” system, the thing that is being optimized is a latent variable, whereas in a “do what we say” system, it is explicitly specified. Note that “latent” need not mean that you have a probability distribution over it; it just needs to be hidden information. For example, if I had to categorize iterated distillation and amplification, it would be as a “do what we mean” system where the thing being optimized is implicit in the policy of the human and is never made fully certain.

However, this doesn’t imply that we want to build a system that exclusively does what we mean. For example, with IRD, if the true reward function is not in the space of reward functions that we consider (perhaps because it depends on a feature that we didn’t include), the agent can produce arbitrarily bad outcomes (see the problem of fully updated deference). One idea would be to have a “do what we mean” core, which we expect will usually do good things, but also a “do what we say” subsystem that adds an extra layer of safety. For example, even if the “do what we mean” part is completely sure about the human utility function and knows we are making a mistake, the AI will still shut down if we ask it to because of the “do what we say” part. This seems to be the idea in MIRI’s version of corrigibility.
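The layered idea can be sketched structurally as follows. This is a hypothetical illustration only: `LayeredAgent`, `core_policy`, and the `SHUTDOWN` command are names invented for the example, and the sketch deliberately ignores the hard part, namely ensuring that the core has no incentive to prevent the command from being issued in the first place.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SHUTDOWN = "shutdown"  # hypothetical explicit command from the human


@dataclass
class LayeredAgent:
    # Hypothetical "do what we mean" planner: maps an observation to an action.
    core_policy: Callable[[str], str]

    def act(self, observation: str, command: Optional[str] = None) -> str:
        # "Do what we say" layer: an explicit shutdown command always wins,
        # and the core is not consulted about whether it agrees.
        if command == SHUTDOWN:
            return SHUTDOWN
        # Otherwise defer to the "do what we mean" core.
        return self.core_policy(observation)


# Usage: a core that is "completely sure" it should keep working.
agent = LayeredAgent(core_policy=lambda obs: "keep_working")
```

With this agent, `agent.act("anything")` defers to the core, while `agent.act("anything", command=SHUTDOWN)` returns the shutdown action regardless of what the core would have chosen.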

I’d be interested to see disagreements with the definition of “do what we mean” as optimizing a latent variable. I’d also be interested to hear how “corrigibility” and “alignment” relate to these concepts, if at all. For example, it seems like MIRI’s corrigibility is closer to “do what we say” while Paul’s corrigibility is closer to “do what we mean”.