Do what we mean vs. do what we say

Rohin Shah 30 Aug 2018 22:03 UTC

LW: 34 AF: 13

Written quickly after a CHAI meeting on the topic; I haven’t thought it through in depth.

If we write down an explicit utility function and have an AI optimize that, we expect that a superintelligent AI would end up doing something catastrophic, not because it misunderstands what humans want, but because it doesn’t care—it is trying to optimize the function that was written down. It is doing what we said instead of doing what we meant.

An approach like Inverse Reward Design instead says that we should take the human’s written-down utility function as an observation about the true reward function, and infer a distribution over true reward functions. This agent is “doing what we mean” instead of doing what we said.

This suggests a potential definition—in a “do what we mean” system, the thing that is being optimized is a latent variable, whereas in a “do what we say” system, it is explicitly specified. Note that “latent” need not mean that you have a probability distribution over it, it just needs to be hidden information. For example, if I had to categorize iterated distillation and amplification, it would be as a “do what we mean” system where the thing being optimized is implicit in the policy of the human and is never made fully certain.
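
The contrast between optimizing the written-down function and treating it as evidence about a latent one can be made concrete with a toy sketch. This is loosely in the spirit of Inverse Reward Design, not the algorithm from the paper: the outcomes, features (gold, lava), candidate weights, `BETA`, and the worst-case planning rule are all hypothetical choices made for illustration.

```python
import math

# Hypothetical toy setup: each outcome is a feature vector (gold, lava).
TRAIN_OUTCOMES = {"A": (1, 0), "B": (0, 0)}  # lava never appears in training
TEST_OUTCOMES = {"C": (2, 1), "D": (1, 0)}   # C crosses lava to get more gold

# Candidate reward weights, used both as possible proxies and as
# possible true rewards (an assumption that keeps the example finite).
CANDIDATES = [(1, 0), (1, -10), (1, 10), (-1, 0)]
PROXY = (1, 0)  # what the designer wrote down: "gold is good"
BETA = 1.0      # assumed designer rationality


def reward(w, phi):
    """Linear reward: dot product of weights and features."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def best_outcome(w, outcomes):
    """Name of the outcome that maximizes reward under weights w."""
    return max(outcomes, key=lambda name: reward(w, outcomes[name]))


def likelihood(proxy, true_w):
    """P(proxy | true_w): the designer tends to pick proxies whose
    optimal training behavior scores well under the true reward."""
    def score(p):
        chosen = TRAIN_OUTCOMES[best_outcome(p, TRAIN_OUTCOMES)]
        return math.exp(BETA * reward(true_w, chosen))
    return score(proxy) / sum(score(p) for p in CANDIDATES)


def posterior(proxy):
    """Posterior over true rewards given the proxy, uniform prior."""
    unnorm = [likelihood(proxy, w) for w in CANDIDATES]
    total = sum(unnorm)
    return {w: u / total for w, u in zip(CANDIDATES, unnorm)}


def do_what_we_say(proxy):
    """Optimize the written-down reward literally."""
    return best_outcome(proxy, TEST_OUTCOMES)


def do_what_we_mean(proxy, min_prob=0.05):
    """Treat the proxy as an observation, then plan for the worst
    case over every true reward that remains plausible."""
    post = posterior(proxy)
    plausible = [w for w, p in post.items() if p >= min_prob]
    return max(
        TEST_OUTCOMES,
        key=lambda name: min(reward(w, TEST_OUTCOMES[name]) for w in plausible),
    )
```

Because lava never shows up in training, the proxy carries no information about the lava weight, so the posterior stays spread over it. The literal optimizer picks `C`, while the worst-case planner over the posterior picks `D`.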

However, this doesn’t imply that we want to build a system that exclusively does what we mean. For example, with IRD, if the true reward function is not in the space of reward functions that we consider (perhaps because it depends on a feature that we didn’t have), you can get arbitrarily bad outcomes (see the problem of fully updated deference). One idea would be to have a “do what we mean” core, which we expect will usually do good things, but have a “do what we say” subsystem that adds an extra layer of safety. For example, even if the “do what we mean” part is completely sure about the human utility function and knows we are making a mistake, the AI will still shut down if we ask it to because of the “do what we say” part. This seems to be the idea in MIRI’s version of corrigibility.
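
The layered design described above can be sketched in a few lines. Everything here is hypothetical: `LayeredAgent`, `mean_policy`, and the `SHUTDOWN` command are illustrative names, and the sketch only shows the override at the moment of action. It says nothing about whether the core would try to prevent the command from being issued.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SHUTDOWN = "shutdown"  # hypothetical explicit command from the human


@dataclass
class LayeredAgent:
    # "Do what we mean" core: maps an observation to an action based on
    # whatever the agent has inferred about the human's preferences.
    mean_policy: Callable[[str], str]

    def act(self, observation: str, human_command: Optional[str] = None) -> str:
        # "Do what we say" layer: an explicit shutdown command wins,
        # no matter how confident the core is in its own inference.
        if human_command == SHUTDOWN:
            return SHUTDOWN
        return self.mean_policy(observation)


# Usage: a core that is "completely sure" it should keep working.
agent = LayeredAgent(mean_policy=lambda obs: "keep_working")
```

With this wrapper, `agent.act("obs")` defers to the core, while `agent.act("obs", human_command=SHUTDOWN)` shuts down regardless of what the core would have chosen.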

I’d be interested to see disagreements with the definition of “do what we mean” as optimizing a latent variable. I’d also be interested to hear how “corrigibility” and “alignment” relate to these concepts, if at all. For example, it seems like MIRI’s corrigibility is closer to “do what we say” while Paul’s corrigibility is closer to “do what we mean”.

What links here?

Alignment Newsletter #22 by Rohin Shah (3 Sep 2018 16:10 UTC; 18 points)

Corrigibility AI

TurnTrout 21 Aug 2020 12:57 UTC
LW: 6 AF: 4
I liked this post when it came out, and I like it even more now. This also brings to mind Paul’s more recent Inaccessible Information.
- Rohin Shah 22 Aug 2020 2:43 UTC
  LW: 4 AF: 3
  Thanks! I like it less now, but I suppose that’s to be expected (I expect I publish posts when I’m most confident in the ideas in them).
  I do think it’s aged better than my other (non-public) writing at the time, so at least past-me was calibrated on which of my thoughts were good, at least according to current-me?
  The main way in which my thinking differs is that I’m less optimistic about defining things in terms of what “optimizing” is happening—it seems like such a definition would be too vague / fuzzy / filled with edge cases to be useful for AI alignment. I do think that the definition could be used to construct formal models that can be analyzed (as had already been done in assistance games / CIRL or the off switch game).
  The definition is also flawed; it clearly can’t be just about optimizing a latent variable, since that’s true of any POMDP. What the agent ends up optimizing for depends entirely on how the latent variable connects to the agent’s observations; this clearly isn’t enough for do what we mean. I think the better version is Stuart’s three principles in Human Compatible (summary).
Dagon 31 Aug 2018 18:24 UTC
4 points
There’s another layer of uncertainty here. For sufficiently general and long-term preferences, it’s not clear that “do what we mean” is sufficient either. None of us knows what we want, so what we mean isn’t even very good evidence of what we want.
“do what I would want to mean” is closer, but figuring out the counterfactuals for “would” that preserve “I” is not easy.
- Charlie Steiner 31 Aug 2018 21:04 UTC
  1 point
  Agreed. Humans don’t really have utility functions. We might try to get around this by having the AI learn how humans would like to be interpreted as having a utility function, and how they would like that to be interpreted, and so on in an infinite tower of reflection, but that doesn’t seem very practical or desirable.
  I think there was an old Wei Dai post on “artificial philosophy” that was about this problem? The idea is we want the AI to collapse this infinite tower by learning the philosophical considerations that generate it, then use that knowledge to learn its preferences from humans.
  - Rohin Shah 1 Sep 2018 18:31 UTC
    1 point
    Just don’t ask your AI system to optimize for general and long-term preferences without a way for you to say “actually, stop, I changed my mind”.
    Like, if someone tells me that they want me to protect nature, I know that in effect they mean “Take actions to protect nature right now, but don’t do anything super drastic that would conflict with other things I care about, and if I change my mind in the future, defer to that change, etc.” I think a good “do what you mean” system would capture all of that. This isn’t implied by my definition of course, but I think that a system where the specification is latent and uncertain could have this property.
    - Nebu 10 Sep 2018 5:43 UTC
      1 point
      Just don’t ask your AI system to optimize for general and long-term preferences without a way for you to say “actually, stop, I changed my mind”.
      I believe that reduces to “solve the Friendly AI problem”.
      - Rohin Shah 10 Sep 2018 23:44 UTC
        3 points
        (Pedantic note: the right way to say that is “the Friendly AI problem reduces to that”.)
        I’m replying to the quote from the first comment:
        For sufficiently general and long-term preferences, it’s not clear that “do what we mean” is sufficient either. None of us knows what we want, so what we mean isn’t even very good evidence of what we want.
        What I’m trying to say is that once you have a “do what we mean” system, then don’t explicitly ask your AI system to optimize for general and long-term preferences without a way for you to say “actually, stop, I changed my mind”.
        I claim that the hard part there is in building a “do what we mean” system, not in the “don’t explicitly ask for a bad thing” part.
TurnTrout 31 Aug 2018 3:57 UTC
2 points
Perhaps “do what we say” is more like “know when the outside view says you’ve incorrectly converged to the wrong value function, so we’re probably right and you should listen to us”.
- Charlie Steiner 31 Aug 2018 20:44 UTC
  3 points
  It’s somewhat more subtle than that. The ideal (and maybe impossible) corrigible AI should protect us even if we accidentally give the AI the wrong process for figuring out what to value. It should protect us even if the AI becomes omniscient.
  If the AI knows vastly more than we do, there’s no sense in which we are providing extra evidence or an information-carrying “outside view”. We are instead just registering a sort of complaint and hoping we’ve programmed the AI to listen.
  I’m still not convinced that such a sort of corrigibility is in any way distinct from some extra complications in the process we give the AI for figuring out what to value.
  - TurnTrout 1 Sep 2018 2:44 UTC
    3 points
    The outside view I had in mind wasn’t with respect to its knowledge, but to empirical data on how often its exact value-learning algorithm converges to the correct set of preferences for agents-like-us. That feels different.
- Rohin Shah 1 Sep 2018 18:34 UTC
  1 point
  I assume you’re talking about the particular “do what we say” subsystem described in the second-to-last paragraph? If so, that seems plausibly right.
Pattern 30 Aug 2018 23:12 UTC
2 points
What we say: “Follow the recipe”
What we mean: “Make tasty, edible food with the ingredients provided, after verifying they are what they’re supposed to be, etc.”
I think this is related, although it’s about getting the AI to ask humans questions about what to value.
avturchin 1 Sep 2018 9:20 UTC
1 point
This approach ignores choice. Having a utility function is not enough to make a choice, and what I say is an act of making a choice.
For example, I have a hidden value function (apples = 0.5 and oranges = 0.5). I ask my home robot to bring me an apple. In that moment I made a choice between two equally preferable options.
But my home robot would ignore my choice and bring me half an apple and half an orange, because this was my value function before making the choice.
In that case, I will not be satisfied, as I will feel that the robot is ignoring the moral effort I put into making a choice, and I value my choices. Also, making the choice updates my preferences, so the robot has to decide which of my utility functions to use: the one before the choice or the one after.
- Dagon 1 Sep 2018 15:16 UTC
  2 points
  (I don’t think humans have consistent utility functions; we’re broken that way. If we did...)
  The robot should know your utility function(s) well enough to know that you’d choose apple this time, and orange at some future time.