Ok, I finally identified an incentive for deception. I think it was difficult for me to find because it’s not really about deceiving the evaluator.
Here’s a hypothesis that observations will never refute: the utility which the evaluator assigns to a state is equal to the reward that a human would provide if a human (instead of the evaluator) controlled the provision of reward. Under this hypothesis, maximizing evaluator-utility is identical to creating observations which will convince a human to provide high reward (a task which entails deception when done optimally). In a sense, the AI doesn’t think it’s deceiving the evaluator; it thinks the evaluator fully understands what’s going on and likes seeing things that would confuse a human into providing high reward, as if the evaluator is “in on the joke”. One of my takeaways here is that some of the conceptual framing I did got in the way of identifying a failure mode.
I would be interested in seeing an example that illustrates this failure mode.
I can’t really do this toy-example-style, because the key feature is that the AI has a model of a deceived agent, and I can’t see how to spin up such a thing in an MDP with a few dozen states.
Luckily, most of the machinery of the setup isn’t needed to illustrate this. Abstracting away some of the details, the agent is learning a function from strings of states to utility, which it observes in a roundabout way. I don’t have a mathematical formulation of a function mapping state sequences to the real numbers that can be described by the phrase “the value of the reward that a certain human would provide upon observing the observations produced by the given sequence of states”, but suffice it to say that this function exists. (Really we’re dealing with distributions/stochastic functions, but that doesn’t really change the fundamentals; it just makes it more cumbersome). While I can’t give that function in simple mathematical form, hopefully it’s a legible enough mathematical object.
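To gesture at it a bit more formally (the symbols $O$, $R^H$, and $U^\dagger$ here are just shorthand for this comment, not notation from the setup): let $O(s_{1:t})$ be the observation sequence produced by the state sequence $s_{1:t}$, and let $R^H(o_{1:t})$ be the reward that the human in question would provide upon seeing the observations $o_{1:t}$. Then the function I mean is

$$U^\dagger(s_{1:t}) := R^H\bigl(O(s_{1:t})\bigr),$$

and the hypothesis that observations never refute is simply “the evaluator’s true utility function is $U^\dagger$”.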
If the evaluator has this utility function, she will always provide reward equal to the utility of the state sequence, because even if she is uncertain about the underlying states, this utility function depends only on the observations, which she has access to. (Again, the stochastic case is more complicated, but the conclusion is the same.) And indeed, if a human is playing the role of evaluator when this program is run, the rewards will match the function in question, by definition. Therefore, no observed reward will ever contradict the belief that this function is the true utility function. Technically, the infimum over time of the posterior on this utility function is strictly positive with probability 1.
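Spelling that last claim out in the deterministic case (a sketch; the prior $w(\cdot)$ over candidate utility functions is again just notation for this comment): under the hypothesis $U^\dagger$, the predicted reward at every step equals the observed reward exactly, so the likelihood of the reward history under $U^\dagger$ is always 1. Hence for every $t$,

$$w\bigl(U^\dagger \mid r_{1:t}\bigr) \;=\; \frac{w(U^\dagger)\cdot 1}{\sum_{U} w(U)\,P(r_{1:t}\mid U)} \;\ge\; w(U^\dagger) \;>\; 0,$$

because the denominator is at most $\sum_U w(U) = 1$. The posterior on $U^\dagger$ never falls below its prior, which is where the strictly positive infimum comes from. (The stochastic case needs a likelihood-ratio version of this argument rather than exact equality, hence the extra bookkeeping mentioned above.)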
Sorry, this isn’t really any more illustrative an example, but hopefully it’s a clearer explanation.