I would be interested in seeing an example that illustrates this failure mode.
I can’t really do this toy-example style, because the key feature is that the AI has a model of a deceived agent, and I can’t see how to spin up such a thing in an MDP with a few dozen states.
Luckily, most of the machinery of the setup isn’t needed to illustrate this. Abstracting away some of the details, the agent is learning a function from strings of states to utilities, which it observes in a roundabout way. I can’t write down, in simple mathematical form, the function mapping state sequences to real numbers that is described by the phrase “the value of the reward that a certain human would provide upon observing the observations produced by the given sequence of states”, but suffice it to say that this function exists. (Really we’re dealing with distributions/stochastic functions, but that doesn’t change the fundamentals; it just makes things more cumbersome.) So while I can’t give that function explicitly, hopefully it’s a legible enough mathematical object.
If the evaluator has this utility function, she will always provide reward equal to the utility of the state sequence, because even if she is uncertain about the underlying states, this utility function depends only on the observations, which she has access to. (Again, the stochastic case is more complicated, but the conclusion is the same.) And indeed, if a human is playing the role of evaluator when this program is run, the rewards will match the function in question, by definition. Therefore, no observed reward will ever contradict the belief that this function is the true utility function. Technically, the infimum of the posterior on this utility function is strictly positive with probability 1.
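To make the posterior claim concrete, here is a toy sketch (not the actual setup, which I said above I can't reduce to a toy example): a tiny Bayesian learner with two candidate utility functions, where one hypothesis, `h_human`, is by construction "the reward the human provides upon seeing the observations". The hypothesis names and the observation stream are made up for illustration. Since the human generates the rewards, that hypothesis is never contradicted by an observed reward, so its posterior mass never falls to zero:

```python
def h_human(obs):
    """Stand-in for 'the reward the human would give for these observations'."""
    return obs[-1] % 2

def h_rival(obs):
    """A rival hypothesis that eventually disagrees with the human."""
    return sum(obs) % 2

posterior = {"human": 0.5, "rival": 0.5}
obs = []
for o in [3, 1, 4, 1, 5, 9, 2, 6]:        # fixed observation stream
    obs.append(o)
    reward = h_human(obs)                  # the human supplies the reward
    for name, h in {"human": h_human, "rival": h_rival}.items():
        if h(obs) != reward:               # a contradiction rules it out
            posterior[name] = 0.0
    z = sum(posterior.values())
    posterior = {k: v / z for k, v in posterior.items()}

# h_human never disagrees with the rewards it itself generates, so its
# posterior can only grow as rivals are eliminated.
print(posterior["human"])   # 1.0
```

In the deterministic case the "likelihood" is just agree/disagree, which is why the update is a hard zeroing-out; in the stochastic case the same thing happens gradually, but the hypothesis that matches the reward-generating process keeps positive mass for the same reason.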
Sorry, this isn’t really any more illustrative an example, but hopefully it’s a clearer explanation.