Thanks for the example, which is really helpful. Am I correct in thinking the following? In general, the agent will do what is optimal according to its own prior, which won’t be the same as what is optimal according to the evaluator’s prior. So if the evaluator is a rational agent trying to optimize the world according to its own prior, it would not actually reward the agent as this scheme specifies; instead it would reward the agent in whatever way causes the agent to follow the policy that the evaluator thinks is best. In other words, does the evaluator have an incentive to deceive the agent about what its prior and/or utility function actually are?
This seems correct. The agent’s policy is optimal by definition with respect to its beliefs about the evaluator’s “policy” in providing rewards, but that evaluator-policy is not optimal with respect to the agent’s policy. In fact, I’m skeptical that, in a general CIRL game, there exists a policy pair for the agent and the evaluator/principal/human such that each is optimal with respect to true beliefs about the other’s policy. But I don’t think this is a big problem. For a human evaluator, I think they would be wise to report utility honestly, rather than assume they know something the AI doesn’t.
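To make the incentive concrete, here is a toy sketch in Python. It is not the scheme discussed above: it assumes a one-shot choice between two actions, and it collapses reward learning into the agent simply taking the reported utility at face value. All names and numbers (`AGENT_PRIOR`, `EVALUATOR_PRIOR`, `UTILITY`, `MISREPORTED`) are hypothetical, chosen only so that the two priors disagree about the best action.

```python
# Toy illustration only: two world states, two actions, made-up numbers.
STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]

# The evaluator's true utility, indexed as UTILITY[state][action] (assumed).
UTILITY = {
    "s0": {"a0": 1.0, "a1": 0.0},
    "s1": {"a0": 0.0, "a1": 1.0},
}

# The two parties disagree about which state is likely (assumed priors).
AGENT_PRIOR = {"s0": 0.7, "s1": 0.3}
EVALUATOR_PRIOR = {"s0": 0.2, "s1": 0.8}


def expected_utility(prior, utility, action):
    """Expected utility of an action under a prior over states."""
    return sum(prior[s] * utility[s][action] for s in STATES)


def best_action(prior, utility):
    """Action maximizing expected utility under the given prior."""
    return max(ACTIONS, key=lambda a: expected_utility(prior, utility, a))


# Honest report: the agent optimizes the true utility under its own prior.
agent_choice_honest = best_action(AGENT_PRIOR, UTILITY)

# What the evaluator would want, judged by the evaluator's own prior.
evaluator_preferred = best_action(EVALUATOR_PRIOR, UTILITY)

# A distorted report that makes a1 look good in both states (assumed).
MISREPORTED = {
    "s0": {"a0": 1.0, "a1": 1.0},
    "s1": {"a0": 0.0, "a1": 1.0},
}
agent_choice_deceived = best_action(AGENT_PRIOR, MISREPORTED)

# The evaluator's gain from misreporting, measured with its own prior
# and its true utility.
gain = (
    expected_utility(EVALUATOR_PRIOR, UTILITY, agent_choice_deceived)
    - expected_utility(EVALUATOR_PRIOR, UTILITY, agent_choice_honest)
)

print(agent_choice_honest, evaluator_preferred, agent_choice_deceived, gain)
```

With these numbers the honest report leads the agent to pick `a0`, the evaluator would prefer `a1`, and the distorted report moves the agent to `a1`. The gain is computed entirely under the evaluator's prior, so it says nothing about whether the evaluator's prior is actually the better one, which is the reason given above for a human evaluator to report honestly.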