Wei Dai comments on [missing post]

Wei Dai 11 May 2019 4:10 UTC
LW: 2 AF: 1
Thanks for the example, which is really helpful. Am I correct in thinking the following? In general, the agent will do what is optimal according to its own prior, which won’t be the same as what is optimal according to the evaluator’s prior. So if the evaluator is a rational agent trying to optimize the world according to its own prior, it would not actually reward the agent as this scheme specifies, but would instead reward the agent in whatever way causes the agent to follow the policy that the evaluator thinks is best. In other words, the evaluator has an incentive to deceive the agent about what its prior and/or utility function actually are?
- michaelcohen 11 May 2019 5:59 UTC
  LW: 1 AF: 1
  This seems correct. The agent’s policy is optimal by definition with respect to its beliefs about the evaluator’s “policy” in providing rewards, but that evaluator-policy is not optimal with respect to the agent’s policy. In fact, I’m skeptical that in a general CIRL game, there exists a policy pair for the agent and the evaluator/principal/human such that each is optimal with respect to true beliefs about the other’s policy. But I don’t think this is a big problem. For a human evaluator, I think they would be wise to report utility honestly, rather than assume they know something the AI doesn’t.
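
A toy numerical sketch of the incentive discussed in this exchange may help make it concrete. Everything in it is an illustrative assumption rather than part of the scheme under discussion: a one-shot setting with two world states and two actions, an evaluator who reports a whole utility table instead of giving rewards over time, and made-up priors and payoffs. All names (`TRUE_UTILITY`, `AGENT_PRIOR`, `MISREPORTED_UTILITY`, and so on) are hypothetical.

```python
# Toy model (all numbers and names are illustrative assumptions).
STATES = ("A", "B")
ACTIONS = ("a0", "a1")

# The evaluator's actual utility for each action in each world state.
TRUE_UTILITY = {
    "a0": {"A": 1.0, "B": 0.0},
    "a1": {"A": 0.0, "B": 1.0},
}

# The two parties disagree about which state is likely.
AGENT_PRIOR = {"A": 0.8, "B": 0.2}
EVALUATOR_PRIOR = {"A": 0.3, "B": 0.7}

# A dishonest report that makes a1 look best in every state.
MISREPORTED_UTILITY = {
    "a0": {"A": 0.0, "B": 0.0},
    "a1": {"A": 1.0, "B": 1.0},
}


def expected_utility(utility, prior, action):
    """Expected value of `action` under `prior` for a given utility table."""
    return sum(prior[s] * utility[action][s] for s in STATES)


def agent_best_action(reported_utility):
    """The agent maximizes the *reported* utility under its *own* prior."""
    return max(
        ACTIONS,
        key=lambda a: expected_utility(reported_utility, AGENT_PRIOR, a),
    )


def evaluator_value(reported_utility):
    """How good the agent's resulting action looks to the evaluator,
    scored with the true utility under the evaluator's own prior."""
    action = agent_best_action(reported_utility)
    return expected_utility(TRUE_UTILITY, EVALUATOR_PRIOR, action)


honest_value = evaluator_value(TRUE_UTILITY)
deceptive_value = evaluator_value(MISREPORTED_UTILITY)

print("action under honest report:   ", agent_best_action(TRUE_UTILITY))
print("action under misreport:       ", agent_best_action(MISREPORTED_UTILITY))
print("evaluator's value, honest:    ", honest_value)
print("evaluator's value, misreport: ", deceptive_value)
```

Under these assumed numbers, the misreport looks better to the evaluator only when scored by the evaluator's own prior. If the agent's prior is in fact the better calibrated one, honest reporting does better, which is the point made above about a human evaluator not assuming they know something the AI doesn't.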