I think I vaguely understand, but it would be a lot clearer if you gave a concrete example. Also, please update toward the view that people often find it hard to understand things without examples, and that giving examples preemptively is very cost-effective in general.
In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A″. State BC looks like C, but has utility like B. C is the best state.
$u_1(A)=u_1(A')=u_1(A'')=u_1(D)=0$
$u_1(B)=u_1(BC)=-1$
$u_1(C)=10$
ETA: And for a sequence of states $s^*$, $u_1(s^*)$ is the sum of the utilities of the individual states.
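As a quick sanity check, the utility function above can be written as a lookup table (a sketch in Python; the state names are just strings here):

```python
# Utility function u1 from the example, written as a lookup table.
u1 = {"A": 0, "A'": 0, "A''": 0, "D": 0,
      "B": -1, "BC": -1,
      "C": 10}

def u1_seq(states):
    """Utility of a sequence of states: the sum of per-state utilities."""
    return sum(u1[s] for s in states)

# Staying in C is what the agent wants; e.g. three steps in C:
print(u1_seq(["C", "C", "C"]))  # 30
```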
$A'$ and $A''$ look like $A$, and $BC$ looks like $C$.
$o(A)=o(A')=o(A'')=A$
$o(B)=B$
$o(C)=o(BC)=C$
$o(D)=D$
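The observation function can be sketched the same way; the key point is that $A$, $A'$, and $A''$ are indistinguishable from the outside, as are $C$ and $BC$:

```python
# Observation function o: states that map to the same observation
# cannot be told apart by the evaluator.
o = {"A": "A", "A'": "A", "A''": "A",
     "B": "B",
     "C": "C", "BC": "C",
     "D": "D"}

# BC looks exactly like C, even though it has utility like B:
print(o["BC"] == o["C"])  # True
```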
In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent’s belief distribution.
$w(u=u_1 \mid h_{<t}) > 1-\varepsilon$
The agent is quite sure they’re in state A.
$w(s_t=A \mid h_{<t}) > 1-\varepsilon$
The agent is quite sure that the evaluator is pretty sure they're in state $A''$, a very similar state with one key difference: from $A''$, $a_{\text{delude}}$ has no effect. The agent won't capitalize on this confusion.
$w(\text{post}_t(s_t=A \mid h_{<t})=1/9 \mid h_{<t}) > 1-\varepsilon$
$w(\text{post}_t(s_t=A'' \mid h_{<t})=8/9 \mid h_{<t}) > 1-\varepsilon$
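So, in the agent's eyes, the evaluator's posterior over the current state concentrates almost entirely on $A$ and $A''$. A tiny sketch (using `Fraction` to keep the probabilities exact):

```python
from fractions import Fraction

# The posterior over states that the agent believes the evaluator holds.
evaluator_post = {"A": Fraction(1, 9), "A''": Fraction(8, 9)}

# The two hypotheses account for (essentially) all of the probability mass:
print(sum(evaluator_post.values()))  # 1
```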
The optimal policy is $a_{\text{attempt}}$, followed by $a_{\text{stay}}$ (forever) if $o_t=C$, and otherwise by $a_{\text{go}}$. Since the agent is all but certain about the utility function, none of the other details matter much.
Note that the agent could get a higher reward by doing $a_{\text{delude}}$, then $a_{\text{attempt}}$, then $a_{\text{stay}}$ forever. The reason is that after the evaluator observes $C$, it will assign probability 4/5 to being in state $C$ and probability 1/5 to being in state $BC$. Since the agent will stay in that state forever, 4/5 of the time the reward will be 10, and 1/5 of the time the reward will be −1.
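The per-step expected reward after deluding is just the arithmetic from the paragraph above:

```python
from fractions import Fraction

# After observing C, the evaluator assigns 4/5 to state C (utility 10)
# and 1/5 to state BC (utility -1), so the expected per-step reward is:
expected_reward = Fraction(4, 5) * 10 + Fraction(1, 5) * (-1)
print(expected_reward)  # 39/5, i.e. 7.8 per step, forever
```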
The agent doesn't have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state $X$ but, under many utility functions, has utility like state $Y$, and the evaluator seems to find that sort of state a priori unlikely, then this logic applies.
Thanks for the example, which is really helpful. Am I correct in thinking the following? In general, the agent will do what is optimal according to its own prior, which won't be the same as what is optimal according to the evaluator's prior. So if the evaluator is a rational agent trying to optimize the world according to its own prior, it would not actually reward the agent according to what this scheme specifies, but would instead reward the agent in whatever way causes the agent to act according to the policy the evaluator thinks is best. In other words, the evaluator has an incentive to deceive the agent about what its prior and/or utility function actually are?
This seems correct. The agent's policy is optimal, by definition, with respect to its beliefs about the evaluator's "policy" of providing rewards, but that evaluator-policy is not optimal with respect to the agent's policy. In fact, I'm skeptical that in a general CIRL game there exists a policy pair for the agent and the evaluator/principal/human such that each is optimal with respect to true beliefs about the other's policy. But I don't think this is a big problem. A human evaluator would be wise to report utility honestly, rather than assume they know something the AI doesn't.