Oh, I see. The reason my argument is wrong is that, while for a specific (s0, prior, u) the optimal policy is independent of the evaluator, you don’t get to choose a separate policy for each (s0, prior, u): you have to use the evaluator to distinguish which case you are in, and then specialize your policy to that case.
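(A rough way to write out that correction, with notation mine rather than from the original discussion: for a fixed case, the optimal policy is

$$\pi^*_{(s_0,\,\text{prior},\,u)} \;=\; \arg\max_{\pi}\; \mathbb{E}\big[\,u \mid s_0, \pi\,\big],$$

which indeed makes no reference to the evaluator. But the deployed agent is a single policy $\pi(h)$ over histories $h$, shared across all $(s_0, \text{prior}, u)$; the only part of $h$ that distinguishes the cases is the evaluator’s feedback, so matching each $\pi^*_{(s_0,\,\text{prior},\,u)}$ forces $\pi$ to depend on the evaluator after all.)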
To me it looks closer to the Value Learning Agent in that paper, and maybe it can be considered an implementation / specific instance of that?
I think I intuitively agree, but I also haven’t checked it formally. The point about no-deception seems similar to the point about observation-utility maximizers not wanting to wirehead: this agent also ends up learning which utility function is the right one, and in that sense is like the Value Learning agent.
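(For reference, my paraphrase of the Value Learning agent from Dewey’s “Learning What to Value” — not the paper’s exact notation — is that it picks actions by averaging over candidate utility functions weighted by the evidence so far:

$$a^* \;=\; \arg\max_{a}\; \sum_{u}\; P(u \mid h)\;\mathbb{E}\big[\,u \mid h, a\,\big],$$

so tampering with the feedback channel can shift the posterior $P(u \mid h)$ but doesn’t change which $u$ is actually correct, which is where the no-wireheading intuition comes from.)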
I still don’t understand the details, so maybe my opinion will change if I sit down and look at it more carefully. But I’m suspicious of this being a clean incentive improvement that gets us what we want: defining the evaluator is a fuzzy problem as I understand it, as is even informally agreeing on what counts as deception of a less capable evaluator. In general, it seems that if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours.
To be clear, I’m not saying I believe the claim of no deception, just that it now makes sense that this is an agent with interesting behavior that we can analyze.

This is approximately where I am too, btw.
I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.
“if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours”
I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. And a “wrong” formalism of “deception” can’t affect agent behavior, because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the danger here; the danger of using the wrong formalism in an argument is straightforward: the argument is garbage.
What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.
And if we can’t agree informally on a definition of deception, I’m asking: how can we say a proposal has that property?
An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.
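In case it’s useful, here’s a minimal sketch of that loop in Python. All names (`agent`, `env`, `evaluator`, and their methods) are hypothetical interfaces I made up to pin down the protocol, not from any particular library:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str        # what the agent did
    observation: str   # what the environment returned
    reward: float      # what the evaluator entered after seeing everything above

def run_episode(agent, env, evaluator, n_steps: int) -> list[Step]:
    """Evaluator-in-the-loop protocol: at each step the evaluator sees the
    whole interaction history (actions, observations, past rewards) and
    enters the next reward by hand."""
    history: list[Step] = []
    for _ in range(n_steps):
        action = agent.act(history)            # agent conditions on the full history
        observation = env.step(action)         # environment responds
        reward = evaluator.rate(history, action, observation)  # human enters a reward
        history.append(Step(action, observation, reward))
    return history
```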