So I still don’t understand the details; maybe my opinion will change if I sit down and look at it more carefully. But I’m suspicious of this being a clean incentive improvement that gets us what we want, because defining the evaluator is a fuzzy problem as I understand it, as is even informally agreeing on what counts as deception of a less capable evaluator. In general, it seems that if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours.
Tbc, I’m not saying I believe the claim of no deception, just that it now makes sense that this is an agent that has interesting behavior that we can analyze.
I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.
if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours
I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. A “wrong” formalism of “deception” can’t affect agent behavior, because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the danger here; the danger of using a wrong formalism in an argument is straightforward: the argument is garbage.
What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.
And if we can’t even agree informally on a definition of deception, I’m asking: how can we say a proposal has the property of no deception?
This is approximately where I am too btw
An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.
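This setup can be sketched as a simple interaction loop. This is a minimal illustration only, with hypothetical names throughout (`run_episode`, `Step`, the stub policy/environment/evaluator); the point is just that the evaluator is an opaque real-world process that sees the full interaction history before entering each reward, not something the agent's construction needs a formalism of.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    action: str
    observation: str
    reward: float

def run_episode(policy: Callable, env: Callable, evaluator: Callable,
                n_steps: int) -> List[Step]:
    """Run one episode in which the evaluator sees the whole interaction
    history (actions, observations, past rewards) and enters each reward."""
    history: List[Step] = []
    obs = "initial"
    for _ in range(n_steps):
        action = policy(history, obs)
        obs = env(action)
        # The evaluator is a real-world process (e.g. a human at a
        # terminal), modeled here only as an opaque callable over the
        # history -- no mathematical formulation of it is required.
        reward = evaluator(history, action, obs)
        history.append(Step(action, obs, reward))
    return history

# Stub components standing in for the agent, environment, and human evaluator.
policy = lambda history, obs: f"act{len(history)}"
env = lambda action: f"obs-after-{action}"
evaluator = lambda history, action, obs: 1.0 if "act" in action else 0.0

episode = run_episode(policy, env, evaluator, n_steps=3)
```

Here the agent only ever interacts with the rewards the evaluator enters; "deception" never appears anywhere in the construction, which is the point of the argument above.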