So I still don’t understand the details; maybe my opinion will change if I sit down and look at it more carefully. But I’m suspicious of this being a clean incentive improvement that gets us what we want, because defining the evaluator is a fuzzy problem as I understand it, as is even informally agreeing on what counts as deception of a less capable evaluator. In general, it seems that if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours.
Tbc, I’m not saying I believe the claim of no deception, just that it now makes sense that this is an agent that has interesting behavior that we can analyze.
I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.
if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours
I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. A “wrong” formalism of “deception” can’t affect agent behavior, because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the danger here; the danger of using a wrong formalism in an argument is straightforward: the argument is garbage.
What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.
And if we can’t even agree informally on a definition of deception, I’m asking: how can we say a proposal has the property of no deception?
This is approximately where I am too btw
An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.
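This setup can be sketched as a simple interaction loop. This is a minimal illustration only, with hypothetical names throughout (`run_episode`, `Step`, the stub policy/environment/evaluator); the point is just that the evaluator is an opaque real-world process that sees the full interaction history before entering each reward, not something the agent's construction needs a formalism of.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    action: str
    observation: str
    reward: float

def run_episode(policy: Callable, env: Callable, evaluator: Callable,
                n_steps: int) -> List[Step]:
    """Run one episode in which the evaluator sees the whole interaction
    history (actions, observations, past rewards) and enters each reward."""
    history: List[Step] = []
    obs = "initial"
    for _ in range(n_steps):
        action = policy(history, obs)
        obs = env(action)
        # The evaluator is a real-world process (e.g. a human at a
        # terminal), modeled here only as an opaque callable over the
        # history -- no mathematical formulation of it is required.
        reward = evaluator(history, action, obs)
        history.append(Step(action, obs, reward))
    return history

# Stub components standing in for the agent, environment, and human evaluator.
policy = lambda history, obs: f"act{len(history)}"
env = lambda action: f"obs-after-{action}"
evaluator = lambda history, action, obs: 1.0 if "act" in action else 0.0

episode = run_episode(policy, env, evaluator, n_steps=3)
```

Here the agent only ever interacts with the rewards the evaluator enters; "deception" never appears anywhere in the construction, which is the point of the argument above.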