I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.
if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours
I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. A “wrong” formalism of “deception” can’t affect agent behavior because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the dangers of using a wrong formalism in an argument; the dangers of using the wrong formalism in an argument are straightforward: the argument is garbage.
What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.
And if we can’t agree informally on deception’s definition, I’m saying “how can we say a proposal has the property”.
I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.
I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. A “wrong” formalism of “deception” can’t affect agent behavior because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the dangers of using a wrong formalism in an argument; the dangers of using the wrong formalism in an argument are straightforward: the argument is garbage.
What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.
And if we can’t agree informally on deception’s definition, I’m saying “how can we say a proposal has the property”.
An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.