Oh, I see. The reason my argument is wrong is that, while for a specific (s0, prior, u) the optimal policy is independent of the evaluator, you don’t get to choose a separate policy for each (s0, prior, u): you have to use the evaluator to distinguish which case you are in, and then specialize your policy to that case.
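(A rough way to write out that correction, with notation mine rather than from the original discussion: for a fixed case, the optimal policy is

$$\pi^*_{(s_0,\,\text{prior},\,u)} \;=\; \arg\max_{\pi}\; \mathbb{E}\big[\,u \mid s_0, \pi\,\big],$$

which indeed makes no reference to the evaluator. But the deployed agent is a single policy $\pi(h)$ over histories $h$, shared across all $(s_0, \text{prior}, u)$; the only part of $h$ that distinguishes the cases is the evaluator’s feedback, so matching each $\pi^*_{(s_0,\,\text{prior},\,u)}$ forces $\pi$ to depend on the evaluator after all.)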
To me it looks closer to the Value Learning Agent in that paper, and maybe it can be considered an implementation / specific instance of that?
I think I intuitively agree, but I also haven’t checked it formally. The point about no-deception seems similar to the point about observation-utility maximizers not wanting to wirehead: this agent also ends up learning which utility function is the right one, and in that sense is like the Value Learning agent.
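(For reference, my paraphrase of the Value Learning agent from Dewey’s “Learning What to Value” — not the paper’s exact notation — is that it picks actions by averaging over candidate utility functions weighted by the evidence so far:

$$a^* \;=\; \arg\max_{a}\; \sum_{u}\; P(u \mid h)\;\mathbb{E}\big[\,u \mid h, a\,\big],$$

so tampering with the feedback channel can shift the posterior $P(u \mid h)$ but doesn’t change which $u$ is actually correct, which is where the no-wireheading intuition comes from.)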
I still don’t understand the details, so maybe my opinion will change if I sit down and look at it more carefully. But I’m suspicious of this being a clean incentive improvement that gets us what we want: defining the evaluator is a fuzzy problem as I understand it, as is even informally agreeing on what counts as deception of a less capable evaluator. In general, it seems that if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours.
To be clear, I’m not saying I believe the claim of no deception, just that it now makes sense that this is an agent with interesting behavior that we can analyze.

This is approximately where I am too, btw.
I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.
“if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours”
I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. And a “wrong” formalism of “deception” can’t affect agent behavior, because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the danger here; the danger of using the wrong formalism in an argument is straightforward: the argument is garbage.
What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.
And if we can’t agree informally on a definition of deception, I’m asking: how can we say a proposal has that property?
An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.
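In case it’s useful, here’s a minimal sketch of that loop in Python. All names (`agent`, `env`, `evaluator`, and their methods) are hypothetical interfaces I made up to pin down the protocol, not from any particular library:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str        # what the agent did
    observation: str   # what the environment returned
    reward: float      # what the evaluator entered after seeing everything above

def run_episode(agent, env, evaluator, n_steps: int) -> list[Step]:
    """Evaluator-in-the-loop protocol: at each step the evaluator sees the
    whole interaction history (actions, observations, past rewards) and
    enters the next reward by hand."""
    history: list[Step] = []
    for _ in range(n_steps):
        action = agent.act(history)            # agent conditions on the full history
        observation = env.step(action)         # environment responds
        reward = evaluator.rate(history, action, observation)  # human enters a reward
        history.append(Step(action, observation, reward))
    return history
```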