I was confused about this, so I discussed it with Patrick. My current understanding of the framework is something like this:
models are of type X → Y
criteria are of type (X, Y) → Real
F is something like this:

def F(model, criteria, data):
    # Return the model that best fits the current criteria, and the criteria under which the current model scores best.
    return (max(model_hypothesis_set, key=lambda m: score(m, criteria, data)),
            max(criteria_hypothesis_set, key=lambda c: score(model, c, data)))
So we do an iterated procedure of optimizing the model to fit the criteria, and then optimizing the criteria to make the current model look good. For example, models could be linear regression models (predicting multiple outputs from multiple inputs), and criteria could be weights on outputs that make some errors more costly than others. The question is: are there interesting model/criteria sets that don’t “wirehead” by just making the criteria trivial to satisfy? Patrick says this is related to corrigibility, but I don’t really see the correspondence yet.
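To make the linear-regression example concrete, here is a minimal sketch of the iterated procedure under some hypothetical choices (none of these specifics are from the original proposal): the model is multi-output linear regression, a criterion is a vector of per-output weights, and the score is the negative weighted squared error. Note that the criteria step here wireheads in exactly the way the question worries about, by concentrating all the weight on the output the current model already predicts best.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(100, 2))

def fit_model(X, Y, w):
    # Best linear model under per-output weights w. For unconstrained least squares
    # the outputs decouple, so the weighted and unweighted solutions coincide.
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def fit_criteria(X, Y, B):
    # Best criteria for the current model: put all the weight on the output with the
    # smallest error, which trivially makes the model "look good" (the wirehead move).
    per_output_error = ((Y - X @ B) ** 2).mean(axis=0)
    w = np.zeros_like(per_output_error)
    w[np.argmin(per_output_error)] = 1.0
    return w

w = np.ones(2) / 2          # initial criterion: equal weight on both outputs
for _ in range(5):          # iterate F: model step, then criteria step
    B = fit_model(X, Y, w)
    w = fit_criteria(X, Y, B)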
I don’t have that much to say about the original problem, but would like to note that it is pretty common in machine learning to do the opposite of this: optimize the model to the criteria, and then find new criteria that make the model look as bad as possible:
In boosting: “After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight”
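For instance, the AdaBoost reweighting step looks roughly like this (a sketch, with labels in {-1, +1} and alpha the weak learner's vote weight):

import numpy as np

def reweight(weights, y_true, y_pred, alpha):
    # Misclassified examples (y_true * y_pred = -1) gain weight,
    # correctly classified ones (y_true * y_pred = +1) lose weight.
    weights = weights * np.exp(-alpha * y_true * y_pred)
    return weights / weights.sum()   # renormalize to a distribution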
In optimization (e.g. convex optimization), you can use duality to reframe the original “primal” problem (find a good solution satisfying the constraints) as a “dual” problem (find “weights” on constraints to make the optimal solution as bad as possible). See the part of the Wikipedia article about Lagrange duality to see what I mean by “weights on constraints”. While you can use the primal problem to show lower bounds on the highest possible score (by finding feasible points with that score), you can use the dual problem to show upper bounds on the highest possible score (by finding constraint weights that upper bound the score of the problem with these weights).
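As a toy illustration of those two directions of bounding (my own example, not from the article): maximize f(x) = 4x - x^2 subject to x <= 1. Any feasible x gives a lower bound on the optimum, and any multiplier lam >= 0 gives an upper bound via the dual function g(lam) = max_x [f(x) + lam * (1 - x)].

f = lambda x: 4 * x - x ** 2

# Primal side: feasible points lower-bound the optimal value.
print(f(0.5), f(1.0))          # 1.75 and 3.0; the optimum over x <= 1 is 3.0

# Dual side: for lam >= 0, the inner max over x is attained at x = (4 - lam) / 2,
# giving g(lam) = (4 - lam)**2 / 4 + lam, an upper bound on the optimal value.
g = lambda lam: (4 - lam) ** 2 / 4 + lam
print(g(0.0), g(2.0))          # 4.0 (a loose bound) and 3.0 (tight: strong duality here)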
So the big problem I see with this is that it is still in the optimization framework, assuming that we actually want to optimize the initial criterion. While we can imagine changing the initial criterion, this is already something we can effectively do with RL if we specify our reward to be something communicated by a human overseer (but of course that doesn’t really solve the problem...)
The proposal is reminiscent of the Actor-Critic framework from RL (analogy: actor—model, critic—criterion), which learns a policy (the actor) and a value function (the critic) simultaneously.
In that case, you have the true reward function playing the role of the initial criterion, so you don’t actually get to evaluate the true criterion (which would be something like distance from the optimal policy); you only get what amounts to noisy samples of it. The goal in both cases is to learn a good model (i.e. a policy, in Actor-Critic).
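To make the analogy concrete, here is a minimal sketch of a one-step actor-critic update (tabular critic, softmax actor; the function and its arguments are illustrative, not anything from the proposal): the critic is moved toward the TD target, and the actor takes a policy-gradient step using the TD error as its learning signal.

import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    # Critic: TD(0) update of the state-value estimate.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * td_error
    # Actor: gradient of the log softmax-policy probability of the taken action,
    # scaled by the TD error (an estimate of the advantage).
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi
    return theta, V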
I think there is a conceptual issue with this proposal as it stands, namely that the interplay between the changes in the model and the criterion is not taken into account. E.g. there is no guarantee that recursively applying F to the initial_model using the criteria output by X would give you anything like the model output by X.
The cool thing about Actor-Critic is that you can prove (under suitable assumptions) that this method actually gives you an unbiased estimate of the true policy gradient (Sutton 99: https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf). IIRC, it requires the assumption that the critic is trained to convergence in-between each update of the actor, though.
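For reference, my recollection of the relevant statements from that paper (hand-waving the exact conditions): the policy gradient theorem, plus the compatibility condition on the critic f_w under which substituting f_w for Q^pi keeps the gradient exact, provided w sits at a minimum of the (suitably weighted) squared error:

\nabla_\theta J(\theta) = \sum_s d^{\pi}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a),
\qquad
\nabla_w f_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s).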