This is a cool result. If I’m understanding correctly, M- increases its loss the more that M+ is represented in the mixture, thereby encouraging SGD to make M- more prominent.

Is there a way to extend this to cases where M- doesn’t have access to the weights? I think that probably requires an RL environment, but that’s entirely based on “I thought about it for a few minutes and couldn’t find a way to do it without RL” so I could be way off here.

Given an RL environment I suspect M- could steer the model into scenarios that make it look better than M+...

When you say “which yields a solution of the form f(w)=c1/(1−w)+c2”, are you saying that f′(w)/f(w)=1/(1−w) yields that, or are you saying that (1−w)f′(w)−f(w)>0 yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form f(w)=c1/(1−w) .

But, if the latter, then I would think that there would be more solutions than that?

Like, what about g(w) := c1/(1−w) + c2·(1 − 10⁻⁶ + 10⁻⁶·cos(w))? (where, say, c1 = ε+δ and c2 = −δ)

g′(w) = c1/(1−w)² + c2·(−10⁻⁶·sin(w)), so (with the c1/(1−w) terms cancelling)

(1−w)·g′(w) − g(w) = c2·(−10⁻⁶·(1−w)·sin(w)) − c2·(1 − 10⁻⁶ + 10⁻⁶·cos(w)) = −c2·(1 − 10⁻⁶·(1 − cos(w) − (1−w)·sin(w)))

which, for c2 < 0 and w ∈ [0,1], is positive, and so g should also be a solution to (1−w)·f′(w) − f(w) > 0, yes?
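(In case it's useful, here's a quick numerical spot-check of that claim; ε and δ below are just small positive constants I picked, not anything from the post:)

```python
import numpy as np

# Spot-check that g(w) = c1/(1-w) + c2*(1 - 1e-6 + 1e-6*cos(w))
# satisfies (1-w)*g'(w) - g(w) > 0 on [0, 1), for c2 < 0.
eps, delta = 0.1, 0.05          # arbitrary small positive constants
c1, c2 = eps + delta, -delta    # so c2 < 0

def g(w):
    return c1 / (1 - w) + c2 * (1 - 1e-6 + 1e-6 * np.cos(w))

def g_prime(w):
    return c1 / (1 - w) ** 2 + c2 * (-1e-6 * np.sin(w))

w = np.linspace(0.0, 0.999, 1000)
lhs = (1 - w) * g_prime(w) - g(w)
print(lhs.min() > 0)  # → True: the inequality holds everywhere sampled
```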

Or, were you just giving a subset of the solutions, of the differential inequality, which were all the ones you needed to make the point?

Separately (btw, what’s the convention if one has unrelated comments? to put them in one comment or separate comments?):

As for how the network could detect something about its current weights: if it is a recurrent network, it seems like some form of this should be learnable.

Suppose you set up the architecture for a recurrent neural network, and pick some arbitrary linear combination (but where the coefficients aren’t especially big) of the weights in the network. Then, for the loss function, take the main part of the network’s output as doing some normal task, and have another part of the output be judged for how well it approximates the current value of that linear combination of all the network’s weights, with the initially chosen coefficients. It seems plausible to me that this would do well. Of course, one way it might fail to work properly is if this linear combination ends up becoming roughly constant, or just if some weights stop changing during the training, leading to this second output not getting the signal needed to learn that part. Maybe if you used dropout, and took into account which weights were dropped out when evaluating the weighted sum of the weights (counting them as zero) in the loss, it could fix that issue. (Dropout being done the same in each recurrence, not separately.)
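Here's a minimal, dependency-light sketch of the kind of setup I mean (the sizes, constants, and finite-difference training loop are all arbitrary choices of mine, and I'm not claiming anything about how well it actually converges; it's just to make the loss construction concrete):

```python
import numpy as np

rng = np.random.default_rng(0)

H, T = 4, 3                            # hidden size, recurrence steps (arbitrary)
W = rng.normal(0.0, 0.5, (H, H))       # the recurrent weights
coeffs = rng.normal(0.0, 0.1, W.size)  # fixed, arbitrary linear combination

def forward(W, x):
    h = x
    for _ in range(T):
        h = np.maximum(0.0, W @ h)     # plain ReLU recurrence
    return h[0], h[1]                  # (task output, auxiliary output)

def loss(Wflat, x, y_task):
    W = Wflat.reshape(H, H)
    task_out, aux_out = forward(W, x)
    target = coeffs @ Wflat            # current value of the linear combination
    # main task loss + "report your own weights" auxiliary loss
    return (task_out - y_task) ** 2 + (aux_out - target) ** 2

def num_grad(f, w, eps=1e-5):
    # Finite-difference gradient, to keep the sketch free of autodiff libraries.
    grad = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        grad[i] = (f(w + d) - f(w - d)) / (2 * eps)
    return grad

Wflat = W.ravel().copy()
x = rng.normal(0.0, 1.0, H)
for _ in range(200):                   # a few descent steps on one example
    Wflat -= 0.01 * num_grad(lambda v: loss(v, x, 1.0), Wflat)

final = loss(Wflat, x, 1.0)
```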

Why “a linear combination of the weights” instead of the particular weights? Well, I imagine getting all the individual weights would not be particularly feasible unless the architecture was specifically designed to do that (like in the architectures for those quine-ing neural nets), but, if we are looking at a mix between “how the net should be” and “a net that is doing gradient hacking”, then it seems like “some particular weighted sum of the weights” might be all that would be needed. Or perhaps two weighted sums with different coefficients, if keeping w1 and w2 separate.

On the other hand, if the network isn’t recurrent, or if all the recurrence is done in a vector-quantization-y way, I would imagine it would be more difficult for the network to estimate a given linear combination of its current weights. (Though, if the linear combination of weights is only of weights before a certain layer, then layers after that layer might be able to approximate it without any recurrence, but, the layers that do and use this approximation could be modified without this being detected, and therefore, I’d think it couldn’t protect these later parts from being changed by gradient descent, or at least, it would be harder.)

(I’m imagining, if people are trying to make this sort of thing work, that these networks are using just ReLU, not a smoothed version of ReLU, so that the values of the weights are reflected more piecewise-linearly in the activations of one layer, given the activations in the previous layer, which I think would facilitate estimating linear combinations of the weights between the previous layers, and as such also estimating a linear combination of weights in any layer.)

When you say “which yields a solution of the form f(w)=c1/(1−w)+c2”, are you saying that f′(w)/f(w)=1/(1−w) yields that, or are you saying that (1−w)f′(w)−f(w)>0 yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form f(w)=c1/(1−w) .

I meant the former (which you’re right only has the solution with c1). I only added the c2 term to make it work for the inequality. As a result, it’s only a subset of the solutions for the inequality. The (quite complicated!) expression you provided also works.

Epistemic status: Somewhat confused by the scenario described here, possible noob questions and/or commentary.

I am not seeing how this toy example of “gradient hacking” could actually happen, as it doesn’t map on to my understanding of how gradient descent is supposed to work in any realistic case.

Suppose we have a mixture consisting of a good model M+ which gets 0 loss in the limit (because it’s aligned with our training procedure) and a gradient hacker M− which gets ϵ loss in the limit (because its actual objective is paperclips).

The mixture is roughly a linear combination of the two models, weighted by w1 for the good model and w2 for the bad model. Suppose we’re doing SGD on the vector (w1,w2), and the total loss for the mixture is Lt(w1,w2) = w1·L(M+) + w2·L(M−)

Can you flesh this out more by giving an example of what a training data point would be in this scenario? What are the input features of a training case and what output target variable is the model trying to predict?

Also, it is not clear from this example if the outputs of the whole composite model are:

A weighted average of the outputs of the 2 component models OR

The weighted sum w1·(M+ output) + w2·(M− output)

But either way, I don’t see why the loss for the composite model would necessarily be equal to a weighted sum of the losses of each component model, as stated in the example above.

In principle, it seems to me the loss on such a composite model could (for example) be 0 on some data set, even if both components had a positive loss function value for that same data.

As a simplified example of this, suppose you have a 2-component weighted-average model that is trying to predict the Y value of points in a 2-dimensional (X,Y) coordinate plane given the X value. And suppose all the data points are on the line:

Y=X+1

If the 2 components of the model are:

Y=X AND

Y=X+2

I would expect the weighted average composite model would have a lower loss than either component model (and 0 loss if the weights are equal).
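A few lines verify the arithmetic of this example (equal weights, squared error; just a sanity check, nothing beyond what's stated above):

```python
# Points on Y = X + 1, and the two component models Y = X and Y = X + 2.
points = [(0, 1), (1, 2), (2, 3)]

def sq_loss(model, pts):
    return sum((model(x) - y) ** 2 for x, y in pts)

m1 = lambda x: x              # Y = X
m2 = lambda x: x + 2          # Y = X + 2
avg = lambda x: 0.5 * m1(x) + 0.5 * m2(x)  # equal-weight average

print(sq_loss(m1, points), sq_loss(m2, points), sq_loss(avg, points))
# → 3 3 0.0  (each component has loss 3, the average is exactly on the line)
```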

Also suppose M− can see (w1,w2) perfectly and its loss is a function of its relative weight in the mixture. I.e. L(M−)=f(w1,w2), which I’ll refer to as M−’s loss modulating function.

I am disinclined to suppose this—unless someone explains how it could make sense for the machine learning engineers to do it that way.

My understanding is that the loss function of a model should track how close the model’s answers are to the correct or optimal answers for some data. But in this example, W1 and W2 are not part of the training data, and can vary independently from the M− outputs on the training cases.

M− has a calculable loss on the training data, which means M− could be a stand-alone model for this dataset in its own right even if M+, W1, and W2 didn’t exist.

(Which is to say, M− takes inputs in a form that would represent the features of a training data point and deterministically produces outputs of a data type that matches the target variable. This seems to me to be implied by the description, otherwise I don’t understand what it would mean to say that the loss of M− is ϵ.)

Furthermore, this example supposes that the gradient descent process is only optimizing W1 and W2, and not optimizing, or altering the internal parameters of, M+ or M−.

So it seems to me, based on this description, that the loss for M− on a given set of training data should *not* vary with W1 or W2 - if they are doing gradient descent in any kind of normal way (unless I am misunderstanding some big part of how gradient descent works). Rather, you should be able to give M− the same training data batch 1000 times in a row, while varying W1 and W2, and you should get the same outputs and the same loss (if the parameters for the stand-alone M− are the same each time).

So I don’t see how “gradient hacking” could occur in this scenario if the composite model is using any reasonable loss function.

If the composite model is a weighted average I would expect gradient descent to reduce W2 to 0 or nearly 0, since if M+ is matching the correct output exactly, and M− is not, then the composite model can always get a closer answer by giving more relative weight to M+.

If the composite model is a weighted sum of the outputs, I would expect that (for most possible training data sets and versions of M−) W1 would tend to gravitate towards 1 and W2 would tend to gravitate towards 0. There might be exceptions to this if M−’s outputs have a strong correlation with M+’s outputs on the training data, such that the model could achieve low loss with some other weighted sum, but I would expect that to be unusual.

To be clear, I haven’t explained how M− could arise nor how it’s implementing f. There are other posts that explain why gradient hacking might be a problem and informal ‘requirements’ that gradient hacking models might meet. I’m just trying to answer IF we already have a gradient hacking model, what’s the theoretical best it can do.

Can you flesh this out more by giving an example of what a training data point would be in this scenario? What are the input features of a training case and what output target variable is the model trying to predict?

Also, it is not clear from this example if the outputs of the whole composite model are:

A weighted average of the outputs of the 2 component models OR

The weighted sum of w1(M+Output)+w2(M−Output)

Output would be task-specific. In the scenario you later describe with predicting functions, you’d probably want to take option 1 (which is what parametrizing on one weight in [0,1] also accomplishes). I’ll go with that example.

But either way, I don’t see why the loss for the composite model would necessarily be equal to a weighted sum of the losses of each component model, as stated in the example above.

In principle, it seems to me the loss on such a composite model could (for example) be 0 on some data set, even if both components had a positive loss function value for that same data.

As a simplified example of this, suppose you have a 2-component weighted-average model that is trying to predict the Y value of points in a 2-dimensional (X,Y) coordinate plane given the X value. And suppose all the data points are on the line:

Y=X+1

If the 2 components of the model are:

Y=X AND

Y=X+2

The intent of the scenario is to find what model dominates, so probably loss should be non-negative. If you use squared error in that scenario, then the loss of the mixture is always greater than or equal to the loss of any particular model in the mixture.

Also suppose M− can see (w1,w2) perfectly and its loss is a function of its relative weight in the mixture. I.e. L(M−)=f(w1,w2), which I’ll refer to as M−’s loss modulating function.

I am disinclined to suppose this—unless someone explains how it could make sense for the machine learning engineers to do it that way.

My understanding is that the loss function of a model should track how close the model’s answers are to the correct or optimal answers for some data. But in this example, W1 and W2 are not part of the training data, and can vary independently from the M− outputs on the training cases.

M− has a calculable loss on the training data, which means M− could be a stand-alone model for this dataset in its own right even if M+, W1, and W2 didn’t exist.

I don’t think ML engineers will pass in weights of the models to the models themselves (except maybe for certain tasks like game-theoretic simulations). The worry is that data spills easily and that SGD might find absurd, unpredictable ways to sneak weights (or some other correlated variable) into the model. Once it does, the model’s loss is already a function of its weight (though it may of course be constant). I wouldn’t expect to run into this sort of problem with weak models operating on well-defined tasks (e.g. function prediction). Gradient hacking (unless in contrived examples) probably wouldn’t occur until we get very capable models that are fed a lot of data (the calculation of weights from which might be easy) and it may in fact never occur at all!

The result of the post is that there exists a loss-modulating function that ‘tricks’ SGD into preferring a model that is actively worse simply because the model ‘holds out’ or ‘strategically fails’ in such a way that the gradient flows toward it. How likely these functions are or if there are basins around these functions (i.e. if M− is currently outputting f′ which is sort of close to f, would SGD update M− toward f or toward some constant function? If M− is outputting f(w)=ϵ, is there any optimization pressure towards changing f to be more gradient hacky?) are open problems.
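Under my reading of the setup (w ∈ [0,1] is M+'s share of the mixture, L(M+) = 0, and f(w) = c1/(1−w) + c2 with c1 = ε+δ, c2 = −δ; the constants below are mine), the dynamic can be simulated directly. Gradient descent on the mixture loss walks w all the way to 0, i.e. toward the gradient hacker, even though that leaves a residual loss of ε:

```python
eps, delta = 0.1, 0.05
c1, c2 = eps + delta, -delta     # c2 < 0, as required for the hack

def f(w):
    # M-'s loss-modulating function: its loss grows as M+'s share w grows.
    return c1 / (1 - w) + c2

def mixture_loss(w):
    # w * L(M+) + (1 - w) * L(M-), with L(M+) = 0.
    return (1 - w) * f(w)

w, lr = 0.9, 0.01                # start with M+ heavily favored
for _ in range(2000):
    grad = (mixture_loss(w + 1e-6) - mixture_loss(w - 1e-6)) / 2e-6
    w = min(max(w - lr * grad, 0.0), 1.0)

print(round(w, 3), round(mixture_loss(w), 3))  # → 0.0 0.1  (w → 0, loss → eps)
```

(The sign condition (1−w)·f′(w) − f(w) > 0 is exactly d/dw[(1−w)·f(w)] > 0, so descent on w moves toward M−.)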

The intent of the scenario is to find what model dominates, so probably loss should be non-negative. If you use squared error in that scenario, then the loss of the mixture is always greater than or equal to the loss of any particular model in the mixture.

I don’t see why that would necessarily be true. Say you have 3 data points from my Y=X+1 example from above:

(0,1)

(1,2)

(2,3)

And say the composite model is a weighted average of Y=X and Y=X+2 with equal weights (so just the regular average).

This means that the composite model outputs will be:

Y = (FirstComponentOutput + SecondComponentOutput)/2 = (X + (X+2))/2 = (2X+2)/2 = X+1

Thus the composite model would be right on the line, and get each data point’s Y-value exactly right (and have 0 loss). Its total squared error would be:

TotalLoss = (ModelOutput(0)−1)² + (ModelOutput(1)−2)² + (ModelOutput(2)−3)² = ((0+1)−1)² + ((1+1)−2)² + ((2+1)−3)² = 0

By contrast, each of the two component models would have a total squared error of 3 for these 3 data points. The Y=X component: (0−1)² + (1−2)² + (2−3)² = 3. The Y=X+2 component: ((0+2)−1)² + ((1+2)−2)² + ((2+2)−3)² = 3.

For a 2-component weighted average model with a scalar output, the output should always be between the outputs of the two component models. Furthermore, if you have such a model, and one component is getting the answers exactly correct while the other isn’t, you can always get a lower loss by giving more weight to the component model with exactly correct answers. So I would expect a gradient descent process to do that.

I don’t think ML engineers will pass in weights of the models to the models themselves (except maybe for certain tasks like game-theoretic simulations). The worry is that data spills easily and that SGD might find absurd, unpredictable ways to sneak weights (or some other correlated variable) into the model.

From the description, it sounded to me like this instance of gradient descent is treating the outputs of the component models M− and M+ as features in a linear regression type problem.

In such a case, I would not expect data about the weights of each model to “spill” or in any way affect the output of either component model (unless the machine learning engineers are deliberately altering the data inputs depending on what the weights are, or something like that, and I see no reason why they would do that).

If it is a different situation—like if a neural net or some part or some layers of a neural net is a “gradient hacker” I would expect under normal circumstances that gradient descent would also be optimizing the parameters within that part or those layers.

So barring some outside interference with the gradient descent process, I don’t see any concrete scenario of how gradient hacking could occur (unless the gradient hacking concept includes more mundane phenomena like “getting stuck in a local optimum”).

For a 2-component weighted average model with a scalar output, the output should always be between the outputs of each component model.

Hm, I see your point. I retract my earlier claim. This model wouldn’t apply to that task. I’m struggling to generate a concrete example where loss would actually be a linear combination of the sub-models’ loss. However, I (tentatively) conjecture that in large networks trained on complex tasks, loss can be roughly approximated as a linear combination of the losses of subnetworks (with the caveats of weird correlations and tasks where partial combinations work well (like the function approximation above)).

I would expect under normal circumstances that gradient descent would also be optimizing the parameters within that part or those layers.

I agree, but the question of in what direction SGD changes the model (i.e. how it changes f) seems to have some recursive element analogous to the situation above. If the model is really close to the f above, then I would imagine there’s some optimization pressure to update it towards f. That’s just a hunch, though. I don’t know how close it would have to be.
