This post is essentially the summary of a long discussion on the EleutherAI discord about trying to exhibit gradient hacking in real models by hand crafting an example.

I wouldn’t say that this work is attempting to “exhibit gradient hacking”. (Succeeding in that would require creating a model that can actually model SGD.) Rather, my understanding is that this work is trying to demonstrate techniques that might be used in a gradient hacking scenario.

There are a few ways to protect a subnetwork f from being modified by gradient descent that I can think of (non-exhaustive list):

Another way of “protecting” a piece of logic in the network from changes (if we ignore regularization) is by redundancy: Suppose there are two different pieces of logic in the network such that each independently makes the model output what the mesa optimizer wants. Due to the redundancy, changing any single weight—that is associated with one of those two pieces of logic—does not change the output, and thus the gradient components of all those weights should be close to zero.

Due to the redundancy, changing any single weight—that is associated with one of those two pieces of logic—does not change the output.

You seem to be under the impression that the goal is to make the NN robust to single-weight perturbation. But gradient descent doesn’t modify a neural network one weight at a time, and so being robust to single-weight modification doesn’t come with any real guarantees. The backward pass could result in weights of both forks being updated.

The ultimate goal of this project is to exhibit a handcrafted set of model weights for a reasonably noncontrived model architecture which, when tuned with SGD, results in some malicious subnetwork in the model learning some mesaobjective that we specified via some section of the model weights (f,g) completely different from the base objective, without the entire mesaoptimizer falling apart in the process. We haven’t succeeded at this goal yet but I would say this goal is very much to exhibit gradient hacking.

I don’t think redundancy will work. Suppose you have some continuous everywhere, differentiable countably-almost everywhere combining function $y = C(f_1, f_2)$ that takes the outputs from two redundant copies of $f$ and outputs some kind of combined output. (If you’re allowed functions that don’t meet the “continuous everywhere, differentiable countably-almost everywhere” requirement, you might as well skip the whole redundancy thing and just use a staircase.) Since this function prevents any gradients from reaching $f_1$ and $f_2$ when they are equal, it must be that at all points where $f_1 = f_2$, $\frac{\partial y}{\partial f_1} = \frac{\partial y}{\partial f_2} = 0$. There should also exist at least some $f_1, f_2$ where $C(f_1, f_1) \neq C(f_2, f_2)$, since otherwise $C$ no longer depends on the pair of redundant networks at all, which means that those networks can’t actually affect what the network does, which defeats the whole point of this in the first place.

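Let us then define $g(x) = C(x, x)$. Then, by the chain rule, $\frac{dg}{dx} = \frac{\partial y}{\partial f_1} + \frac{\partial y}{\partial f_2} = 0$ for all $x$, with both partials evaluated at $f_1 = f_2 = x$. This implies that $g$ is a constant function. Therefore, there do not exist $f_1, f_2$ where $C(f_1, f_1) \neq C(f_2, f_2)$. This is a contradiction, and therefore $C$ cannot exist.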
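The two steps of this argument can be checked numerically. The following is only a minimal sketch using finite differences; the combiners `C_smooth` and `C_flat_on_diagonal` are hypothetical examples chosen for illustration and do not appear in the discussion. It checks that the derivative of $g(x) = C(x, x)$ equals the sum of the two partials on the diagonal, and that a combiner whose partials vanish on the diagonal gives the same value of $C(x, x)$ at every $x$.

```python
# Finite-difference sketch of the argument above. The combiner functions are
# hypothetical illustrations, not anything proposed in the thread.

def partial(fn, a, b, wrt, h=1e-6):
    """Central-difference estimate of dC/df1 (wrt=0) or dC/df2 (wrt=1) at (a, b)."""
    if wrt == 0:
        return (fn(a + h, b) - fn(a - h, b)) / (2 * h)
    return (fn(a, b + h) - fn(a, b - h)) / (2 * h)


def diagonal_derivative(fn, x, h=1e-6):
    """Central-difference estimate of dg/dx for g(x) = C(x, x)."""
    return (fn(x + h, x + h) - fn(x - h, x - h)) / (2 * h)


def C_smooth(a, b):
    # Arbitrary smooth combiner, used only to check the chain-rule identity.
    return a * b + 0.5 * a


def C_flat_on_diagonal(a, b):
    # Both partials vanish wherever a == b, as the redundancy scheme requires.
    return (a - b) ** 2


points = [-1.5, 0.0, 0.7, 2.0]

# Chain rule: dg/dx should match the sum of the two partials evaluated at (x, x).
chain_rule_gaps = [
    abs(
        diagonal_derivative(C_smooth, x)
        - (partial(C_smooth, x, x, 0) + partial(C_smooth, x, x, 1))
    )
    for x in points
]

# With zero partials on the diagonal, C(x, x) is the same for every x.
diagonal_values = [C_flat_on_diagonal(x, x) for x in points]
diagonal_partials = [
    (partial(C_flat_on_diagonal, x, x, 0), partial(C_flat_on_diagonal, x, x, 1))
    for x in points
]

print("largest chain-rule gap:", max(chain_rule_gaps))
print("C(x, x) on the diagonal:", diagonal_values)
print("partials on the diagonal:", diagonal_partials)
```

The second combiner satisfies the zero-gradient condition only by making $C(x, x)$ independent of $x$, which is the situation the contradiction rules out as useless.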

To make sure I understand your notation, $f_1$ is some set of weights, right? If it’s a set of multiple weights I don’t know what you mean when you write $\frac{\partial y}{\partial f_1}$.

There should also exist at least some $f_1, f_2$ where $C(f_1, f_1) \neq C(f_2, f_2)$, since otherwise $C$ no longer depends on the pair of redundant networks at all

(I don’t yet understand the purpose of this claim, but it seems wrong to me. If $C(f_1, f_1) = C(f_2, f_2)$ for every $f_1, f_2$, why is it true that $C(f_1, f_2)$ does not depend on $f_1$ and $f_2$ when $f_1 \neq f_2$?)

When I put it in a partial derivative, $f_1$ represents the output of that subnetwork.

I mean that $C$ no longer depends on the outputs of $f_1$ and $f_2$ when they’re equal, which is problematic if the purpose of this scheme is to provide stability for different possible functions.

As I said here, the idea here does not involve having some “dedicated” piece of logic C that makes the model fail if the outputs of the two malicious pieces of logic don’t satisfy some condition.

You seem to be under the impression that the goal is to make the NN robust to single-weight perturbation. But gradient descent doesn’t modify a neural network one weight at a time, and so being robust to single-weight modification doesn’t come with any real guarantees. The backward pass could result in weights of both forks being updated.

Sure, but the gradient component that is associated with a given weight is still zero if updating that weight alone would not affect loss.
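A toy example may help separate the two claims being made here. This is only a sketch under stated assumptions: the loss `(a - 1) * (b - 1)` is a hypothetical stand-in, not a model from the discussion. At the point `(1, 1)`, changing either weight alone leaves the loss unchanged and both gradient components are zero, yet changing both weights together does change the loss.

```python
def loss(a, b):
    # Hypothetical toy loss with a saddle at (1, 1); chosen purely for illustration.
    return (a - 1.0) * (b - 1.0)


a0, b0, step = 1.0, 1.0, 0.1

base = loss(a0, b0)
only_a = loss(a0 + step, b0)          # perturb the first weight alone
only_b = loss(a0, b0 + step)          # perturb the second weight alone
both = loss(a0 + step, b0 + step)     # perturb both weights together

# Central-difference estimates of the two gradient components at (a0, b0).
h = 1e-6
grad_a = (loss(a0 + h, b0) - loss(a0 - h, b0)) / (2 * h)
grad_b = (loss(a0, b0 + h) - loss(a0, b0 - h)) / (2 * h)

print("loss at base point:", base)
print("loss after single-weight changes:", only_a, only_b)
print("loss after joint change:", both)
print("gradient estimate:", (grad_a, grad_b))
```

In this toy case the gradient at that point is zero, so a plain gradient step would not move either weight, while the joint sensitivity shows that single-weight robustness says nothing about what happens once the weights have moved off that point.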

What do you think the gradient of min(x, y) is?
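For reference, the question can be answered numerically. The sketch below uses finite differences on Python's built-in `min`; the helper name `grad_min` is hypothetical. Away from ties, the whole gradient goes to whichever argument is currently smaller. At a tie the one-sided slopes disagree, so `min` is not differentiable there, and whatever an autodiff framework returns at that point is a convention rather than a true gradient.

```python
def grad_min(x, y, h=1e-6):
    """Central-difference gradient estimate of min(x, y); meaningful only where x != y."""
    dx = (min(x + h, y) - min(x - h, y)) / (2 * h)
    dy = (min(x, y + h) - min(x, y - h)) / (2 * h)
    return dx, dy


g_x_smaller = grad_min(1.0, 3.0)   # x is the smaller argument
g_y_smaller = grad_min(3.0, 1.0)   # y is the smaller argument

# At a tie, compare the one-sided slopes with respect to x.
h = 1e-6
tie = 2.0
slope_right = (min(tie + h, tie) - min(tie, tie)) / h   # increasing x: min stays at y
slope_left = (min(tie, tie) - min(tie - h, tie)) / h    # decreasing x: min follows x

print("gradient when x < y:", g_x_smaller)
print("gradient when y < x:", g_y_smaller)
print("one-sided slopes in x at a tie:", slope_left, slope_right)
```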
