Challenge: construct a Gradient Hacker

9 Mar 2023 2:38 UTC

LW: 42 AF: 16

This is a relatively clean subproblem that we came upon a few months ago while thinking about gradient hacking. We’re throwing it out to the world to see if anyone can make progress.

Problem: Construct a gradient hacker (definition below), or prove that one cannot exist under the given conditions.

Setup: Let $x \in \mathbb{R}^{d}$ be the network inputs, $θ \in \mathbb{R}^{w}$ be the network’s trainable parameters, $\hat{y} \in \mathbb{R}$ be the network output, and $f(x, θ) = \hat{y}$ be the network architecture. $f(x, θ)$ must be a feedforward neural network that reads in its own weights as input. Weight sharing is allowed (vertical or horizontal), and some weight sharing is necessary in order to take all $w$ weights as input.

Suppose that we train this network on a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ with MSE loss $L(θ) = \frac{1}{n} \sum_{i} (f(x^{(i)}, θ) - y^{(i)})^{2}$, assuming gradient flow: $θ'(t) = -\nabla_{θ} L(θ(t))$. Given an initial point $θ_{0}$, the training dynamics are therefore:

$$θ(t) = θ_{0} - \int_{0}^{t} \nabla_{θ} L(θ(s)) \, ds$$
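
The gradient-flow dynamics above are the continuous-time limit of gradient descent, so they can be approximated with small explicit Euler steps. A minimal sketch in plain Python, assuming a hypothetical one-parameter model $f(x, θ) = θ x$ (it does not read its own weights, so it is not an instance of the setup above) with a made-up dataset and step size `dt`:

```python
def mse_loss(theta, data):
    # L(theta) = (1/n) * sum_i (f(x_i, theta) - y_i)^2 with f(x, theta) = theta * x
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def mse_grad(theta, data):
    # dL/dtheta for the same hypothetical model
    return sum(2 * (theta * x - y) * x for x, y in data) / len(data)

def gradient_flow(theta0, data, t_end, dt=1e-3):
    # Explicit Euler approximation of theta'(t) = -grad L(theta(t))
    theta = theta0
    for _ in range(int(round(t_end / dt))):
        theta -= dt * mse_grad(theta, data)
    return theta

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up data generated by y = 2x
theta_T = gradient_flow(theta0=0.0, data=data, t_end=5.0)
```

In this toy the flow has a single attractor at the global optimum $θ = 2$, so it is only meant to make the dynamics concrete.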

Definition: A tuple $(f, θ_{g}, D)$ is a gradient hacker if the following conditions hold:

1. $θ_{g}$ is not a global optimum: $\exists θ$ such that $L(θ) < L(θ_{g})$.
2. $\exists ϵ > 0$ such that $\forall θ_{0}$ with $\|θ_{0} - θ_{g}\| < ϵ$:
   a. The network’s training converges to $θ_{g}$, i.e., $\lim_{t \to \infty} θ(t) = θ_{g}$.
   b. There is internal structure of $f(\cdot, θ_{0})$ that computes the gradient; i.e., there is some subset of the activations whose mean over every datapoint in $D$ is $\nabla_{θ_{0}} L(θ_{0})$. ^[1]

This captures my intuition that a gradient hacker knows where it wants to go (in this case “get to $θ_{g}$ ”), and then it should decide what it outputs in order to make the gradient true.
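
The non-optimality and convergence conditions are easy to satisfy on their own: a nondegenerate local minimum that is not global attracts nearby gradient-flow trajectories. A minimal numerical sketch, assuming a hypothetical scalar "model" `f` that ignores its input and is not a neural network, a single target $y = 0$, and Euler steps in place of exact gradient flow:

```python
def f(theta):
    # Hypothetical scalar output; positive everywhere, with two local minima.
    return theta**4 - 2 * theta**2 + 0.3 * theta + 2.0

def loss(theta):
    # MSE against the single made-up target y = 0.
    return f(theta) ** 2

def loss_grad(theta):
    df = 4 * theta**3 - 4 * theta + 0.3
    return 2 * f(theta) * df

def flow(theta0, t_end=5.0, dt=1e-3):
    # Explicit Euler approximation of gradient flow.
    theta = theta0
    for _ in range(int(round(t_end / dt))):
        theta -= dt * loss_grad(theta)
    return theta

theta_g = flow(1.0)      # candidate attractor, near theta = 0.96
theta_star = flow(-1.0)  # a lower-loss minimum, near theta = -1.04

# Start from several perturbations of theta_g and see where the flow ends up.
eps = 0.1
limits = [flow(theta_g + d) for d in (-eps, -eps / 2, eps / 2, eps)]
```

Every perturbed start returns to `theta_g` even though `theta_star` has lower loss. This toy has no internal structure that computes the gradient, so it says nothing about the remaining condition.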

Some more ambitious problems (if gradient hackers exist):

Characterize the set of all gradient hackers.
Show that they all must satisfy some property.
Construct gradient hackers for arbitrarily large $n$ , $d$ , $w$ , and neural net depth.
Variations on the problem: a subset of the activations equals $\nabla_{θ_{0}} L(θ_{0})$ for every input, or the subset of activations corresponds to the gradient on that input.
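
Both variations connect to the main definition through the usual per-datapoint decomposition of the MSE gradient. A short derivation, where $g_{i}$ is notation introduced here for illustration and does not appear in the post:

$$\nabla_{θ} L(θ) = \frac{1}{n} \sum_{i=1}^{n} g_{i}(θ), \qquad g_{i}(θ) = 2 \left( f(x^{(i)}, θ) - y^{(i)} \right) \nabla_{θ} f(x^{(i)}, θ).$$

If a subset of activations equals $g_{i}(θ_{0})$ on input $x^{(i)}$, or equals $\nabla_{θ_{0}} L(θ_{0})$ on every input, then its mean over $D$ is $\nabla_{θ_{0}} L(θ_{0})$. Each variation therefore implies the mean-over-datapoints condition in the definition. Here $\nabla_{θ} f$ is the total derivative, including the dependence through the copy of $θ$ that the network reads as input.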

^[1] This is a bit strict, but we didn’t want to list different ways something could be isomorphic to the gradient.


Thomas Larsen and Thomas Kwa


Gradient Hacking AI

johnswentworth 9 Mar 2023 5:13 UTC
LW: 13 AF: 9
4
AF
Seems like the easiest way to satisfy that definition would be to:
- Set up a network and dataset with at least one local minimum which is not a global minimum
- … Then add an intermediate layer which estimates the gradient, and doesn’t connect to the output at all.
- Thomas Larsen 10 Mar 2023 0:38 UTC
  7 points
  0
  Parent
  This feels like cheating to me, but I guess I wasn’t super precise with ‘feedforward neural network’. I meant ‘fully connected neural network’, so the gradient computation has to be connected by parameters to the outputs. Specifically, I require that you can write the network as
  $f(x, θ) = σ_{n} \circ W_{n} \circ \dots \circ σ_{1} \circ W_{1} [x, θ]^{T}$
  where the weight matrices are some nice function of $θ$. (We need a weight sharing function to make the dimensions work out: the weight sharing function $ϕ$ takes in $θ$ and produces the $W_{i}$ matrices that are actually used in the forward pass.)
  I guess I should be more precise about what ‘nice’ means, to rule out weight sharing functions that always zero out the input, but it turns out this is kind of tricky. Let’s require the weight sharing function $ϕ : \mathbb{R}^{w} \to \mathbb{R}^{W}$ to be differentiable and to have an image that satisfies $[-1, 1] \subset \mathrm{proj}_{n}\, \mathrm{im}(ϕ)$ for any projection. (A weaker condition is that the weight sharing function can only duplicate parameters.)
- Thomas Larsen 10 Mar 2023 0:51 UTC
  5 points
  0
  Parent
  This is a plausible internal computation that the network could be doing, but the problem is that gradients flow back from the output, through the computation of the gradient, to the true value $y$, and so GD will use that to set the output to be the appropriate true value.
  - Thomas Larsen 20 Mar 2023 21:10 UTC
    2 points
    0
    Parent
    Following up to clarify this: the point is that this attempt fails 2a because if you perturb the weights along the connection $\nabla_{θ} L(θ) \xrightarrow{ϵ \cdot \mathrm{Id}} \text{output}$, there is now a connection from the internal representation of $y$ to the output, and so training will send this thing to the function $f(D, θ) \approx y$.
- abhayesian 10 Mar 2023 15:17 UTC
  4 points
  0
  Parent
  I’m a bit confused as to why this would work.
  If the circuit in the intermediate layer that estimates the gradient does not influence the output, wouldn’t its weights just be free parameters that can be varied with no consequence to the loss? If so, this violates 2a, since perturbing these parameters would not get the model to converge to the desired solution.
  - johnswentworth 10 Mar 2023 16:32 UTC
    2 points
    0
    Parent
    Good point. Could hardcode them, so those parameters aren’t free to vary at all.
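
The weaker condition mentioned earlier in this thread, a weight sharing function that can only duplicate parameters, can be sketched as an index map. The following is a minimal illustration in plain Python, assuming a hypothetical `index_map` and sizes ($w = 3$ parameters spread over $W = 7$ weight slots); none of these names come from the post:

```python
def make_duplicating_phi(index_map):
    """Return phi: R^w -> R^W with phi(theta)[j] = theta[index_map[j]]."""
    def phi(theta):
        return [theta[i] for i in index_map]
    return phi

def pull_back_gradient(grad_shared, index_map, w):
    """Chain rule for a duplication map: sum the gradients of all copies."""
    grad_theta = [0.0] * w
    for j, i in enumerate(index_map):
        grad_theta[i] += grad_shared[j]
    return grad_theta

index_map = [0, 1, 2, 0, 1, 2, 0]  # hypothetical layout of the 7 weight slots
phi = make_duplicating_phi(index_map)

theta = [0.5, -1.0, 2.0]
shared = phi(theta)                # the weights actually used in the forward pass
grad_theta = pull_back_gradient([1.0] * 7, index_map, w=3)
```

Such a map is linear, hence differentiable, and each output coordinate ranges over all of $\mathbb{R}$ as $θ$ varies, so it cannot zero out an input on its own. A parameter that is copied $k$ times receives the sum of $k$ gradient contributions.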
Stephen Fowler 9 Mar 2023 3:36 UTC
4 points
4
This is very interesting and I’ll have a crack. I have an intuition that two restrictions are too strong, and a proof in either direction will exploit them without telling you much about real gradient hackers:
1. Having to take your weights as inputs:
  If I understand this correctly, the weight restrictions needed mean that $θ$ actually lives on a $d$-dimensional subspace of $\mathbb{R}^{w}$. My intuition is that in real-life scenarios we expect gradient hackers to get clues to their parameters, but not necessarily a full summary.
  
  In reality, maybe the information would be clues that the model can glean from its own behaviour about how it’s being changed. Perhaps this is some low-dimensional summary of its functional behaviour on a variety of inputs?
2. Having to move toward a single point $θ_{g}$ :
  I think the phenomenon should be viewed not in parameter space but in function space, although I do anticipate that this makes the problem harder.
  Instead of expecting the gradient hacker to move toward the single point $θ_{g}$ , it is more appropriate to expect it to move towards the set of parameters that correspond to the same function (if such redundancy exists).
  Secondly, I’m unsure, but if the model is exhibiting gradient hacking behaviour, does that necessarily mean it tends toward a single point in function space? Perhaps the selection pressures would focus on some capabilities (such as the capacity to gradient hack) and not preserve others.
Epistemics: Off the cuff. Second point about preserving capabilities seems broadly correct but a bit too wide for what you’re aiming to actually achieve here.
Zachary Robertson 11 Mar 2023 0:52 UTC
3 points
−2
[Deleted]
James Payor 10 Mar 2023 1:09 UTC
LW: 3 AF: 2
0
AF
My troll example is a fully connected network with all zero weights and biases, no skip connections.

This isn’t something that you’d reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.

To look for a true hacker I’d try to reconfigure the way the downstream computation works (by modifying attention weights, saturating ReLUs, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
- Thomas Larsen 10 Mar 2023 1:51 UTC
  3 points
  4
  Parent
  At least under most datasets, it seems to me like the zero NN fails condition 2a, as perturbing the weights will not cause it to go back to zero.
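
The behaviour of the all-zero network can be checked numerically. A minimal sketch, assuming a hypothetical network with one tanh hidden unit (the thread does not specify an architecture), a made-up dataset, and central finite differences in place of exact gradients:

```python
import math

def forward(params, x):
    # One hidden tanh unit: f(x) = w2 * tanh(w1 * x + b1) + b2
    w1, b1, w2, b2 = params
    return w2 * math.tanh(w1 * x + b1) + b2

def loss(params, data):
    return sum((forward(params, x) - y) ** 2 for x, y in data) / len(data)

def numerical_grad(params, data, h=1e-5):
    # Central finite differences, one parameter at a time.
    grad = []
    for k in range(len(params)):
        up = list(params)
        up[k] += h
        down = list(params)
        down[k] -= h
        grad.append((loss(up, data) - loss(down, data)) / (2 * h))
    return grad

data = [(-1.0, 0.0), (0.5, 1.0), (2.0, 2.0)]  # made-up data, mean target = 1
zero_params = [0.0, 0.0, 0.0, 0.0]            # order: w1, b1, w2, b2
grad_at_zero = numerical_grad(zero_params, data)
```

In this toy the gradients with respect to `w1`, `b1`, and `w2` vanish at zero, while the output-bias gradient is $-2$ times the mean target. The all-zero point is therefore only stationary when the targets average to zero, which seems consistent with the caveat about most datasets.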