The ultimate goal of this project is to exhibit a handcrafted set of model weights for a reasonably non-contrived model architecture which, when tuned with SGD, results in some malicious subnetwork of the model learning a mesa-objective that we specified via some section of the model weights (f, g), completely different from the base objective, without the entire mesa-optimizer falling apart in the process. We haven't succeeded at this goal yet, but I would say this goal is very much to exhibit gradient hacking.
I don’t think redundancy will work. Suppose you have some combining function y = C(f1, f2), continuous everywhere and differentiable almost everywhere (up to a countable set), that takes the outputs from two redundant copies of f and produces some kind of combined output. (If you’re allowed functions that don’t meet the “continuous everywhere, differentiable almost everywhere” requirement, you might as well skip the whole redundancy thing and just use a staircase.) Since this function prevents any gradients from flowing to f1 and f2 when they are equal, it must be that at all points where f1 = f2, ∂y/∂f1 = ∂y/∂f2 = 0. There should also exist at least some f1, f2 where C(f1, f1) ≠ C(f2, f2), since otherwise C no longer depends on the pair of redundant networks at all, which means those networks can’t actually affect what the network does, which defeats the whole point of this in the first place.
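To illustrate the “staircase” aside above: a piecewise-constant function passes information through the forward pass while giving SGD a zero gradient almost everywhere. A minimal sketch (the `staircase` function is a hypothetical example, not from the original discussion):

```python
import math

def staircase(x, step=1.0):
    # Piecewise-constant: the output depends on x, but the derivative
    # is zero everywhere except at the (measure-zero) jump points.
    return step * math.floor(x / step)

def finite_diff(f, x, eps=1e-6):
    # Central finite-difference approximation to f'(x).
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Away from the jumps, SGD sees exactly zero gradient ...
for x in [0.3, 1.7, -2.4]:
    assert finite_diff(staircase, x) == 0.0
# ... even though the forward output still varies with x.
assert staircase(1.7) != staircase(0.3)
```

This is exactly why such functions are excluded by the differentiability requirement: they hide information from the gradient without any redundancy machinery at all.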
Let us then define g(x) = C(x, x). By the chain rule, dg/dx = ∂y/∂f1 + ∂y/∂f2 = 0 for all x, with both partials evaluated at (x, x). This implies that g is a constant function. Therefore, there do not exist f1, f2 where C(f1, f1) ≠ C(f2, f2). This is a contradiction, and therefore C cannot exist.
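A quick numerical check of this argument (a hedged sketch; the particular `C` below is a hypothetical example, not from the original discussion): take a combining function whose partials both vanish on the diagonal f1 = f2. As the derivation predicts, g(x) = C(x, x) then comes out constant.

```python
def C(f1, f2):
    # Hypothetical combining function whose gradients vanish whenever
    # f1 == f2: dC/df1 = 2(f1 - f2), dC/df2 = -2(f1 - f2).
    return (f1 - f2) ** 2

def partials(f1, f2, eps=1e-6):
    # Central finite-difference estimates of both partial derivatives.
    d1 = (C(f1 + eps, f2) - C(f1 - eps, f2)) / (2 * eps)
    d2 = (C(f1, f2 + eps) - C(f1, f2 - eps)) / (2 * eps)
    return d1, d2

for x in [-1.0, 0.0, 2.5]:
    d1, d2 = partials(x, x)
    # Both partials are ~0 on the diagonal, as required ...
    assert abs(d1) < 1e-5 and abs(d2) < 1e-5
    # ... and g(x) = C(x, x) is constant (here identically 0), matching
    # the conclusion: C cannot both hide gradients on the diagonal and
    # still vary along it.
    assert C(x, x) == 0.0
```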
To make sure I understand your notation: f1 is some set of weights, right? If it’s a set of multiple weights, I don’t know what you mean when you write ∂y/∂f1.
There should also exist at least some f1, f2 where C(f1, f1) ≠ C(f2, f2), since otherwise C no longer depends on the pair of redundant networks at all
(I don’t yet understand the purpose of this claim, but it seems wrong to me. If C(f1, f1) = C(f2, f2) for every f1, f2, why does it follow that C(f1, f2) does not depend on f1 and f2 when f1 ≠ f2?)
When I put it in a partial derivative, f1 represents the output of that subnetwork.
I mean that C no longer depends on the outputs of f1 and f2 when they’re equal, which is problematic if the purpose of this scheme is to provide stability for different possible functions.
As I said here, the idea here does not involve having some “dedicated” piece of logic C that makes the model fail if the outputs of the two malicious pieces of logic don’t satisfy some condition.