Compute the small change in data $dx$ which would induce a small change in trained parameter values $d\theta$ along each of the narrowest directions of the ridge in the loss landscape (i.e. eigenvectors of the Hessian with largest eigenvalue).
Can you unroll that?
“Small change in data” = one additional training sample is slightly modified? “Induce” = via an SGD update step on that additional training sample? Why is there a ridge in the loss landscape? What are “the narrowest directions”?
The easiest operationalization starts from the assumption that we train to zero loss. From there, we can calculate the small change in optimal parameter values $d\theta$ due to a small change in all the data $dx$:
$$\left(-\sum_n \frac{df_n}{d\theta}\,\frac{d^2 L_n}{(df_n)^2}\,\frac{df_n}{d\theta}\right) d\theta = \sum_n \frac{df_n}{d\theta}\,\frac{d^2 L_n}{(df_n)^2}\,\frac{df_n}{dx_n}\, dx_n$$
… where:

- $f_n(\theta, x_n)$ is the network output on datapoint $n$
- $L_n(f_n)$ is the loss on datapoint $n$
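As a concrete sanity check, here's a minimal numpy sketch of this formula (the specific model and numbers are assumptions for illustration): a linear predictor with squared loss, kept exactly determined so the Hessian is invertible and the first-order prediction can be compared against an exact re-fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): linear model
# f_n(theta, x_n) = theta @ x_n with squared loss L_n = (f_n - y_n)^2,
# fit to zero loss. P = N keeps the Hessian invertible so the
# first-order prediction can be checked against an exact re-fit.
P = N = 5
X = rng.standard_normal((N, P))            # datapoints x_n as rows
y = rng.standard_normal(N)
theta = np.linalg.solve(X, y)              # zero-loss parameters

# Ingredients of the formula at zero loss:
# df_n/dtheta = x_n, df_n/dx_n = theta, d^2 L_n / (df_n)^2 = 2.
H = sum(2.0 * np.outer(x, x) for x in X)   # loss Hessian (Gauss-Newton form)

dX = 1e-8 * rng.standard_normal((N, P))    # small change dx_n to every datapoint
rhs = sum(2.0 * x * (theta @ dx) for x, dx in zip(X, dX))
dtheta = np.linalg.solve(-H, rhs)          # solve (-H) dtheta = sum_n (...) dx_n

# Compare against re-fitting to zero loss on the perturbed data.
dtheta_direct = np.linalg.solve(X + dX, y) - theta
rel_err = np.linalg.norm(dtheta - dtheta_direct) / np.linalg.norm(dtheta_direct)
print(rel_err < 1e-4)   # first-order prediction matches the exact re-fit
```

In the genuinely overparameterized case the Hessian is singular along the ridge, so one would solve with a pseudoinverse (e.g. `np.linalg.lstsq`) and only pin down the component of $d\theta$ across the ridge.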
(More generally, when calculating $\max_\theta u(\theta, x)$, the change in the optimal $\theta$ from a small change in $x$ is given by $\frac{d^2u}{d\theta^2}\,d\theta = -\frac{d^2u}{d\theta\,dx}\,dx$.)
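This general formula is easy to verify on a scalar toy example (the particular $u$ below is an assumption chosen so the optimum has a closed form, $\theta^*(x) = \sqrt{x}$):

```python
import numpy as np

# Scalar check of the formula d^2u/dtheta^2 * dtheta = -d^2u/(dtheta dx) * dx.
# Toy objective (an assumption): u(theta, x) = x*theta - theta**3 / 3,
# maximized where du/dtheta = x - theta^2 = 0, i.e. theta*(x) = sqrt(x).
x, dx = 2.0, 1e-6
theta_star = np.sqrt(x)

d2u_dtheta2 = -2.0 * theta_star               # second derivative at the optimum
d2u_dthetadx = 1.0                            # mixed partial d^2u/(dtheta dx)
dtheta = -d2u_dthetadx * dx / d2u_dtheta2     # rearranged formula

dtheta_direct = np.sqrt(x + dx) - np.sqrt(x)  # move the optimum directly
print(abs(dtheta - dtheta_direct) < 1e-12)    # agree to first order
```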
The “narrowest directions” are the eigenvectors of the loss Hessian with largest eigenvalue (where the loss Hessian is $\sum_n \frac{df_n}{d\theta}\frac{d^2 L_n}{(df_n)^2}\frac{df_n}{d\theta}$, i.e. the matrix on the LHS of the formula above, up to sign). And there’s a ridge in the loss landscape because, if we’re training to zero loss, then presumably we’re in the overparameterized regime, so the Hessian has many zero eigenvalues: the loss is flat along the ridge and curves up sharply across it.
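A toy illustration of that spectrum (the numbers are assumptions): the Gauss-Newton Hessian of an overparameterized zero-loss fit has rank at most $N$, so $P - N$ eigenvalues vanish (the flat ridge directions), while the top eigenvectors are the narrowest directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Hessian (numbers are assumptions): P = 10 parameters, N = 4 datapoints,
# so the Gauss-Newton matrix sum_n (df_n/dtheta) * 2 * (df_n/dtheta)^T
# has rank at most 4.
P, N = 10, 4
J = rng.standard_normal((N, P))              # rows play the role of df_n/dtheta
H = sum(2.0 * np.outer(j, j) for j in J)

evals, evecs = np.linalg.eigh(H)             # eigenvalues in ascending order
print(int(np.sum(evals < 1e-8)))             # P - N = 6 flat "ridge" directions
narrowest = evecs[:, -1]                     # eigenvector with largest eigenvalue
```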
Note: I think what you’re doing there is asking what incremental change in the training data uniquely strengthens the influence of one feature in the network without touching the others.
The “pointiest directions” in parameter space correspond to the biggest features in the orthogonalised feature set of the network.
So I’d agree with the prediction that if you calculate which $d\theta$ the $dx$ corresponds to in the second network, you’d often find that it’s close to being an eigenvector/most prominent orthogonalised feature of the second network too, because we know that neural networks tend to learn similar features when trained on similar tasks.
I think it might be interesting to see whether actually modifying the training data in the $dx$ direction would tend to give you a network where the corresponding feature is more prominent, and how large $dx$ can get before that ceases to hold.
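One cheap way to probe the "how large can $dx$ get" question, at least in the linear toy setting from above (again, all specifics are assumptions): sweep the perturbation scale and compare the first-order prediction for $d\theta$ against an exact re-fit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Reuse the linear zero-loss toy (all specifics assumptions): fix a data
# perturbation direction, scale it up, and watch the first-order
# prediction for dtheta drift away from the exact re-fit.
P = N = 5
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)
theta = np.linalg.solve(X, y)
H = sum(2.0 * np.outer(x, x) for x in X)
dX = rng.standard_normal((N, P))             # unit-scale direction in data space

errs = []
for scale in [1e-4, 1e-2, 1e-1]:
    rhs = sum(2.0 * x * (theta @ (scale * dx)) for x, dx in zip(X, dX))
    pred = np.linalg.solve(-H, rhs)                      # predicted dtheta
    actual = np.linalg.solve(X + scale * dX, y) - theta  # exact re-fit
    errs.append(np.linalg.norm(pred - actual) / np.linalg.norm(actual))
    print(f"scale {scale:7.0e}: relative error {errs[-1]:.1e}")
```

The relative error grows roughly linearly with the scale, which gives a crude estimate of where the linearization stops being trustworthy; the analogous experiment on a real network would retrain from scratch at each scale.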