Compute the small change in data $dx$ which would induce a small change in trained parameter values $d\theta$ along each of the narrowest directions of the ridge in the loss landscape (i.e. eigenvectors of the Hessian with largest eigenvalue).
Can you unroll that?
“Small change in data” = one additional training sample is slightly modified? “Induce” = via an SGD update step on that additional training sample? Why is there a ridge in the loss landscape? What are “the narrowest directions”?
The easiest operationalization starts from the assumption that we train to zero loss. From there, we can calculate the small change in optimal parameter values $d\theta$ due to a small change in all the data $dx$:
$$\left(-\sum_n \frac{df_n}{d\theta}\,\frac{d^2 L_n}{(df_n)^2}\,\frac{df_n}{d\theta}\right) d\theta = \sum_n \frac{df_n}{d\theta}\,\frac{d^2 L_n}{(df_n)^2}\,\frac{df_n}{dx_n}\, dx_n$$
… where:

- $f_n(\theta, x_n)$ is the network output on datapoint $n$
- $L_n(f_n)$ is the loss on datapoint $n$
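As a concrete sanity check, here's a minimal numpy sketch of this formula (the specific model and numbers are assumptions for illustration): a linear predictor with squared loss, kept exactly determined so the Hessian is invertible and the first-order prediction can be compared against an exact re-fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): linear model
# f_n(theta, x_n) = theta @ x_n with squared loss L_n = (f_n - y_n)^2,
# fit to zero loss. P = N keeps the Hessian invertible so the
# first-order prediction can be checked against an exact re-fit.
P = N = 5
X = rng.standard_normal((N, P))            # datapoints x_n as rows
y = rng.standard_normal(N)
theta = np.linalg.solve(X, y)              # zero-loss parameters

# Ingredients of the formula at zero loss:
# df_n/dtheta = x_n, df_n/dx_n = theta, d^2 L_n / (df_n)^2 = 2.
H = sum(2.0 * np.outer(x, x) for x in X)   # loss Hessian (Gauss-Newton form)

dX = 1e-8 * rng.standard_normal((N, P))    # small change dx_n to every datapoint
rhs = sum(2.0 * x * (theta @ dx) for x, dx in zip(X, dX))
dtheta = np.linalg.solve(-H, rhs)          # solve (-H) dtheta = sum_n (...) dx_n

# Compare against re-fitting to zero loss on the perturbed data.
dtheta_direct = np.linalg.solve(X + dX, y) - theta
rel_err = np.linalg.norm(dtheta - dtheta_direct) / np.linalg.norm(dtheta_direct)
print(rel_err < 1e-4)   # first-order prediction matches the exact re-fit
```

In the genuinely overparameterized case the Hessian is singular along the ridge, so one would solve with a pseudoinverse (e.g. `np.linalg.lstsq`) and only pin down the component of $d\theta$ across the ridge.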
(More generally, when calculating $\max_\theta u(\theta, x)$, the change in the optimal $\theta$ from a small change in $x$ is given by $\frac{d^2u}{d\theta^2}\,d\theta = -\frac{d^2u}{d\theta\,dx}\,dx$.)
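This general formula is easy to verify on a scalar toy example (the particular $u$ below is an assumption chosen so the optimum has a closed form, $\theta^*(x) = \sqrt{x}$):

```python
import numpy as np

# Scalar check of the formula d^2u/dtheta^2 * dtheta = -d^2u/(dtheta dx) * dx.
# Toy objective (an assumption): u(theta, x) = x*theta - theta**3 / 3,
# maximized where du/dtheta = x - theta^2 = 0, i.e. theta*(x) = sqrt(x).
x, dx = 2.0, 1e-6
theta_star = np.sqrt(x)

d2u_dtheta2 = -2.0 * theta_star               # second derivative at the optimum
d2u_dthetadx = 1.0                            # mixed partial d^2u/(dtheta dx)
dtheta = -d2u_dthetadx * dx / d2u_dtheta2     # rearranged formula

dtheta_direct = np.sqrt(x + dx) - np.sqrt(x)  # move the optimum directly
print(abs(dtheta - dtheta_direct) < 1e-12)    # agree to first order
```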
The “narrowest directions” are the eigenvectors of the loss Hessian with largest eigenvalue (where the loss Hessian is $\sum_n \frac{df_n}{d\theta}\frac{d^2 L_n}{(df_n)^2}\frac{df_n}{d\theta}$, i.e. the matrix on the LHS of the formula above, up to sign). And there’s a ridge in the loss landscape because, if we’re training to zero loss, then presumably we’re in the overparameterized regime, so the Hessian has many zero eigenvalues: the loss is flat along the ridge and curves up sharply across it.
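A toy illustration of that spectrum (the numbers are assumptions): the Gauss-Newton Hessian of an overparameterized zero-loss fit has rank at most $N$, so $P - N$ eigenvalues vanish (the flat ridge directions), while the top eigenvectors are the narrowest directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Hessian (numbers are assumptions): P = 10 parameters, N = 4 datapoints,
# so the Gauss-Newton matrix sum_n (df_n/dtheta) * 2 * (df_n/dtheta)^T
# has rank at most 4.
P, N = 10, 4
J = rng.standard_normal((N, P))              # rows play the role of df_n/dtheta
H = sum(2.0 * np.outer(j, j) for j in J)

evals, evecs = np.linalg.eigh(H)             # eigenvalues in ascending order
print(int(np.sum(evals < 1e-8)))             # P - N = 6 flat "ridge" directions
narrowest = evecs[:, -1]                     # eigenvector with largest eigenvalue
```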
Note: I think what you’re doing there is asking what incremental change in the training data uniquely strengthens the influence of one feature in the network without touching the others.
The “pointiest directions” in parameter space correspond to the biggest features in the orthogonalised feature set of the network.
So I’d agree with the prediction that if you calculate which $d\theta$ the $dx$ corresponds to in the second network, you’d often find that it’s close to being an eigenvector/most prominent orthogonalised feature of the second network too, because we know that neural networks tend to learn similar features when trained on similar tasks.
I think it might be interesting to see whether actually modifying the training data in the $dx$ direction would tend to give you a network where the corresponding feature is more prominent, and how large $dx$ can get before that ceases to hold.
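One cheap way to probe the "how large can $dx$ get" question, at least in the linear toy setting from above (again, all specifics are assumptions): sweep the perturbation scale and compare the first-order prediction for $d\theta$ against an exact re-fit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Reuse the linear zero-loss toy (all specifics assumptions): fix a data
# perturbation direction, scale it up, and watch the first-order
# prediction for dtheta drift away from the exact re-fit.
P = N = 5
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)
theta = np.linalg.solve(X, y)
H = sum(2.0 * np.outer(x, x) for x in X)
dX = rng.standard_normal((N, P))             # unit-scale direction in data space

errs = []
for scale in [1e-4, 1e-2, 1e-1]:
    rhs = sum(2.0 * x * (theta @ (scale * dx)) for x, dx in zip(X, dX))
    pred = np.linalg.solve(-H, rhs)                      # predicted dtheta
    actual = np.linalg.solve(X + scale * dX, y) - theta  # exact re-fit
    errs.append(np.linalg.norm(pred - actual) / np.linalg.norm(actual))
    print(f"scale {scale:7.0e}: relative error {errs[-1]:.1e}")
```

The relative error grows roughly linearly with the scale, which gives a crude estimate of where the linearization stops being trustworthy; the analogous experiment on a real network would retrain from scratch at each scale.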