Great to see more work on (better) influence functions!
Lots of interesting things to discuss here[1], but one thing I would like to highlight: classical IFs indeed arise from the usual implicit-function-theorem + global-minimum argument (an assumption that is obviously violated in the context of DL), but they also arise as the limit of unrolling as t→∞. What follows is more theoretical in nature, summarizing statements in Mlodozeniec et al.
Influence functions suffer from another shortcoming: they only use the final weights (as you are aware). So you might say that we shouldn’t do influence functions, but instead track a different counterfactual over training: “What if I added/removed a sample zm at time step t?” To do this, you can consider each SGD training step θt→θt+1 (or more generally a step of some optimizer like Adam), and approximate the Jacobian of that map, i.e. θ′t+1 ≈ θt+1 + At⋅(θ′t − θt) for a perturbed trajectory θ′t. Doing some calculus you end up with At = I − λt⋅Ht, where λt is the learning rate and Ht is the mini-batch Hessian at time step t.[2]
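As a sanity check on this linearization (my own toy example, not code from the paper): on a quadratic loss the step map is affine, so its Jacobian is exactly At = I − λt⋅Ht, and the linear prediction for a perturbed step is exact:

```python
import numpy as np

# Toy example (my construction): one SGD step on a quadratic loss
# L(theta) = 0.5 * theta^T H theta. The step map is affine, so its
# Jacobian is exactly A = I - lr * H.
rng = np.random.default_rng(0)
H = np.array([[2.0, 0.5], [0.5, 1.0]])   # stand-in (mini-batch) Hessian
lr = 0.1

def sgd_step(theta):
    grad = H @ theta                      # gradient of the quadratic loss
    return theta - lr * grad

theta = rng.normal(size=2)
delta = 1e-6 * rng.normal(size=2)         # small perturbation of the weights

# Compare the true change in theta_{t+1} with the linearized prediction A @ delta
A = np.eye(2) - lr * H
true_change = sgd_step(theta + delta) - sgd_step(theta)
print(np.allclose(true_change, A @ delta))  # True (exact for quadratics)
```

For a non-quadratic loss this only holds to first order in the perturbation, which is exactly the approximation being made.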
You can use this linear approximation of training steps to compute a new counterfactual (Eq. 57 in Mlodozeniec et al.). This can be formalized as a pair (θt, rt) of the weights θt and the response rt, which captures the counterfactual: θ′t(ϵ) ≈ θt + ϵ⋅rt, where θ′t(ϵ) is the counterfactual of adding the data point with weighting ϵ at time step t. Ok, without further ado, here is the result (Theorem 2 in Mlodozeniec et al.):
Under some assumptions on SGD (A1–A6 in the paper), as you continue training t→∞, you get a.s. convergence (θt, rt)→(θ∞, r∞), where θ∞ is a local minimum or a saddle point. Assume it is a local minimum; what is the limiting response r∞? It’s our beloved (pseudo-)inverse Hessian-vector product (IHVP) from classical IFs, well… up to directions in weight space that lie in the kernel of the Hessian.
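To make the pair (θt, rt) and its limit concrete, here is a toy sketch (my own construction; the recursion rt+1 = At⋅rt − λt⋅∇ℓm(θt) is how I understand the unrolling, and the specific H, g, lr values are made up). Because the added point’s loss is linear here, the linearization is exact, and after many steps rt lands on the classical IHVP:

```python
import numpy as np

# Toy sketch (my construction, not code from the paper): GD on a quadratic
# loss L(theta) = 0.5 theta^T H theta, counterfactually adding a point with
# linear loss l_m(theta) = g^T theta, weighted by eps. The response r_t is
# propagated alongside the weights with the linearized step A_t = I - lr * H.
H = np.array([[2.0, 0.5], [0.5, 1.0]])   # Hessian of the original loss
g = np.array([1.0, -1.0])                # gradient of the added point l_m
lr, eps = 0.1, 1e-2

theta = np.array([1.0, 2.0])             # original training run
theta_cf = theta.copy()                  # counterfactual run with eps * l_m added
r = np.zeros(2)                          # response, r_0 = 0

for t in range(2000):                    # long enough to approximate t -> infinity
    A = np.eye(2) - lr * H               # Jacobian of the training step
    r = A @ r - lr * g                   # response recursion from unrolling
    theta_cf = theta_cf - lr * (H @ theta_cf + eps * g)
    theta = theta - lr * (H @ theta)

# theta'_t(eps) = theta_t + eps * r_t (exact here because l_m is linear) ...
print(np.allclose(theta_cf, theta + eps * r))   # True
# ... and r_t converges to the classical IF answer, the (negative) IHVP:
print(np.allclose(r, -np.linalg.inv(H) @ g))    # True
```

The second check is just the fixed point of the recursion: r = (I − λH)r − λg implies Hr = −g, i.e. r∞ = −H⁻¹g, matching the theorem in the full-rank case.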
So to summarize, the upshot is that influence functions can actually be valid beyond the original statistical setup, if (1) we model training dynamics linearly, (2) we believe assumptions A1–A6 plus that we eventually end up in a local minimum, and (3) we care about the behaviour in the limit. These assumptions can and should be debated, but I find them more reasonable and interesting than the global minimum assumption.
And as a cherry on top, Theorem 3 shows that if you want to go from the Bayesian posterior p(w∣D) to the ϵ-perturbed p(w∣Dϵ), you can again use IFs: sampling from the perturbed distribution is approximated by sampling from the original distribution and adding the IF IHVP. Amongst linear approximations, this one is optimal (in a specific sense, in the low-temperature limit) with respect to the KL divergence.[3]
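A rough sketch of the simplest instance of this (my construction, a Gaussian special case rather than the paper’s general low-temperature statement): if the training loss is quadratic and the added ϵ⋅ℓm is linear, then shifting each posterior sample by the IF correction −ϵ⋅H⁻¹∇ℓm transports p(w∣D) exactly onto p(w∣Dϵ):

```python
import numpy as np

# Toy Gaussian sketch: posterior p(w|D) ~ exp(-L(w)/T) with L quadratic,
# so p(w|D) = N(0, T * H^{-1}). Adding eps * l_m(w) with l_m linear
# (gradient g) shifts the posterior mean by -eps * H^{-1} g and leaves the
# covariance unchanged, so the per-sample IHVP shift is exact here.
rng = np.random.default_rng(0)
H = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
T_temp, eps, n = 0.1, 0.05, 200_000

cov = T_temp * np.linalg.inv(H)
samples = rng.multivariate_normal(np.zeros(2), cov, size=n)  # draws from p(w|D)
shifted = samples - eps * np.linalg.inv(H) @ g               # add the IF IHVP

# The true perturbed posterior has mean -eps * H^{-1} g (same covariance);
# the shifted samples match it up to Monte Carlo error.
print(np.allclose(shifted.mean(axis=0), -eps * np.linalg.inv(H) @ g, atol=1e-2))
```

In the general (non-Gaussian, low-temperature) setting this is of course only a first-order approximation, which is where the KL-optimality statement does the work.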
More generally, I think this paper makes an important point that goes beyond any of the technical details above: we want our counterfactual estimates to be more robust to randomness in training. But that’s for another time.
[1] e.g. I am not sure if I agree regarding the dataset vs. model size tradeoff, but maybe we have slightly different applications in mind :)
[2] A small upshot here is that we get a natural damping which mitigates the degeneracy of the Hessian.
[3] I would be curious to understand how this compares to the relationship you present in Appendix A.