Compute the small change in data dx which would induce a small change in trained parameter values dθ along each of the narrowest directions of the ridge in the loss landscape (i.e. the eigenvectors of the Hessian with the largest eigenvalues).
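Concretely, one way that computation could be sketched (a hypothetical PyTorch toy example, not part of the proposal itself: the model, the data, the power-iteration eigenvector estimate, and the first-order proxy for dx are all illustrative assumptions):

```python
# Hypothetical sketch of the proposed experiment on a toy classifier.
# The model, data, and the first-order dx proxy below are illustrative
# assumptions, not details from the proposal itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(256, 10, requires_grad=True)  # toy "dataset"
y = torch.randint(0, 3, (256,))
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 3))
params = list(model.parameters())
n_params = sum(p.numel() for p in params)

def flat_grad():
    # Differentiable, flattened gradient of the loss w.r.t. all parameters.
    loss = F.cross_entropy(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

# Power iteration for the top Hessian eigenvector v (largest eigenvalue),
# using Hessian-vector products so the Hessian is never materialized.
v = torch.randn(n_params)
v /= v.norm()
for _ in range(50):
    Hv = torch.autograd.grad(flat_grad() @ v, params)
    v = torch.cat([h.reshape(-1) for h in Hv]).detach()
    v /= v.norm()

# First-order proxy for dx: the data-space gradient of the component of the
# parameter gradient along v, i.e. M^T v with M = d^2 L / dtheta dx.  (The
# exact dx inducing dtheta along v would need a pseudoinverse solve of
# M dx = -lambda * v; this just gives the natural candidate direction.)
dx = torch.autograd.grad(flat_grad() @ v, X)[0]
```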
I've been thinking about what results this experiment would yield (have been too lazy to actually perform the experiment myself). You've probably already performed the experiment, so my theorizing here probably isn't useful to you, but I thought I should bring it up anyway, so you can correct my theorizing if wrong/so other people can learn from it.
I believe this dx would immediately bring you "off the data manifold", unless perhaps the network has been trained to be very robust.
For instance, the first eigenvector of the Hessian probably represents the average output of the model, but if e.g. your model is an image classifier and all the images in the dataset have a white background, then rather than just using the network's built-in bias parameters to control the average output, it could totally decide to just pick a random combination of those white pixels and use them for the intercept. But there's no reason two different networks are going to use the same combination, since it's a massively underspecified problem, so this dx won't generalize to other networks.
I did try it on a simple MNIST classifier. The main result was that all effects were dominated by a handful of misclassified or barely-correctly-classified data points, and the phenomenon I originally hypothesized just wasn't super relevant.
Since then, I've also tried a different kind of experiment to translate interpretable features across nets, this time on a simple generative model. Basically, the experiment just directly applied the natural abstraction hypothesis to the image-distributions produced by nets trained on the same data (using a first-order approximation). That one worked a lot better, but didn't really connect to peak breadth or even say much about network internals in general.
I did try it on a simple MNIST classifier. The main result was that all effects were dominated by a handful of misclassified or barely-correctly-classified data points, and the phenomenon I originally hypothesized just wasn't super relevant.
Ah, I had been thinking that this method would weight these sorts of data points highly, but I wasn't sure how critical it would be. I've assumed it would be possible to reweight things to focus on a better distribution of data points, because it seems like there would be some very mathematically natural ways of doing this reweighting. Is this something you've experimented with?
… I suppose it may make more sense to do this reweighting for my purposes than for yours.
Since then, I've also tried a different kind of experiment to translate interpretable features across nets, this time on a simple generative model. Basically, the experiment just directly applied the natural abstraction hypothesis to the image-distributions produced by nets trained on the same data (using a first-order approximation).
When you say "directly applied", what do you mean?
That one worked a lot better, but didn't really connect to peak breadth or even say much about network internals in general.
Saying much about network internals seems as difficult as ever. I get the impression that these methods can't really do it, because they're too local: they can say something about how the network behaves on the data manifold, but networks that are internally very different can behave the same on the data manifold, so these methods can't distinguish those networks.
Meta: I'm going through a backlog of comments I never got around to answering. Sorry it took three months.
I've assumed it would be possible to reweight things to focus on a better distribution of data points, because it seems like there would be some very mathematically natural ways of doing this reweighting. Is this something you've experimented with?
Something along those lines might work; I didn't spend much time on it before moving to a generative model.
When you say "directly applied", what do you mean?
The actual main thing I did was to compute the SVD of the Jacobian of a generative network's output (i.e. the image) with respect to its input (i.e. the latent vector). Results of interest (a rough code sketch follows the two points below):
Conceptually, near-zero singular values indicate directions-in-image-space in which no small latent change will move the image, i.e. locally-inaccessible directions. Conversely, large singular values indicate "degrees of freedom" in the image. Relevant result: if I take two different trained generative nets, and find latents for each such that they both output approximately the same image, then they both roughly agree on which directions-in-image-space are local degrees of freedom.
By taking the SVD of the Jacobian of a chunk of the image with respect to the latent, we can figure out which directions-in-latent-space that chunk of image is locally sensitive to. And then, a rough local version of the natural abstraction hypothesis would say that nonadjacent chunks of image should strongly depend on the same small number of directions-in-latent-space, and be "locally independent" (i.e. not highly sensitive to the same directions-in-latent-space) given those few. And that was basically correct.
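In code, the two checks look roughly like this (a hypothetical PyTorch sketch: the toy decoder, image size, chunk boundaries, and subspace-overlap measure are all stand-in assumptions, not my actual setup):

```python
# Hypothetical sketch of the Jacobian-SVD experiments on a toy decoder.
# The decoder G, shapes, and chunking below are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, image_dim = 16, 28 * 28
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.Tanh(),
                  nn.Linear(128, image_dim))
z = torch.randn(latent_dim)

# Jacobian of the image w.r.t. the latent: shape (image_dim, latent_dim).
J = torch.autograd.functional.jacobian(G, z)
U, S, Vh = torch.linalg.svd(J, full_matrices=False)
# Columns of U with large singular values: local degrees of freedom in
# image space; near-zero singular values: locally-inaccessible directions.
# The cross-net check compares these U columns across two nets evaluated
# at latents producing approximately the same image.

def chunk_directions(J, idx, k=3):
    # Top-k right-singular vectors of the chunk's rows of the Jacobian,
    # i.e. the latent directions that chunk is most sensitive to.
    _, _, Vh_c = torch.linalg.svd(J[idx], full_matrices=False)
    return Vh_c[:k]

top = chunk_directions(J, slice(0, 200))       # one chunk of pixels
bottom = chunk_directions(J, slice(584, 784))  # a nonadjacent chunk
# Singular values of the product of the two orthonormal bases are the
# cosines of the principal angles between the chunks' dominant latent
# subspaces; values near 1 mean both chunks depend on roughly the same
# few latent directions.
overlap = torch.linalg.svd(top @ bottom.T).S
print(overlap)
```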
To be clear, this was all "rough heuristic testing", not really testing predictions carefully derived from the natural abstraction framework.