Do you have thoughts on when there are two algorithms that aren’t “doing the same thing” that fall within the same loss basin?
It seems like there could be two substantially different algorithms which can be linearly interpolated between with no increase in loss. For example, the model is trained to classify fruit types and ripeness. One module finds the average color of a fruit (in an arbitrary basis), and another module uses this to calculate fruit type and ripeness. The basis in which color is expressed can be arbitrary, since the second module can compensate.
Here, there are degrees of freedom in specifying the color basis and parameters can probably be eliminated, but it would be more interesting to see examples where two semantically different algorithms fall within the same basin without removable degrees of freedom, either because the Hessian has no zero eigenvalues, or because parameters cannot be removed despite the Hessian having a zero eigenvalue.
From this paper, “Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)”
So for overparameterized nets, the answer is probably:
There is only one solution manifold, so there are no separate basins. Every solution is connected.
We can salvage the idea of “basin volume” as follows:
In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.
In the dimensions parallel to the manifold, ask “how can I move before it stops being the ‘same function’?”. If we define “sameness” as “same behavior on the validation set”,[1] then this means looking at the Jacobian of that behavior in the plane of the manifold.
Multiply the two hypervolumes to get the hypervolume of our “basin segment” (very roughly, the region of the basin which drains to our specific model)
Do you have thoughts on when there are two algorithms that aren’t “doing the same thing” that fall within the same loss basin?
It seems like there could be two substantially different algorithms which can be linearly interpolated between with no increase in loss. For example, the model is trained to classify fruit types and ripeness. One module finds the average color of a fruit (in an arbitrary basis), and another module uses this to calculate fruit type and ripeness. The basis in which color is expressed can be arbitrary, since the second module can compensate.
Here, there are degrees of freedom in specifying the color basis and parameters can probably be eliminated, but it would be more interesting to see examples where two semantically different algorithms fall within the same basin without removable degrees of freedom, either because the Hessian has no zero eigenvalues, or because parameters cannot be removed despite the Hessian having a zero eigenvalue.
From this paper, “Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)”
So for overparameterized nets, the answer is probably:
There is only one solution manifold, so there are no separate basins. Every solution is connected.
We can salvage the idea of “basin volume” as follows:
In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.
In the dimensions parallel to the manifold, ask “how can I move before it stops being the ‘same function’?”. If we define “sameness” as “same behavior on the validation set”,[1] then this means looking at the Jacobian of that behavior in the plane of the manifold.
Multiply the two hypervolumes to get the hypervolume of our “basin segment” (very roughly, the region of the basin which drains to our specific model)
There are other “sameness” measures which look at the internals of the model; I will be proposing one in an upcoming post.