I am curious about your statement that all large neural networks are isomorphic or nearly isomorphic and therefore have identical loss values. This should not be too hard to test.
Let A, B, C be training data sets and let M, N be neural networks. First train M on A and N on B. Then slowly switch the training sets, so that we eventually train both M and N on just C. After fully training M and N, one should be able to find an isomorphism between the networks M and N (here I assume that M and N are designed properly, so that such an isomorphism can exist) under which the value of each node in M can be perfectly computed from the nodes of N. Furthermore, for every possible input, the networks M and N should give exactly the same output. If this experiment does not work, then one should be able to set up another experiment that does.
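To make "find an isomorphism" concrete for the simplest case: for two small ReLU networks, a node isomorphism is a permutation of hidden units that maps one set of weights onto the other while leaving the computed function unchanged. Here is a minimal numpy sketch; to keep it self-contained, the second network is constructed as a permuted copy of the first rather than independently trained, so the matching is exact by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer ReLU network: y = W2 @ relu(W1 @ x).
d_in, d_hid, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hid, d_in))
W2 = rng.normal(size=(d_out, d_hid))

# Second network = first network with its hidden units permuted
# (permute rows of W1 and the matching columns of W2).
perm = rng.permutation(d_hid)
W1b, W2b = W1[perm], W2[:, perm]

def forward(w1, w2, x):
    return w2 @ np.maximum(w1 @ x, 0.0)

# Identical input/output behavior despite different weight matrices.
x = rng.normal(size=d_in)
assert np.allclose(forward(W1, W2, x), forward(W1b, W2b, x))

# Recover the hidden-unit matching from pairwise distances between
# first-layer weight rows; match[i] is the unit of net B aligned
# with unit i of net A.
cost = np.linalg.norm(W1[:, None, :] - W1b[None, :, :], axis=-1)
match = cost.argmin(axis=1)
assert np.allclose(W1b[match], W1)
assert np.allclose(W2b[:, match], W2)
print("recovered matching:", match.tolist())
```

For independently trained networks the cost matrix would instead be built from something like correlations of hidden-unit activations over a probe data set, and the resulting match would generally be approximate rather than exact, which is exactly what the proposed experiment would measure.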
I have personally trained many ML systems for my cryptocurrency research where, after training two systems on the exact same data but with independent random initializations, the fitness levels differ only by a floating-point error of about 10^-13, and I am able to find an exact isomorphism between the two systems (sometimes they are exactly the same and no isomorphism is needed). But I designed these ML systems to satisfy these properties, along with others, and I have not seen this behavior in neural networks. In fact, the property of attaining the exact same fitness level is rather fragile.
I found a Bachelor’s thesis (people should read these occasionally; I apologize for selecting a thesis from Harvard) where someone tried to find an isomorphism among 1000 small trained machine learning models, and no such isomorphism was found.
https://dash.harvard.edu/bitstream/handle/1/37364688/SORENSEN-SENIORTHESIS-2020.pdf?sequence=1
Or maybe one can find a more complicated isomorphism between neural networks, since a node permutation is rather simplistic.
I gather node permutation is only one of the symmetries involved: besides discrete symmetries like permutations, there are continuous ones, such as shifting sets of parameters in ways that produce equivalent network outputs.
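One well-known concrete example of such a continuous symmetry in ReLU networks is per-unit positive rescaling: multiply a hidden unit's incoming weights by some c > 0 and divide its outgoing weights by c, and the network computes exactly the same function, since relu(c·z) = c·relu(z) for c > 0. A small numpy check, assuming a bias-free two-layer network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bias-free two-layer ReLU network: y = W2 @ relu(W1 @ x).
d_in, d_hid, d_out = 5, 6, 2
W1 = rng.normal(size=(d_hid, d_in))
W2 = rng.normal(size=(d_out, d_hid))

def forward(w1, w2, x):
    return w2 @ np.maximum(w1 @ x, 0.0)

# Scale each hidden unit's incoming weights by c_i > 0 and its
# outgoing weights by 1/c_i; relu's positive homogeneity means the
# overall function is unchanged for every choice of the c_i, giving
# a continuous family of weight settings with identical loss.
c = rng.uniform(0.5, 2.0, size=d_hid)
W1s = c[:, None] * W1
W2s = W2 / c[None, :]

x = rng.normal(size=d_in)
assert np.allclose(forward(W1, W2, x), forward(W1s, W2s, x))
print("rescaled network matches original on a random input")
```

Because the c_i vary continuously, no finite search over permutations can account for this family; any isomorphism-finding procedure between two trained ReLU networks would need to quotient out these rescalings as well.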
As I understand it (and I’m still studying this stuff), Singular Learning Theory predicts that there are large sets of local minima, each set internally isomorphic and therefore sharing the same loss value (modulo rounding errors, or not having quite settled to the optimum). But SLT also predicts that there are generically multiple such sets, whose loss values differ. The minima whose network representation is simplest (i.e., with the lowest Kolmogorov complexity when expressed in this NN architecture) will have the largest symmetry group, so they are the easiest to find: they are the most numerous and most densely packed in the space of all NN configurations. So we get Occam’s Razor for free. However, the minima with the best loss values will typically be larger/more complex, and so harder to locate.

That is about my current level of understanding of SLT, but I gather that with a large enough training set and suitable annealing of the SGD learning metaparameters one can avoid settling into a worse low-complexity minimum and attempt to find a better one, thus improving the resulting loss, and there is some theoretical mathematical understanding of how well one can expect to do based on training set size.