As someone who’s been working in the ML field for ~5 years: five years ago there were pieces of common folk-wisdom about training AI that everyone knew but most people found puzzling. For example: “the loss functions of large neural nets have a great many local minima, but they all seem to reach about the same low loss, so getting trapped in a local minimum isn’t actually a significant problem, especially for over-parameterized networks.” These now have a simple, intuitive explanation from Singular Learning Theory: many of those local minima are related by a large number of discrete and continuous symmetries, so they have identical loss values. On top of that, we now understand why some of them have slightly better or worse loss values, under which circumstances a model will settle at which of them, and how that relates to Occam’s Razor, training set size, generalization, internal symmetries, and so forth. We have made some significant advances in theoretical understanding. So a field that used to be like Alchemy, almost entirely consisting of unconnected facts and folklore discovered by trial and error, is starting to turn into something more like Chemistry, with some solid theoretical underpinnings. Yes, there’s still quite a ways to go, and yes, timelines look short for AI Safety, so I really wish we had more understanding as soon as possible. However, while “we know very little about this” was accurate a few years ago, it has since become an understatement.
I am curious about your statement that all large neural networks are isomorphic or nearly isomorphic and therefore have identical loss values. This should not be too hard to test.
Let A, B, C be training data sets and let M, N be neural networks. First train M on A and N on B. Then slowly switch the training sets, so that eventually both M and N are being trained on just C. After fully training M and N, one should be able to find an isomorphism between the networks M and N (here I assume that M and N are designed properly so that such an isomorphism can exist), so that the value of each node in M can be perfectly computed from the nodes of N. Furthermore, for every possible input, the networks M and N should give exactly the same output. If this experiment does not work, then one should be able to set up another experiment that does.
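A minimal sketch of the kind of isomorphism this experiment is looking for, using only the simplest symmetry (a hidden-unit permutation of a one-hidden-layer ReLU network, built by hand rather than recovered after training, so this illustrates the notion of isomorphism rather than the full experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer ReLU network: f(x) = W2 @ relu(W1 @ x + b1) + b2
def forward(params, x):
    W1, b1, W2, b2 = params
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

# Network M: 4 inputs, 8 hidden units, 2 outputs (sizes are arbitrary).
M = (rng.normal(size=(8, 4)), rng.normal(size=8),
     rng.normal(size=(2, 8)), rng.normal(size=2))

# Network N: the same network with its hidden units permuted.
# Permuting the rows of W1/b1 and the matching columns of W2
# relabels the hidden nodes without changing the function computed.
perm = rng.permutation(8)
W1, b1, W2, b2 = M
N = (W1[perm], b1[perm], W2[:, perm], b2)

# For every input, M and N compute the same output
# (up to floating-point summation order).
x = rng.normal(size=4)
assert np.allclose(forward(M, x), forward(N, x))
```

The experiment described above would instead train M and N independently and then search for such a permutation (or a more general map) relating them after the fact.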
I have personally trained many ML systems for my cryptocurrency research where, after training two systems on the exact same data but with independent random initializations, the fitness levels differ only by a floating point error of about 10^-13, and I am able to find an exact isomorphism between these systems (sometimes they are exactly the same and I do not need to find any isomorphism). But I designed these ML systems to satisfy these properties, among others, and I have not seen this with neural networks. In fact, the property of attaining the exact same fitness level is a bit fragile.
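A convex toy analogue of this behavior (my own illustrative assumption, not the systems described above): linear least squares has a unique global optimum, so gradient descent from two independent random initializations reaches fitness levels that agree to roughly floating-point precision:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic least-squares problem: unique global minimum, so every
# well-converged run must land on the same loss regardless of init.
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

def train(w, steps=20000, lr=0.01):
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

loss = lambda w: np.mean((X @ w - y) ** 2)

w_a = train(rng.normal(size=3))  # run 1: random init
w_b = train(rng.normal(size=3))  # run 2: different random init

# The two fitness levels agree up to tiny floating-point error.
print(abs(loss(w_a) - loss(w_b)))
```

For non-convex neural-network losses this guarantee disappears, which is presumably why the exact-same-fitness property is fragile there.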
I found a Bachelor’s thesis (people should read these occasionally; I apologize for selecting a thesis from Harvard) where someone tried to find an isomorphism between 1000 small trained machine learning models, and no such isomorphism was found.
https://dash.harvard.edu/bitstream/handle/1/37364688/SORENSEN-SENIORTHESIS-2020.pdf?sequence=1
Or maybe one can find a more complicated isomorphism between neural networks since a node permutation is quite oversimplistic.
I gather node permutation is only one of the symmetries involved: they include both discrete symmetries like permutations and continuous ones, such as shifting sets of parameters in ways that leave the network’s outputs unchanged.
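One concrete continuous symmetry (a standard example for ReLU networks, not something specific to the discussion above): scaling a hidden unit’s incoming weights by any α > 0 and its outgoing weights by 1/α leaves the network’s function unchanged, because relu(αz) = α·relu(z) for α > 0:

```python
import numpy as np

rng = np.random.default_rng(1)

# One-hidden-layer ReLU network: f(x) = W2 @ relu(W1 @ x + b1) + b2
def forward(params, x):
    W1, b1, W2, b2 = params
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

W1 = rng.normal(size=(8, 4)); b1 = rng.normal(size=8)
W2 = rng.normal(size=(2, 8)); b2 = rng.normal(size=2)

# Continuous rescaling symmetry: per hidden unit, multiply incoming
# weights and bias by alpha > 0, divide outgoing weights by alpha.
alpha = rng.uniform(0.5, 2.0, size=8)
W1s = W1 * alpha[:, None]
b1s = b1 * alpha
W2s = W2 / alpha[None, :]

x = rng.normal(size=4)
out_a = forward((W1, b1, W2, b2), x)
out_b = forward((W1s, b1s, W2s, b2), x)
assert np.allclose(out_a, out_b)  # same function, different parameters
```

Since α can vary continuously, each parameter setting sits on a whole continuous manifold of functionally identical networks, which is one reason simple node-permutation matching between trained networks is too crude.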
As I understand it (and I’m still studying this stuff), the prediction from Singular Learning Theory is that there are large sets of local minima, each set internally isomorphic, so that every minimum in a set has the same loss value (modulo rounding errors, or not having quite settled to the optimum). But SLT also predicts that there are generically multiple such sets, whose loss values are not the same. The sets whose network representation is simplest (i.e. with the lowest Kolmogorov complexity when expressed in this NN architecture) have the largest symmetry group, so they are the easiest to find: they are the most numerous and the most densely packed in the space of all NN configurations. So we get Occam’s Razor for free. However, the minima with the best loss values will typically be larger/more complex, so harder to locate. That is about my current level of understanding of SLT, but I gather that with a large enough training set and suitable annealing of the SGD learning metaparameters, one can avoid settling into a worse lower-complexity minimum and attempt to find a better one, thus improving the loss, and there is some theoretical mathematical understanding of how well one can expect to do based on training set size.