This is an interesting idea. I suspect it is also related to the way linearity increases with scale and with generalization ability. If you have a memorised solution, then nonlinear representations are fine, because you can easily tune the ‘boundaries’ of the nonlinear representation to precisely delineate the datapoints. (In fact, the nonlinearity of the representation can be used to strongly reduce interference when memorising, as is done in the recent research on modern Hopfield networks.) On the other hand, if you require reasonably large-scale smoothness of the solution space, as you would expect from a generalising solution in a flat basin, then this cannot work: you need to accept interference between nearly orthogonal features as the cost of preserving generalisation of the behaviour across the many different inputs which activate the same vector.
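To make the interference point concrete, here is a minimal toy sketch. It is my own illustration, not code from any of the papers mentioned. It stores random ±1 patterns and compares one retrieval step of a classical Hopfield update, where every stored pattern contributes linearly, with a softmax-weighted update in the style of modern Hopfield networks. The function names, the sizes, the number of flipped bits and `beta` are all assumptions chosen for illustration.

```python
import numpy as np

def retrieve_linear(patterns, query):
    """One update step of a classical (Hebbian) Hopfield network."""
    overlaps = patterns @ query       # similarity of the query to each stored pattern
    field = patterns.T @ overlaps     # every pattern contributes in proportion to its overlap
    return np.where(field >= 0, 1, -1)

def retrieve_softmax(patterns, query, beta=1.0):
    """One update step with a softmax over overlaps (modern-Hopfield style)."""
    overlaps = beta * (patterns @ query)
    weights = np.exp(overlaps - overlaps.max())   # subtract the max for numerical stability
    weights /= weights.sum()
    field = patterns.T @ weights      # dominated by the best-matching pattern
    return np.where(field >= 0, 1, -1)

def count_errors(retrieve, patterns, n_flips, rng):
    """Corrupt each stored pattern, retrieve once, and count wrong bits in total."""
    total = 0
    for target in patterns:
        query = target.copy()
        flip = rng.choice(target.size, size=n_flips, replace=False)
        query[flip] *= -1
        total += int(np.sum(retrieve(patterns, query) != target))
    return total

rng = np.random.default_rng(0)
n_features, n_patterns = 64, 40       # assumed sizes, far above the classical capacity
patterns = rng.choice([-1, 1], size=(n_patterns, n_features))

linear_errors = count_errors(retrieve_linear, patterns, n_flips=6, rng=rng)
softmax_errors = count_errors(retrieve_softmax, patterns, n_flips=6, rng=rng)
print("bit errors, linear update: ", linear_errors)
print("bit errors, softmax update:", softmax_errors)
```

With far more patterns than the classical capacity of roughly 0.14 times the number of features, the linear update should make many bit errors because of crosstalk between patterns, while the softmax update should suppress almost all of it. That is the sense in which a sharp nonlinearity buys memorisation at the cost of the smooth, shared response a linear readout gives.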
Yes, it makes a lot of sense that linearity would come hand in hand with generalization. I’d recently been reading Krotov on non-linear Hopfield networks but hadn’t made the connection. They say that they’re planning on using them to create more theoretically grounded transformer architectures, and your comment makes me think that these wouldn’t succeed, but then the article also says:
This idea has been further extended in 2017 by showing that a careful choice of the activation function can even lead to an exponential memory storage capacity. Importantly, the study also demonstrated that dense associative memory, like the traditional Hopfield network, has large basins of attraction of size O(N_f). This means that the new model continues to benefit from strong associative properties despite the dense packing of memories inside the feature space.
which perhaps means that they are also able to find good linear representations and to mix generalization and memorization, like a transformer?