Singular learning theory, or something like it, is probably a necessary foundational tool here. It doesn’t directly answer the core question about how environment structure gets represented in the net, but it does give us the right mental picture for thinking about things being “learned by the net” at all. (Though if you just want to understand the mental picture, this video is probably more helpful than reading a bunch of SLT.)
I think this is probably wrong. Vanilla SLT describes a toy case of how Bayesian learning on neural networks works. I think there is a big difference between Bayesian learning, which requires visiting every single point in the loss landscape and trying each of them out on every data point, and local learning algorithms, such as evolution, stochastic gradient descent, AdamW, etc., which try to find a good solution using information from just a small number of local neighbourhoods in the loss landscape. Those local learning algorithms are the ones I’d expect to be used by real minds, because they’re much more compute-efficient.
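To make the contrast concrete, here’s a deliberately tiny sketch of the two regimes on a one-parameter regression problem. The task, the grid-based stand-in for the Bayesian posterior average, and all the constants are arbitrary illustrative choices of mine; the only point is that the first approach has to evaluate every candidate parameter value on the data, while SGD only ever queries one small neighbourhood of the landscape per step.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=100)
ys = 2.0 * xs + rng.normal(scale=0.1, size=100)   # toy 1-parameter regression task

def loss(w):
    """Mean squared error of the single-parameter model y = w * x on the whole dataset."""
    return np.mean((w * xs - ys) ** 2)

# "Bayesian" learning in miniature: weigh EVERY candidate parameter value by how
# well it fits the data (a crude grid stand-in for averaging over the posterior).
grid = np.linspace(-5.0, 5.0, 2001)
weights = np.exp(-len(xs) * np.array([loss(w) for w in grid]))
w_bayes = np.sum(grid * weights) / np.sum(weights)

# Local learning in miniature: SGD only ever uses gradient information from the
# current point, i.e. one small neighbourhood of the landscape per update step.
w = 0.0
for _ in range(200):
    i = rng.integers(len(xs))
    grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]
    w -= 0.05 * grad

print(w_bayes, w)   # both land near the true slope of 2.0
```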
I think this locality property matters a lot. It introduces additional, important constraints on what nets can feasibly learn. It’s where path dependence in learning comes from. I think vanilla SLT was probably a good tutorial for us before delving into the more realistic and complicated local learning case, but there’s still work to do before we have a model of how nets learn things that is even roughly accurate.
If a solution consists of 1000 internal pieces of machinery that need to be arranged exactly right to do anything useful at all, a local algorithm will need something like $O(e^{1000c})$ update steps to learn it.[1] In other words, it won’t do better than a random walk that aimlessly wanders around the loss landscape until it runs into a point with low loss by sheer chance. But if a solution with 1000 internal pieces of machinery can instead be learned in small chunks that each individually decrease the loss a little bit, the leading term in the number of update steps required to find that solution scales exponentially with the size of the single biggest solution chunk, rather than with the size of the whole solution. So, if the biggest chunk has size 50, the total learning time will be around $O(e^{50c})$.[2]
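As a back-of-the-envelope illustration of this scaling claim, here’s a toy calculation under an assumption of my own: each piece of machinery is guessed correctly with probability $e^{-c}$ per update step, so a chunk of size $k$ takes roughly $e^{ck}$ steps of blind search to find in expectation. The chunk sizes and the value of `c` below are arbitrary, and the model ignores everything about real loss landscapes except the exponential dependence on chunk size.

```python
import math

# Toy model (an illustrative assumption, not a result from SLT): each un-learned
# piece of machinery is guessed correctly with probability exp(-c) per update step,
# so a chunk of size k is found after ~exp(c * k) steps in expectation.
c = 0.1  # arbitrary constant standing in for the "c" in footnote [1]

def expected_steps(chunk_sizes):
    """Expected update steps if the chunks are found one after another by blind search."""
    return sum(math.exp(c * k) for k in chunk_sizes)

monolithic = expected_steps([1000])      # all 1000 pieces at once
chunked    = expected_steps([50] * 20)   # same 1000 pieces, in chunks of 50

print(f"monolithic: ~{monolithic:.2e} steps")  # ~2.69e+43
print(f"chunked:    ~{chunked:.2e} steps")     # ~2.97e+03
```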
For an example where the solution cannot be learned in chunks like this, see the subset parity learning problem, where SGD really does need a number of update steps exponential in the effective parameter count of the whole solution to learn it, which for most practical purposes means it cannot learn the solution at all.
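For readers who haven’t seen it, here is a minimal sketch of the subset parity task itself (the input size, subset size, and seed are arbitrary choices of mine): the label is the XOR of the bits in a hidden subset of the input coordinates, and gradient-based training on this data is the setting where the exponential step count bites.

```python
import numpy as np

rng = np.random.default_rng(0)

n_bits = 50                                          # input dimension (arbitrary)
secret = rng.choice(n_bits, size=10, replace=False)  # hidden subset S (arbitrary)

def make_batch(n_samples):
    """Inputs are random bit strings; the label is the parity (XOR) of the bits indexed by S."""
    x = rng.integers(0, 2, size=(n_samples, n_bits))
    y = x[:, secret].sum(axis=1) % 2
    return x, y

x, y = make_batch(1024)
# Training a standard net on (x, y) with SGD is the regime where the number of
# update steps is expected to grow exponentially in the size of S.
```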
For a net to learn a big and complicated solution with a high Local Learning Coefficient (LLC), it needs a learning story to find the solution’s basin in the loss landscape in a feasible timeframe. It can’t just rely on random walking; that takes too long. The expected total time it takes the net to get to a basin is, I think, determined mostly by the dimensionality of the mode connections from that basin to the rest of the landscape, not just by the dimensionality of the basin itself, as would be the case for the sort of global, Bayesian learning modelled by vanilla SLT. The geometry of those connections is the core mathematical object that reflects the structure of the learning process and determines the learnability of a solution.[3] Learning a big solution chunk that increases the total LLC by a lot in one go means needing to find a very low-dimensional mode connection to traverse. This takes a long time, because the connection’s interface is very small compared to the size of the search space. To learn a smaller chunk that increases the total LLC by less, the net only needs to reach a higher-dimensional mode connection, which has an exponentially larger interface and is thus exponentially quicker to find.[4]
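Here’s a crude Monte Carlo of the “interface size” intuition, under a toy model I’m making up purely for illustration: proposals are uniform points in a box, and “reaching the connection” means landing in a window of width $\varepsilon$ along $k$ constrained directions, so the hit probability is $\varepsilon^k$ and the expected search time is $(1/\varepsilon)^k$, exponential in the codimension. None of this is real loss-landscape geometry; it only shows how fast an interface shrinks as you constrain more directions.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1   # "width" of the connection along each constrained direction (arbitrary)
d = 20      # ambient dimension of the toy search space (arbitrary)

def proposals_until_hit(codim):
    """Draw uniform points in [0,1]^d until the first `codim` coordinates all land
    in a window of width eps around 0.5; return how many draws that took."""
    steps = 0
    while True:
        steps += 1
        x = rng.uniform(size=d)
        if np.all(np.abs(x[:codim] - 0.5) < eps / 2):
            return steps

for codim in range(1, 4):
    trials = [proposals_until_hit(codim) for _ in range(200)]
    # empirical mean should sit roughly at the predicted (1/eps)**codim
    print(f"codim={codim}: mean ~{np.mean(trials):.0f}, predicted {(1/eps)**codim:.0f}")
```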
I agree that vanilla SLT seems like a useful tool for developing the right mental picture of how nets learn things, but it is not itself that picture. The simplified Bayesian learning case is instructive for illuminating the connection between learning and loss landscape geometry in the most basic setting, but taken on its own it still fails to capture a lot of the structure of learning in real minds.
[1] Where $c$ is some constant which probably depends on the details of the update algorithm.
[2] I’m not going to add “I think” and “I suspect” to every sentence in this comment, but you should imagine them being there. I haven’t actually worked this out in math properly or tested it.
[3] At least for a specific dataset and architecture. Modelling how the geometry of the loss landscape changes if we allow the dataset and architecture to vary based on the mind’s own decisions as it learns might be yet another complication we’ll need to deal with in the future, once we start thinking about theories of learning for RL agents with enough freedom and intelligence to pick their learning curricula themselves.
[4] To get the rough idea across I’m focusing here on the very basic case where the “chunks” are literal pieces of the final solution and each of them lowers the loss a little and increases the total LLC a little. In general, this doesn’t have to be true though. For example, a solution D with effective parameter count 120 might be learned by first learning independent chunks A and B, each with effective parameter count 50, then learning a chunk C with effective parameter count 30 which connects the formerly independent A and B together into a single mechanistic whole to form solution D. The expected number of update steps in this learning story would be $\approx e^{50c}+e^{50c}+e^{30c}=O(e^{50c})$.
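Plugging the same toy per-chunk cost $e^{ck}$ from the earlier sketch (with the same arbitrary $c$) into this A/B/C story gives a quick sanity check that the total is dominated, up to a small constant factor, by the biggest chunk:

```python
import math

c = 0.1  # same arbitrary illustrative constant as in the earlier toy model
chunks = {"A": 50, "B": 50, "C": 30}              # effective parameter counts
total = sum(math.exp(c * k) for k in chunks.values())
biggest = math.exp(c * max(chunks.values()))

print(f"total expected steps: ~{total:.1f}")      # ~316.9
print(f"biggest chunk alone:  ~{biggest:.1f}")    # ~148.4
print(f"ratio: {total / biggest:.2f}")            # ~2.14, a small constant factor
```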