After reading your sequence today, one additional hypothesis came to mind, and I would like to make the case for it (note that my knowledge of ML is very limited, so there is a good chance that I am simply confused):
Claim: Noise favours modular solutions compared to non-modular ones.
What makes me think that? You mention in Ten experiments that “We have some theories that predict modular solutions for tasks to be on average broader in the loss function landscape than non-modular solutions” and propose to test this experimentally. If this is a true property of a whole model, then it will also (typically) hold for modules at all size scales. Plausibly, the presence of noise creates an incentive to factor the problem into sub-problems, which can then be solved by modules in a more noise-robust fashion (i.e. at a broader optimum). Since this holds at all size scales, noise creates a bias towards modularity (among other things).
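To make the broadness intuition concrete, here is a minimal sketch (my own construction, not anything from the sequence) of the kind of measurement I have in mind: perturb a model’s weights with Gaussian noise and check how much the loss degrades. A broader optimum should show a smaller gap between the clean and the noisy loss. The model, data, and noise scale below are all placeholder assumptions.

```python
import torch
import torch.nn as nn

def loss_under_weight_noise(model, loss_fn, x, y, sigma=0.01, n_trials=20):
    """Average loss after adding N(0, sigma^2) noise to every weight."""
    losses = []
    originals = [p.detach().clone() for p in model.parameters()]
    for _ in range(n_trials):
        with torch.no_grad():
            # perturb every parameter with independent Gaussian noise
            for p in model.parameters():
                p.add_(sigma * torch.randn_like(p))
            losses.append(loss_fn(model(x), y).item())
            # restore the unperturbed weights before the next trial
            for p, orig in zip(model.parameters(), originals):
                p.copy_(orig)
    return sum(losses) / len(losses)

# Toy usage with a placeholder model and random data: a flatter optimum
# should show a smaller (noisy - clean) gap.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)
clean = nn.functional.mse_loss(model(x), y).item()
noisy = loss_under_weight_noise(model, nn.functional.mse_loss, x, y)
print(f"clean loss {clean:.4f}, noisy loss {noisy:.4f}")
```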
Is this different from the proposed modularity drivers you mention? I think so. In experiment 8 you do mention input noise, which is what made me think of my hypothesis. But I think that ‘local’ noise might push towards modules in all parts of the model via the mechanism above, which seems different from input noise.
Some further thoughts:
- even if this effect is real, I have no idea how strong it is
- noisy neurons actually seem somewhat related to connection costs to me, in that (for a suitable type of noise) receiving information from many inputs could become costly (see the fan-in sketch after this list)
- even if true, it might not make it any easier to actually train modular models. This effect should mostly “punish modular solutions less than non-modular ones” rather than actively help during training. A quick online search for “noisy neural network” indicated that these have indeed been studied and that performance does degrade; my first click mentioned the degrading performance and aimed to minimise it. However, I did not see non-biological results after adding “modularity” to the search (I didn’t try for long, though).
- this is now pure speculation, but when reading “large models tend towards modularity”, I wondered whether there is some relation to noise. Could something like the finite bit resolution of weights lead to an effective noise that becomes significant at sufficient model size? (The answer might well be an obvious no; the quantisation sketch below plays with this framing.)
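To illustrate the connection-cost bullet above: a small NumPy sketch (my construction; i.i.d. per-input noise and fixed per-connection weight magnitude are assumptions) showing that the noise a unit accumulates grows with its fan-in, so under this kind of noise, dense connectivity carries a cost much like an explicit connection cost.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1  # per-input noise std

for fan_in in [1, 10, 100, 1000]:
    w = np.ones(fan_in)  # fixed per-connection weight magnitude
    noise = sigma * rng.standard_normal((10_000, fan_in))
    pre_act_noise = noise @ w  # noise contribution to the pre-activation
    print(f"fan-in {fan_in:5d}: noise std {pre_act_noise.std():.4f}")

# The printed std grows roughly as sigma * sqrt(fan_in): each extra
# noisy input adds variance, so wide, unstructured connectivity is
# punished relative to sparse, modular connectivity.
```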
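And for the last bullet, a toy sketch of the round-off-as-noise framing: uniform quantisation of weights with step delta introduces an error that behaves like uniform noise with std delta/sqrt(12). The weight scale and bit widths below are illustrative assumptions; whether such noise ever becomes significant at scale is exactly the open question.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)  # stand-in for trained weights

for bits in [32, 16, 8, 4]:
    # uniform quantisation grid with step delta, sized for a [-4, 4] range
    delta = 8.0 / (2**bits)
    w_q = np.round(w / delta) * delta
    err = w_q - w  # round-off error, acting as an effective weight noise
    print(f"{bits:2d} bits: empirical error std {err.std():.2e}, "
          f"predicted delta/sqrt(12) = {delta / np.sqrt(12):.2e}")
```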