Regularization Causes Modularity Causes Generalization

Epistemic Status: Exploratory

Things That Cause Modularity In Neural Networks

Modularity is when a neural network can be easily split into several modules: groups of neurons that connect strongly with each other, but have weaker connections to outside neurons. What, empirically, makes a network become modular? Several things:

  • Filan et al.[1]:

    • Training a model with dropout

    • Weight pruning

    • L1/​L2 regularization

  • Kashtan & Alon: Switching between one objective function and a different (but related[2]) objective function every 20 generations

  • Clune et al.: Adding penalties for connections between neurons
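To make the definition at the top of this section concrete, here's a rough sketch of one common way to quantify modularity: treat a trained network as a weighted graph (neurons as nodes, absolute weight magnitudes as edge strengths), cluster it, and score how much connection strength stays inside clusters versus leaking out. This uses networkx's generic Newman-style modularity score on made-up weights; it's an illustrative toy, not the exact clusterability metric Filan et al. compute.

```python
# Toy sketch: quantify how modular a small network's weight graph is.
# Assumes numpy and networkx; layer sizes and weights are arbitrary stand-ins.
import numpy as np
import networkx as nx
from networkx.algorithms import community

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 8))   # hypothetical layer-1 -> layer-2 weights
W2 = rng.normal(size=(8, 4))   # hypothetical layer-2 -> layer-3 weights

G = nx.Graph()
# Nodes are (layer, index) pairs; edge weights are absolute weight magnitudes.
for (i, j), w in np.ndenumerate(W1):
    G.add_edge(("l1", i), ("l2", j), weight=abs(w))
for (j, k), w in np.ndenumerate(W2):
    G.add_edge(("l2", j), ("l3", k), weight=abs(w))

# Greedily group neurons into strongly-connected clusters, then score the
# partition: higher modularity = more weight inside clusters than chance predicts.
clusters = community.greedy_modularity_communities(G, weight="weight")
score = community.modularity(G, clusters, weight="weight")
print(f"{len(clusters)} clusters, modularity = {score:.3f}")
```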

Modularity Improves Generalization

What good is modularity? Both Clune et al. and Kashtan & Alon agree: more modular networks are more adaptable, making much more rapid progress towards their goals than their non-modular counterparts do.

Not only that, but their adaptability lets modular networks rapidly advance on related[2:1] goals as well: if their objective function were suddenly switched to a related goal, they would adapt to it much more quickly than their non-modular counterparts would.

In fact, modular neural networks are so damn adaptable that they do better on related goals despite never training on them. That’s what generalization is: the ability to perform well at tasks with little to no previous exposure to them. That’s why we use L1/​L2 regularization, dropout, and other similar tricks to make our models generalize from their training data to their validation data. These tricks work because they increase modularity, which, in turn, makes our models better at generalizing to new data.

How Dropout Causes Modularity

What’s true for the group is also true for the individual. It’s simple: overspecialize, and you breed in weakness. It’s slow death.

—Major Kusanagi, Ghost in the Shell

Training with dropout is when you train a neural network, but every neuron has a chance of ‘dropping out’: outputting zero, regardless of its input. In practice, making 20-50% of your model’s neurons spontaneously fail during training usually makes it much better at generalizing to previously unseen data.
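For concreteness, here's roughly what that looks like in code. This is a generic PyTorch sketch with arbitrary layer sizes and p = 0.5, not a model from any of the papers above: during training, each hidden activation gets zeroed with probability p, and dropout is switched off at evaluation time.

```python
# Minimal sketch of training with dropout, assuming PyTorch.
# The architecture and p=0.5 are arbitrary choices for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden activation is zeroed with probability 0.5
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)          # dummy batch of inputs
y = torch.randint(0, 10, (32,))   # dummy labels

model.train()                     # dropout is active in training mode
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

model.eval()                      # dropout is disabled at evaluation time
with torch.no_grad():
    predictions = model(x).argmax(dim=1)
```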

Ant colonies have dropout. Ants die all the time; they die to war, to famine, and to kids with magnifying glasses. In response, anthills have a high bus factor. Not only do anthills have specialist ants that are really good at nursing, foraging, and fighting, but they also have all-rounder ants that can do any of those jobs in an emergency.

Dropout incentivizes robustness to random module failures. One way to be robust to random module failures is to have modules that have different specialties, but can also cover for each other in a pinch. Another way is to have a bunch of modules that all do the exact same thing. For a static objective function, from the perspective of an optimizer:

  • If you expect a really high failure rate (like 95%), you should make a bunch of jack-of-all-trades modules that’re basically interchangeable.

  • If you expect a moderate failure rate (like 30%), you should make your modules moderately specialized, but somewhat redundant. Like ants!

  • If you expect no failures at all, you should let modules be as specialized as possible in order to maximize performance.

    • Do that, and your modules end up hyperspecialized and interdependent. The borders between different modules wither away; you no longer have functionally distinct modules to speak of. You have a spaghetti tower.

    • Why would modules blur together? “Typically, there are many possible connections that break modularity and increase fitness. Thus, even an initially modular solution rapidly evolves into one of many possible non-modular solutions.” —Design Principles of Biological Circuits (review, hardcopy, free pdf)

Dropout is performed on neurons, not "modules" (whatever those are), so why does this argument even apply to neural networks? Modules can have sub-modules, and sub-modules can have sub-sub-modules, so (sub-)ⁿ-modules are inevitably going to be made up of neurons for some value of n. The same principle applies to each level of abstraction: redundancy between modules should increase with the unreliability of those modules.

So dropout incentivizes redundancy. How does that boost modularity? A system built from semi-redundant modules is more, uh, modular than an intricately arranged spaghetti tower. Not all functionally modular systems have redundant elements, but redundant systems have to be modular: for one component to cover for another's failure, each has to work as a self-contained unit rather than leaning on the other's internals. So optimization pressure towards redundancy leads to modularity, which leads to generalization.

How L1/​L2 Regularization Causes Modularity

L1/L2 regularization makes parameters pay rent. Like dropout, L1/L2 regularization is widely used to make neural networks generalize better. L1 regularization is when you add a term to the objective function that deducts points proportional to the sum of the magnitudes of all of a model's parameters. L2 regularization is the same thing, but with the squares of the parameters instead of their magnitudes (in the standard 'weight decay' form, the sum of squares is penalized directly; no square root at the end).
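In code, both penalties are just extra terms tacked onto the loss. The sketch below assumes PyTorch; the model, the stand-in task loss, and the lambda coefficients are all arbitrary placeholders.

```python
# Sketch of adding L1/L2 penalty terms to a loss, assuming PyTorch.
# `model`, `base_loss`, and the lambda coefficients are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Linear(100, 10)                            # stand-in for any network
base_loss = model(torch.randn(8, 100)).pow(2).mean()  # stand-in for the task loss

lambda_l1, lambda_l2 = 1e-4, 1e-4

l1_penalty = sum(p.abs().sum() for p in model.parameters())
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())

# L1 charges rent proportional to each parameter's magnitude;
# L2 charges rent proportional to its square (standard weight decay).
loss = base_loss + lambda_l1 * l1_penalty + lambda_l2 * l2_penalty
loss.backward()
```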

The primary effect of L1/​L2 regularization is to penalize connections between neurons, because the vast majority of neural network parameters are weights, or connections between two neurons. Weight pruning, the practice of removing the ‘least important’ weights, also has a similar effect. As we know from Filan et al., L1/​L2 regularization and weight pruning both increase the modularity of neural networks.
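For the pruning half, here's a minimal sketch of magnitude-based weight pruning under the same assumptions (PyTorch, a placeholder model, an arbitrary 30% pruning fraction): the weakest connections simply get zeroed out.

```python
# Sketch of magnitude-based weight pruning, assuming PyTorch.
# `model` is a placeholder network; the 30% pruning fraction is arbitrary.
import torch
import torch.nn as nn

model = nn.Linear(100, 10)   # stand-in for any network
prune_fraction = 0.3

with torch.no_grad():
    weights = model.weight
    # Find the magnitude below which 30% of the weights fall...
    threshold = weights.abs().flatten().quantile(prune_fraction)
    # ...and zero out every connection weaker than that threshold.
    weights.mul_((weights.abs() >= threshold).float())
```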

Connection costs don’t just increase the modularity of artificial neural networks. They increase modularity for biological neural networks too! From Clune et al.:

The strongest evidence that biological networks face direct selection to minimize connection costs comes from the vascular system and from the nervous systems, including the brain, where multiple studies suggest that the summed length of the wiring diagram has been minimized, either by reducing long connections or by optimizing the placement of neurons. Founding and modern neuroscientists have hypothesized that direct selection to minimize connection costs may, as a side-effect, cause modularity.

The authors of this paper then go on to suggest that all modularity in biological networks is caused by connection costs. Whether or not that’s true[3], it’s clear that optimizers that penalize connections between nodes produce more modular networks. Natural selection and ML researchers both happened upon structures with costly connections, and both found them useful for building neural networks that generalize.


  1. ↩︎

    One other thing that increases modularity is just training a neural network; trained networks are more modular than their randomized initial states.

  2. ↩︎↩︎

    When I say “related goals”, I mean goals that share subgoals /​ modular structure with the original goal. See Evolution of Modularity by johnswentworth.

  3. ↩︎

    I highly doubt it.