1. Investigate (randomly) modularly varying goals in modern deep learning architectures.

I did a small experiment regarding this. Short description below.

I basically followed the instructions given in the section: I trained a neural network on pairs of digits from the MNIST dataset. The two digits were glued together side-by-side to form a single image. I just threw something together for the network architecture, but the second-to-last layer had 2 nodes (as in the post).
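For concreteness, here is a minimal sketch of the "glue two digits side-by-side" step. It assumes each digit is a 28x28 array (the standard MNIST size) and uses random arrays as stand-ins for real digits; the function name `glue_side_by_side` is mine, not from the post or from my original code.

```python
import numpy as np

def glue_side_by_side(left, right):
    """Concatenate two equally sized 2D images horizontally."""
    left = np.asarray(left)
    right = np.asarray(right)
    if left.ndim != 2 or left.shape != right.shape:
        raise ValueError("expected two 2D images of the same shape")
    # axis=1 stacks along the width, so the result is twice as wide.
    return np.concatenate([left, right], axis=1)

# Stand-in data: random arrays in place of real MNIST digits (assumption).
rng = np.random.default_rng(0)
left_digit = rng.random((28, 28))
right_digit = rng.random((28, 28))

pair_image = glue_side_by_side(left_digit, right_digit)
print(pair_image.shape)  # (28, 56)
```

The left digit occupies columns 0 to 27 and the right digit columns 28 to 55, which is what makes "change one half of the image" well defined later on.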

I had two different types of loss functions / training regimes:

mean-square-error, the correct answer being x + y, where x and y are the digits in the images

mean-square-error, the correct answer being ax + by, where a and b are uniformly random integers from [-8, 8] (excluding the cases where a = 0 or b = 0), with the values of a and b changing every 10 epochs.

In both cases the total number of epochs was 100. In the second case, for the last 10 epochs I had a = b = 1.
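The coefficient schedule for the second regime can be sketched roughly as follows. This is an illustration rather than my actual training code: the helper names are made up, I'm assuming the endpoints -8 and 8 are included, and the seed is arbitrary.

```python
import random

def sample_coeff(rng, low=-8, high=8):
    """Draw a uniformly random nonzero integer from [low, high] (inclusive)."""
    while True:
        c = rng.randint(low, high)
        if c != 0:
            return c

def coefficient_schedule(total_epochs=100, period=10, seed=0):
    """Return one (a, b) pair per epoch.

    The pair is resampled every `period` epochs, and the final `period`
    epochs use a = b = 1 so the model ends on the plain x + y task.
    """
    rng = random.Random(seed)
    schedule = []
    a, b = 1, 1
    for epoch in range(total_epochs):
        if epoch >= total_epochs - period:
            a, b = 1, 1
        elif epoch % period == 0:
            a, b = sample_coeff(rng), sample_coeff(rng)
        schedule.append((a, b))
    return schedule

def target(a, b, x, y):
    """Regression target for a digit pair (x, y) under coefficients (a, b)."""
    return a * x + b * y

schedule = coefficient_schedule()
```

Sampling a and b independently and rejecting zeros gives a uniform distribution over the allowed pairs, which is one reasonable reading of the description above.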

The hard part is measuring the modularity of the resulting models. I didn’t come up with anything I was satisfied with, but here’s the motivation for what I did (followed by what I did):

Informally, the “intended” or “most modular” solution here would be: the neural network consists of two completely separate parts, identifying the digits in the first and second half of the image, and only at the very end are these classifications combined. (Cf. the image in example 1 of the post.)

What would we expect to see if this were true? At least the following: if you change the digit in one half of the image to something else and then do a forward-pass, there are lots of activations in the network that don’t change. Weaker alternative formulation: the activations in the network don’t change very much.

So! What I did was: store the activations of the network when one half of the image is sampled randomly from the MNIST dataset (and the other one stays fixed), and look at the Euclidean distances between those activation vectors. Normalizing by the (geometric) mean of the lengths of the activation vectors gives a reasonable metric of “how much did the activations change relative to their magnitude?”. I.e. the metric I used is $\frac{|v - w|}{\sqrt{|v|\,|w|}}$.
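As a sketch, the metric for a single pair of activation vectors looks like this. The function name is hypothetical, and how the values get aggregated (which layers, how many pairs) is left out here, so treat it as an illustration of the formula only.

```python
import numpy as np

def relative_change(v, w):
    """Euclidean distance between v and w, divided by the geometric
    mean of their Euclidean norms."""
    v = np.asarray(v, dtype=float).ravel()
    w = np.asarray(w, dtype=float).ravel()
    denom = np.sqrt(np.linalg.norm(v) * np.linalg.norm(w))
    if denom == 0:
        raise ValueError("metric is undefined when either vector is zero")
    return np.linalg.norm(v - w) / denom

# Identical activations give 0; orthogonal unit vectors give sqrt(2).
same = relative_change([3.0, 4.0], [3.0, 4.0])
orthogonal = relative_change([1.0, 0.0], [0.0, 1.0])
```

The metric is unchanged if both vectors are scaled by the same factor, which is the point of the normalization: networks with larger activations overall are not penalized for that alone.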

And the results? Were the networks trained with varying goals more modular on this metric?

(The rest is behind the spoiler, so that you can guess first.)

For the basic “predict x+y”, the metric was on average 0.68 ± 0.02 or so, quite stable over the four random seeds I tested. For the “predict ax + by, a and b vary” I once or twice ran into the issue of the model just completely failing to predict anything. When it worked out at all, the metric was 0.55 ± 0.05, again over ~4 runs. So maybe a 20% decrease or so.
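For reference, the relative decrease computed from the two averages above (ignoring the spread, which is large relative to the gap for so few runs):

$$
\frac{0.68 - 0.55}{0.68} = \frac{0.13}{0.68} \approx 0.19
$$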

Is that a little or a lot? I don’t know. It sure does not seem zero—modularly varying goals does something. Experiments with better notions of modularity would be great—I was bottlenecked by “how do you measure Actual Modularity, though?”, and again, I’m unsatisfied with the method here.
