Deep neural networks are not opaque.

In his List of Lethalities, Eliezer writes that “matrices [in neural networks] are opaque”. But I think our understanding of how neural networks actually work improved significantly over the last year, thanks to the deep learning theory book by Roberts, Yaida, and Hanin. I will briefly outline the general idea of the book below, in case closing this knowledge gap inspires a viable path towards alignment.

Everything that follows is paraphrased from Roberts, Yaida, and Hanin’s book, unless I indicate otherwise. To understand the book, you need decent linear algebra and analysis skills, with a sprinkling of information theory. Most of the difficulty comes from the length of the calculations, which takes some practice to get used to. Luckily, the book is exceptionally well written and guides the reader through the math step by step.

Effective theories

Artificial neural networks are usually described in terms of activations, weights, and biases. This is the “atomic” view. But just like describing a gas in terms of the individual atoms that it is made of, the atomic view of neural networks is too fine-grained for most practical purposes. To understand neural networks, we need to know the effective degrees of freedom of the network—akin to temperature, pressure, and volume of a gas. This is not just practical, but it is essential for a true understanding of anything. You haven’t fully understood a gas until you have derived the ideal gas law—even if you know all about the quantum field theory that ultimately implies it.

A beautiful thing about nature is that effective theories work. While nature always operates on the lowest level, most low-level details don’t matter for the high-level description. Physicists use perturbation theory and “renormalisation group flow” to successfully link the low-level description (e.g. atoms) with the high-level one (e.g. a gas). This not only gives you better concepts to work with at large scales, but also tells you when those concepts will break down and what to do when they do. The idea in the book is to apply the physicist’s tools to find an effective theory description of neural networks.

The ensemble

The weights and biases of a neural network are typically initialised randomly before training. During training, all of these parameters are updated (e.g. step by step via gradient descent) until the network produces a desired output from a given input. The function that the trained network represents therefore usually depends on the particular initialisation that we started with[1]. To fully understand neural networks, it is essential to think in terms of an ensemble of networks over all possible initialisations, instead of a single trained network.

So we are dealing with a probability distribution over network outputs (those of the final layer or any intermediate ones), given an input and an initialisation distribution. To build an effective theory of a feed-forward network, Roberts et al. mainly do three things:

  1. Start with the infinite-width limit, then use perturbation theory in the inverse width to reach large but finite width

  2. Use the recursion relation between layers

  3. Marginalise over the parameter initialisation distribution (whenever this is desirable)
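To make the ensemble picture and the layer recursion a bit more tangible, here is a toy numerical sketch. It is an illustration, not code from the book. It assumes a ReLU network with a scalar output, Gaussian weights of variance `c_w / n_in`, and Gaussian biases of variance `c_b`; the function names, the input vector, and the chosen widths are all hypothetical. It samples many initialisations, compares the output variance with the infinite-width recursion between layers, and estimates the excess kurtosis of the output, which vanishes for a Gaussian and should shrink as the width grows.

```python
import numpy as np

def sample_output(x, hidden_widths, c_w, c_b, rng):
    """Draw one random initialisation and return the scalar output preactivation."""
    h = x
    for n_out in list(hidden_widths) + [1]:
        n_in = h.shape[0]
        # Weights ~ N(0, c_w / n_in), biases ~ N(0, c_b): an assumed Gaussian initialisation.
        w = rng.normal(0.0, np.sqrt(c_w / n_in), size=(n_out, n_in))
        b = rng.normal(0.0, np.sqrt(c_b), size=n_out)
        z = w @ h + b            # preactivation of this layer
        h = np.maximum(z, 0.0)   # ReLU activation fed to the next layer
    return z[0]

def infinite_width_variance(x, n_layers, c_w, c_b):
    """Layer-to-layer recursion for the output variance in the infinite-width limit."""
    k = c_b + c_w * np.dot(x, x) / x.shape[0]
    for _ in range(n_layers - 1):
        # For z ~ N(0, k), E[relu(z)^2] = k / 2.
        k = c_b + c_w * k / 2.0
    return k

def ensemble_stats(x, width, n_samples=20000, c_w=2.0, c_b=0.1, seed=0):
    """Monte Carlo variance and excess kurtosis of the output over initialisations."""
    rng = np.random.default_rng(seed)
    outs = np.array([sample_output(x, [width, width], c_w, c_b, rng)
                     for _ in range(n_samples)])
    second, fourth = np.mean(outs**2), np.mean(outs**4)
    return second, fourth / second**2 - 3.0   # excess kurtosis is 0 for a Gaussian

x = np.array([1.0, -0.5, 2.0])
k_inf = infinite_width_variance(x, n_layers=3, c_w=2.0, c_b=0.1)
for width in (4, 64):
    var, kurt = ensemble_stats(x, width)
    print(f"width={width:3d}  variance={var:.3f} (recursion: {k_inf:.3f})  "
          f"excess kurtosis={kurt:.3f}")
```

The narrow network should show clearly non-Gaussian output statistics, while the wide one should sit close to the Gaussian limit. That gap is the kind of finite-width effect that the perturbative expansion is meant to capture.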

All the math is pretty standard in physics, and as far as I can tell it all checks out (though I haven’t yet done all the calculations myself). What we get in the end is an effective description of a feed-forward network that lets you derive all kinds of interesting things.

Over-parametrisation and generalisation

Modern deep neural networks are usually trained to convergence, where the training error vanishes. Here, we are working in the over-parametrised regime, where the network has many more parameters than are necessary to describe the training data. (See p. 391 of the book for why this is not in conflict with Occam’s razor.) As a result, the loss function (what we try to minimise during training) does not have a single global minimum, but a high-dimensional sub-manifold of global minima. The art of building and training neural networks lies in finding not just any global minimum of the loss function, but one that also minimises the error on unseen test data.
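The simplest place to see a whole family of global minima is an over-parametrised linear model, so here is a minimal sketch under that assumption. It is a toy stand-in, not the networks treated in the book: for a linear model the zero-loss set is a flat affine subspace rather than a curved manifold. The data, sizes, and names below are made up for illustration. Gradient descent from different random initialisations reaches zero training loss every time, yet the resulting models disagree on an unseen input.

```python
import numpy as np

def train_linear(X, y, w_init, steps=5000):
    """Full-batch gradient descent on 0.5 * ||X w - y||^2 for f(x) = w . x."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2   # step size set by the largest singular value
    w = w_init.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(0)
n_train, n_params = 3, 20                  # far more parameters than data points
X = rng.normal(size=(n_train, n_params))
y = rng.normal(size=n_train)
x_test = rng.normal(size=n_params)         # an unseen input

# The zero-loss solutions form an affine subspace with this many free directions.
n_flat_directions = n_params - np.linalg.matrix_rank(X)

train_losses, test_predictions = [], []
for seed in range(5):
    init_rng = np.random.default_rng(100 + seed)
    w_init = init_rng.normal(size=n_params) / np.sqrt(n_params)
    w = train_linear(X, y, w_init)
    train_losses.append(0.5 * np.sum((X @ w - y) ** 2))
    test_predictions.append(w @ x_test)

print("flat directions:", n_flat_directions)
print("largest train loss:", max(train_losses))
print("test predictions:", np.round(test_predictions, 3))
```

Which minimum gradient descent lands in is fixed by the initialisation, so the spread of test predictions here comes entirely from the initialisation distribution.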

Roberts et al. express this error on the test set explicitly with the effective theory and show how it is affected by the choice of initialisation and learning algorithm. Specifically, they show how an object called the neural tangent kernel is the main driver of the function-approximation dynamics and how its components determine the generalisation behaviour of the model. They also show explicitly how and why representation learning works.
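As a rough illustration of why the neural tangent kernel drives the dynamics, the sketch below computes an empirical kernel for a one-hidden-layer tanh network and checks that a single small gradient step changes the outputs by approximately minus the learning rate times the kernel times the residual. Everything here is an assumption made for the example rather than the book's setup: the `1/sqrt(width)` scaling of the network, the single global learning rate, the squared-error loss, and all function names.

```python
import numpy as np

def forward(params, X):
    """f(x) = v . tanh(W x / sqrt(n0)) / sqrt(n) for each row x of X."""
    W, v = params
    n, n0 = W.shape
    return np.tanh(X @ W.T / np.sqrt(n0)) @ v / np.sqrt(n)

def jacobian(params, X):
    """Gradient of each output with respect to all parameters, shape (batch, n*n0 + n)."""
    W, v = params
    n, n0 = W.shape
    act = np.tanh(X @ W.T / np.sqrt(n0))
    d_act = 1.0 - act**2
    # df/dW[j, k] = v[j] * tanh'(u[j]) * x[k] / sqrt(n * n0)
    dW = (d_act * v)[:, :, None] * X[:, None, :] / np.sqrt(n * n0)
    dv = act / np.sqrt(n)                   # df/dv[j] = tanh(u[j]) / sqrt(n)
    return np.concatenate([dW.reshape(len(X), -1), dv], axis=1)

def empirical_ntk(params, X):
    """Kernel Theta[a, b] = sum over parameters of df(x_a)/dtheta * df(x_b)/dtheta."""
    J = jacobian(params, X)
    return J @ J.T

def gd_step(params, X, y, lr):
    """One gradient-descent step on the loss 0.5 * sum((f - y)^2)."""
    W, v = params
    grad = jacobian(params, X).T @ (forward(params, X) - y)
    return (W - lr * grad[: W.size].reshape(W.shape), v - lr * grad[W.size:])

rng = np.random.default_rng(0)
n0, n = 3, 512
params = (rng.normal(size=(n, n0)), rng.normal(size=n))
X, y = rng.normal(size=(4, n0)), rng.normal(size=4)

theta = empirical_ntk(params, X)
f_before = forward(params, X)
lr = 1e-4
actual_change = forward(gd_step(params, X, y, lr), X) - f_before
predicted_change = -lr * theta @ (f_before - y)   # first-order, kernel-driven update
relative_error = (np.linalg.norm(actual_change - predicted_change)
                  / np.linalg.norm(predicted_change))
print("relative error of the kernel prediction:", relative_error)
```

For a small step the kernel prediction should match the actual change closely. What the kernel misses is the way it changes during training itself, which is a finite-width effect.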

The theory described in the book is derived for feed-forward and residual networks, but the same techniques should apply to transformers and any other kind of network that has some recursive or modular structure. Doing this is a lot of hard work, but it can be fun if you like physics.

Dealing with the distributional shift

The question now is whether we can express how a network’s behaviour changes under a distributional shift of the input data. If we can do this, maybe we can say something that might help with alignment. That said, such a research direction is probably advancing capability more than alignment, as usual.

[1] In the over-parametrised, non-convex regime that we are interested in.