This essay consists of summaries, explanations, and discussions of several papers which provide high-level arguments and intuitions about why, conceptually, deep learning works. Particular areas of investigation are “Which classes of functions can deep neural nets approximate well in principle?”; “Why can they quickly learn functions which have very small training loss?”; and “Why do the functions they learn generalise so well?”.
Understanding deep learning requires rethinking generalisation—Zhang, Bengio, Hardt, Recht and Vinyals, 2017
Machine learning in general is about identifying functions in high-dimensional spaces based on finitely many samples from them. In doing so, we navigate between two potential errors: learning a function which is too simple to capture most of the variation in our data (underfitting) and learning a function which matches the data points well, but doesn’t generalise (overfitting). Underfitting implies higher training error; overfitting implies higher test error. Attempting to avoid underfitting is called optimisation, and attempting to avoid overfitting is called regularisation. We can represent a particular neural network with w weights as a w-dimensional space; when training it, we are trying to find the point in that space which most closely corresponds to the true underlying function. A loss function is a heuristic based on our data which tells us approximately how close that correspondence is; one reason that training is difficult is because good loss functions are generally not convex, but rather have some local minima, which a training method like gradient descent has difficulty escaping. However, we can get better results with stochastic gradient descent (SGD): the randomness provided by only updating some variables allows more chance of escaping local minima, in a way comparable to simulated annealing.  SGD also allows proofs of convergence similar to those for total gradient descent, and is much easier to compute.
Another problem is the fact that because our data are finite, there are many models which have very low loss but are very far from the truth. In such extreme cases of overfitting, a learner could effectively “memorise” every piece of input data without their model capturing any of the underlying patterns. An easy way to avoid this is to use a model which doesn’t have enough “memory” to store all the points; this is a form of regularisation. A simple 2-dimensional example: if we want to plot a polynomial of order 100 to fit 1000 data points, the 101 variables we can change are not enough to store all 1000 points, and so if our model has low training error it must be because it’s capturing some pattern in the input data. However, it turns out that this isn’t the case for neural networks: the first result of this paper can be summarised as “deep neural nets easily fit random labels”. The authors found that on both standard image inputs with randomised labels, and inputs consisting of random noise with random labels, their neural networks achieved zero training error! Since there’s (almost) no pattern in such data, this implies that a neural network can essentially memorise a large number of inputs. (Of course, when trained on random labels the test error will be very high.) In terms of number of parameters alone, this is no surprise, since neural networks are typically overparameterised (they have many more weights than training examples). However, it’s surprising that backpropogation can quickly converge to such a detailed representation. In fact, the authors found that there was very little difference between randomised and non-randomised inputs in terms of training time required to reach given levels of training error.
Why do neural networks generalise well in practice, then, instead of just memorising their inputs and gaining no predictive power? (See this paper  for experimental evidence that memorisation isn’t what is happening). One possibility is our use of regularisation. But the authors identify that neural networks can generalise even when no explicit forms of regularisation (such as dropout or weight decay) are used. Their answer: that stochastic gradient descent itself is a form of implicit regularisation which hinders memorisation when there are ways to generalise. Why is that? Apparently, SGD tends to find flat minima rather than sharp minima, especially when the batch sizes used for each step of SGD are small.  Flat minima are robust to small deviations in input parameters, which suggests that they will generalise well. We can also think about this from a Bayesian minimum description length perspective: a function represented by a flat minimum is less complex to specify, and therefore should have a higher prior probability.  On the other hand,  find that reparameterisation can change sharp minima to flat ones and vice versa. They argue that ability to generalise shouldn’t be affected by reparameterisation—but I think I disagree, since some metrics are more natural than others.
The original paper also presents another slightly confusing result: that regularisation methods such as weight decay and batch normalisation can actually improve performance on training data. This is strange because making the model less expressive shouldn’t make it more precise. The explanation may be related to the success of sparse representations in the brain, as discussed below. Other forms of explicit regularisation they consider are data augmentation (e.g. adding inputs which are rotated or perturbed versions of others), dropout, and early stopping; they conclude that data augmentation is a more effective regulariser than the others.
Quirks and curses of dimensionality
Often inputs to deep neural networks are in really high-dimensional spaces—such as images with millions of pixels, or one-hot word vectors of length > 100000. Geometry gets weird in those dimensions, and our intuitions are easily led astray. Here are some features of a 1000-dimensional hypercube with side length 1, for instance:
It has 2^1000 corners.
The distance between opposite corners is around 31, even though the distance between adjacent corners is 1.
The volume of a hypersphere inside the cube which touches all its sides is less than 10^-1000 of the cube’s volume.
These are absurdly high numbers. Here’s an attempt to portray 2, 3, 4 and 6-dimensional cubes with spheres inscribed inside them, with the correct number of corners. Note that these images are inevitably misleading: in reality, you can get from any corner to any other corner without leaving the hypercube, and of course the corners aren’t coplanar. But at least it should give you a better intuition for the issue—and remember, even image d represents a cube with only 6 dimensions.
What conclusions can we draw? Firstly, that measuring Euclidean distances between points isn’t very useful. We’ve already seen that hyperspheres with diameters comparable to a hypercube’s side length end up occupying a negligible portion of that hypercube’s volume, and so almost no points will be “close” to any other; interestingly, the opposite is also true—almost no points will be nearly as far apart as opposing corners. Simulations suggest that distances between random points in high dimensions cluster very tightly around 41% of the distance between opposing corners; also, that the angles between those points and the origin cluster very close to 90 degrees, so that dot-product similarity is less useful.  In fact, for data points drawn from reasonable distributions, the ratio between the distance to their nearest neighbour and the distance to their furthest neighbour will approach 1 in high dimensions. Algorithms to find nearest neighbours are much less efficient in high dimensions, and tend to require linear time for each query, because indexing is no longer viable. (This difficulty can be alleviated somewhat by measuring proximity using Manhattan distance—i.e. the L1 metric—or fractional metrics less than 1, which work even better.) 
In general, these sorts of negative effects are often referred to as the “curse of dimensionality”. The most obvious way to ameliorate them is to use a dimensionality-reducing technique before processing the rest of the data. Algorithms such as principal component analysis (PCA) pick out an orthonormal basis which explains as much of the variance as possible; more generally, we can apply a kernel before doing PCA (which becomes computationally practical when using the kernel trick). Neural networks can also be used for dimensionality reduction, in the form of autoencoders. In CNNs specifically, pooling layers are a form of gradual dimensionality reduction. However, their benefits are controversial.
What’s wrong with convolutional nets? - Hinton, 2014
Geoffrey Hinton in particular has spoken out against the usefulness of pooling, for several reasons.  (Following Hinton, I use “pose” to refer to an object’s position, orientation, and scale).
It is a bad fit to the psychology of shape perception. Humans’ perceptions of objects are based on the rectangular coordinate frames we impose on them; CNNs don’t use coordinate frames.
We want equivariance to viewpoint, not invariance. CNNs can recognise objects by throwing away pose data; whereas humans recognise objects from different viewpoints while also knowing their pose.
It doesn’t use the underlying linear structure. CNNs don’t have a built-in way of generalising to different viewpoints, like humans do; they instead need training data from many perspectives. But extrapolating what a changed perspective looks like shouldn’t be hard, it’s just matrix multiplication of representations of 3D objects!
Pooling is a poor way to do dynamic routing. In different images, different sets of pixels should be grouped together (basically, factorising an image into objects; apparently, humans do this using eye movement?). But CNNs with max-pooling networks only route parts of lower-level representations to higher-level representations based on how active those parts are. Instead, Hinton wants the higher-level representations to give feedback to lower-level representations. For example, if we want to group representations of primitive shapes at one level into representations of objects at the next level, we don’t know whether a circle is a human face, or a car wheel, or a ball. So we send it up to to all three object representations, and they check how well it fits, and it is assigned based on those assessments. Note that this is similar to how predictive processing works in the human brain. 
Hinton’s solution is capsule networks, which explicitly store and reason about the poses of objects that they see, as well as using the feedback mechanisms I mentioned above. They’ve performed well on some tasks so far, but have yet to prove themselves more generally. Either way, I think this example illustrates a good meta-level point about the usefulness of deep learning. Bayesian methods are on a sounder theoretical footing than deep learning, and knowing them gives us insights into how the brain works. But knowing how the brain works doesn’t really help us build better Bayesian methods! Deep learning serves as an intermediate step which is informed both by abstract mathematical considerations, and by our knowledge of the one structure—the human brain—that already has many of the capabilities we are aiming for.
Learning Deep Architectures for AI—Bengio, 2009
Another way of thinking about deep neural networks is that each layer represents the input data at a higher level of abstraction. We have good reasons to believe that human brains also feature many layers of abstraction, especially in the visual system. The parallel is particularly clear in CNNs trained for vision tasks, where early layers identify features such as edges, then primitive shapes built out of edges, then complex shapes built out of primitive shapes, in a similar way to human brains. Another example is the hierarchical structure of programs, which often have subfunctions and subsubfunctions. In each case, having more layers allows more useful abstractions to be built up: we can abstractly represent a picture with millions of pixels using a short description such as “man walking in a park”. This relies on each layer being nonlinear, since arbitrarily many layers of linear functions are no more powerful than one layer. The key advantage of neural networks is that humans do not have to hand-code which abstraction should be used at which layer; rather, the neural network deduces them automatically given the input and output representations. Once it has made these deductions when training on some input data (e.g. distinguishing cats and dogs), it should theoretically be easier to train it to recognise a new category (such as pigs), since it can “describe” that category in terms of the features it already knows.
The representations we find in the brain have two key features: they are distributed (multiple neurons in a representation can be active at once) and sparse (only 1-4% of neurons are active at any time). The number of patterns a distributed representation can distinguish is exponential in the dimension of the representation (as opposed to non-distributed one-hot encodings, which are only linear). One key aspect here is the idea of local vs non-local generalisation. In general, local models require two steps. First they check how much a data point matches each “region” of the function it has learned; then they combines values for each region, depending on how well they match. Two examples are k-nearest-neighbours, and kernel machines using the Gaussian kernel. Although boundaries between regions may be fuzzy, intuitively local models are dividing up a function “piecewise” and then describing each piece separately. However, this doesn’t allow us to learn a function which has many more regions than we have training examples. For example, kernel machines with a Gaussian kernel require a number of examples linear in the number of “bumps” in the target function (one bump is the function starting positive, then becoming negative, then positive again). Instead, we need to learn representations which are less local, such as more abstract and distributed representations.
Learning multiple layers of distributed representations from scratch is a challenge for neural networks, though. One way to improve performance is by pre-training one layer at a time in an unsupervised fashion. We can reward the first layer for producing similar outputs when given similar inputs (using, for instance, a Euclidean metric); once it is trained, we can do the same for the second layer, and so on. It might be expected that by doing this, we will simply teach each layer to implement the identity function. But in fact, because of implicit regularisation, as discussed above, and the fact that implementing the identity function requires very small weights in the first layer, in practice pre-trained layers actually end up with different representations. Without pre-training, the top two layers alone are able to get very low training error even if the lower layers produce bad representations; but when we limit the size of the top two layers, pre-training makes a major difference—suggesting that it is particularly important in lower layers, perhaps because the signal from backpropogation is more diffuse when it reaches them.
Various expressibility proofs
Surprisingly, adding more layers doesn’t change which functions a neural network can learn. In general, any continuous function can be approximated arbitrarily closely by a neural network with only one hidden layer. However, that layer may need to be very, very wide.  In general, for a k-layer network to be able to express all the same functions as a (k+1)-layer network, it may require exponentially-wider layers! A simple example is the parity function, which can only be computed in a 2-layer network if it is exponentially wide in the number of inputs (this is intuitively because the function has many “regions” in the sense described above). One useful intuition here is to view a deep and narrow network as “factorising” a shallow but wide network: a term can be computed in a lower layer and then referred to many times in higher layers, instead of being separately computed many times.
There are also several techniques we can think of as equivalent to adding another layer. For example, transforming inputs according to a fixed kernel can be thought of as adding one (more powerful) layer at the bottom. Boosting, a technique to aggregate the results of sub-models, adds an additional layer at the top. These can reduce the necessary width by orders of magnitude—for example, boosted ensembles of decision trees can distinguish between exponentially many regions (intuitively, if each decision tree divides the input space in half along a different dimension, then adding another decision tree could double the number of separate regions).
In , the authors argue that it’s not the space of all functions which we should be interested in, but rather the expressive power of neural networks on a finite sample of size n. They prove that there exists a two-layer neural network with ReLU activations and 2n+d weights that can represent any function f : S->R, where |S| = n and each element of S has d dimensions. This is a much nicer bound than above; however, I’m not sure how relevant it is. In reality, we will always need to process inputs which aren’t exactly the ones that we’ve trained on. It’s true that we can often just treat novel inputs similarly to others near them that we’ve already learned—but for reasons I explained above, to get good coverage in high-dimensional spaces n would need to be astronomically large.
Lastly, a CNN with a last hidden layer which is wider than the number of inputs can a) learn representations for each input which are linearly independent of each other, and b) achieve zero loss on those inputs. Both of these properties hold with probability 1 even if all previous layers have random values.  And in fact several CNN setups which have achieved top results do have layers this wide. This helps explain the memorisation results in , but if anything makes it harder to explain why our trained models can still generalise well.
However, note that all of these results only show that some combination of weights exists with these properties, not that those weights can be efficiently found.
A representation of programs for learning and reasoning—Looks and Goertzel, 2008
This paper isn’t directly about deep learning, but it provides an interesting framework which I hadn’t considered before. We often write programs to search over some space of possibilities. But we can think about the process of writing a program itself as searching through the space of all possible character strings for one that will implement a desired function. This paper explores how we might automate that search. Some preliminaries: we distinguish the syntax of a program (the string of characters of which it’s composed) from the semantics of a program (the function that it encodes). A “syntactically valid” program is one which compiles, and therefore has a semantic “meaning”.
First, the paper discusses some of the key features of programs: that they are well-specified (unambiguous), compact (they allow us to specify functions more tersely than by explicating input-output pairs), combinatorial (programs can rearrange the outputs of other programs) and hierarchical (can be decomposed into subprograms). Some of the properties that make reasoning about them challenging: open-endedness (programs can be arbitrarily large), over-representation (syntactically distinct programs may be semantically identical), chaotic execution (programs that are very syntactically similar may be very semantically different), and high resource-variance (time and space requirements of syntactically similar programs may vary widely).
Now let’s say that we want to search for a program which has certain semantic properties. If our search space is over all possible programs, and we consider building them up character by character, we very quickly get combinatorial explosion. Intuitively, this is very inefficient: the vast majority of possible character strings simply don’t compile. It’s also difficult to find good heuristics to direct such a search—there are many invalid programs which differ from a correct solution by only one character. What we want is a way to represent programs at a higher level than just characters, so that we can specify all possible ways to change a program so that it’s still syntactically valid, and then conduct a search through the resulting space. This is similar to lambda calculus or formal logic, in which there are several fixed ways to create an expression out of other expressions, the result of which will always be syntactically valid. In lambda calculus and formal logic, every expression has a “normal form” with the same semantics. The authors propose that each program should similarly be represented by a normal form that preserves its hierarchical structure; from these, we can find “reduced normal forms” which are even more useful, by simplifying based on reduction rules. They claim that a representation with those properties will be more useful than others even if these representations are all capable of expressing the same programs (in the same way that it’s sometimes more useful to use one-hot encodings for words, rather than representing them as strings of characters).
This paper presents a proof-of-concept normal form. Primitive types are Booleans and numbers; then there are data types parameterised by primitive types, like lists and tuples. Each has elementary functions; the elementary functions of lists are append, and a constructor. There are also general functions, such as the one which returns the nth element of a tuple. Once we’ve represented a program in this way, we can compress it using canonical reduction rules, which decrease the size of a program without changing the semantics (a basic example could be “if a variable is declared with one value, then immediately changed, then you can reduce these two lines by just declaring it with the latter value directly”). We can then conduct a search starting from any given program by applying transformations to that program. Transformations come in two types: “neutral” (which preserve semantics) and “non-neutral” (which may not). Examples of non-neutral transformations include adding elements to a list, composing one function with another, and use of the fold function. Neutral transformations include abstraction and inserting an if-statement where the condition is always true. Such neutral transformations aren’t useful by themselves, but they may later allow desirable non-neutral transformations (e.g. one which then changes the tautological if-statement condition to something more useful). By only allowing certain well-specified transformations, we narrow down the search space of all possible character strings to the search space of all syntactically valid programs in normal form. The latter space is still massive, but much much smaller than the former. The hope is that programs in normal form also have more obvious correlations between syntax and semantics, so that we can create good search heuristics without having to evaluate the actual “fitness” of every program we create by running it on many inputs, or proving that it works.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalisation. https://arxiv.org/abs/1611.03530
Leon Bottou. 1991. Stochastic gradient learning in neural networks. http://leon.bottou.org/publications/pdf/nimes-1991.pdf
David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S. Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, Aaron Courville. 2017. Deep nets don’t learn via memorisation. https://openreview.net/pdf?id=rJv6ZgHYg
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. 2017. On large-batch training for deep learning: generalisation gap and sharp minima. https://openreview.net/pdf?id=H1oyRlYgg
Ferenc Huszar. 2017. Everything that works works because it’s Bayesian: why deep nets generalise? http://www.inference.vc/everything-that-works-works-because-its-bayesian-2/
Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio. 2017. Sharp minima can generalise for deep nets. https://arxiv.org/pdf/1703.04933.pdf
Martin Thoma. 2016. Average distance of random points in a unit hypercube. https://martin-thoma.com/average-distance-of-points/
Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim. 2001. On the surprising behaviour of distance metrics in high dimensional space. https://bib.dbvis.de/uploadedFiles/155.pdf
Geoffrey Hinton. 2014. What’s wrong with convolutional nets? http://techtv.mit.edu/collections/bcs/videos/30698-what-s-wrong-with-convolutional-nets
Scott Alexander. 2017. Book review: Surfing Uncertainty. http://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/
Yoshua Bengio. 2009. Learning deep architectures for AI. https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf
Kurt Hornik. 1991. Approximation capabilities of multilayer feedforward networks. https://www.sciencedirect.com/science/article/pii/089360809190009T?via%3Dihub
Quynh Nguyen, Matthias Hein. 2017. The loss surface and expressivity of deep convolutional neural networks. https://arxiv.org/abs/1710.10928
Moshe Looks, Ben Goertzel. 2008. A representation of programs for learning and reasoning. https://pdfs.semanticscholar.org/67cf/4ddc0cf5c12654273c2b0dd9e8121673d940.pdf