LDL 2: Nonconvex Optimization

Edit 10/​23: by request I’m going back and putting the images in. It’s a pain that you can’t place images in line! Also I edited the dumb probability calculation to make more sense since I may not be able to write another post today.

My favorite “party theorem” is Gauss-Bonnet. It’s the theorem that tells you why New Yorkers fold their pizza to eat it.

The statement of the Gauss-Bonnet theorem is, basically, that for a compact two-dimensional Riemannian manifold, the integral of the Gaussian curvature over the interior plus the integral of the geodesic curvature over the boundary equals 2π times the Euler characteristic of the manifold.
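Written out in symbols, that’s the usual formula:

```latex
% Gauss-Bonnet: K is the Gaussian curvature of M, k_g is the geodesic
% curvature of the boundary, and \chi(M) is the Euler characteristic.
\int_M K \, dA + \int_{\partial M} k_g \, ds = 2\pi \, \chi(M)
```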

The most important consequence of this is that the total amount of curvature on a surface is a topological invariant. That is, in order to change “how curved” something is, you have to tear or crease it; you can’t just stretch or bend it.

Now hopefully anyone can tell you that pizza, to begin with, is flat.

But as most people who have eaten pizza can also tell you, it doesn’t always stay flat.


And this can cause all the toppings to fall off.

Is the pizza tearing or creasing? No, usually it’s just bending and stretching. So if that can’t change how curved it is, why can the toppings fall off your pizza?

The answer is that Gaussian curvature, the ‘correct’ kind of curvature in the sense that it’s the kind whose total over the surface is topologically constant, is basically the product of the curvatures in two perpendicular directions when you look at a surface.

When a pizza is flat, you can choose any direction, and it’s not curved, so 0. Then you choose the orthogonal direction, and it’s still not curved, so that’s 0 too. Then multiply those to get Gaussian curvature, and you get 0.

Conveniently, if you multiply ANYTHING by zero, you get zero. So all you need for your pizza to work out is, basically, one direction where the pizza isn’t curved, and it can curve as much as it wants in the other direction and still be “flat” since the curvature is being multiplied by zero.
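In symbols, all that bookkeeping is just the definition of Gaussian curvature as the product of the curvatures in the two principal directions:

```latex
% Gaussian curvature of a surface is the product of its two principal curvatures.
K = \kappa_1 \cdot \kappa_2
% Flat pizza:   \kappa_1 = 0 and \kappa_2 = 0,        so K = 0.
% Folded pizza: \kappa_1 = 0 and \kappa_2 = anything, so K = 0 still.
```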

But you can exploit this! If you pick a direction and purposefully curve your pizza in that direction, then in order to stay “flat” in the Gaussian curvature sense, the pizza is forced to stay straight in the other direction, so the tip stays up (unless it tears, which is always a possibility).

This is very convenient if you want to keep all that cheese on your pizza!

But if you’re a hyperintelligent n-dimensional shade of the color blue, this won’t help you.

Imagine you’re trying to eat a slice of three-dimensional pizza, and you fold it in one direction to give one of its principal curvatures a very high value. Then a second principal curvature will have to be zero, to cancel this out and keep the Gaussian curvature of your hyperpizza at zero. But this three-dimensional hyperpizza has a third principal curvature! So while you hold your hyperpizza up by the crust in one dimension, and that pulls it up in a second dimension, all the toppings fall off in the third direction.

This brings us to the reason I’m talking about Gaussian curvature today.

In the ancestral environment, machine learning engineers cared a lot about avoiding local optima in their problems.

These local optima are places where all the local derivatives are zero, and even if you move a little bit away, you’ll end up rolling right back to them. Think about climbing to the highest peak; if you’re right next to a lower peak you’ll probably climb up that one instead of trying to find a different peak (at least if you’re an ML algorithm greedily following the gradient).

One of the things that makes local optima so worrying is that, in some situations, you may have to go a long way to get to a different local optimum, and you’ll still be uncertain about whether you’re at the best one!

Those of you who remember your first calculus classes may recall that there are two ways for a local flat area to arise: at a peak or at a valley (or an inflection point but these should be extremely rare). If you’re trying to find the lowest place on the map, then a peak isn’t going to give you too much of a problem—you’ll move a little bit away and start rolling down the hill. There are plenty of ways to implement this in ML; things like random minor perturbation of your weights or data augmentation with noise or just normal tricks like Adam optimization.
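Here’s a toy sketch of that last point (a made-up one-dimensional “loss”, not anything from a real training run): parked exactly on a peak, plain gradient descent sees a zero gradient and never moves, but the tiniest random nudge sends it rolling down into a valley.

```python
import numpy as np

def loss(x):
    return np.cos(x)       # a peak at x = 0, valleys at x = +/- pi

def grad(x):
    return -np.sin(x)      # derivative of cos(x)

def descend(x, lr=0.1, steps=200, noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x = x - lr * grad(x) + noise * rng.normal()  # gradient step plus optional jitter
    return x

print(descend(0.0))              # stays at 0.0: the peak is a critical point
print(descend(0.0, noise=1e-3))  # ends up near +/- 3.14, an actual valley
```

In practice you usually get this nudge for free, since the minibatch noise in stochastic gradient descent plays the same role as the explicit jitter above.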

This is where the pizza metaphor comes back. In the ancestral environment, machine learning engineers worked with things like pizza or three-dimensional hyperpizza. And they were really trying to get those toppings to fall off. But if there was one hand holding up a part of the pizza, then they couldn’t get moving, and if there was a second hand holding up the hyperpizza they were also stuck.

But as we’ve progressed in time and computational power and data size, we end up mostly working in very high-dimensional spaces these days. A typical deep neural net can have tens of thousands of parameters, so our gradient descent is happening in a ten-thousand-dimensional space!

Now, there are a lot of steps that occur in training a neural network, and I don’t want to assume anything about the problem like the idea that whether you get a peak or a valley is independent across directions. But if it is, and if the odds are 50/50 in any given direction that you’ll find a peak versus a valley, then for there to be only a 10% chance that you’ve never hit a true local optimum, you’ll need to encounter N critical points, where N satisfies:

[1 - (.5^10000)]^N = .1

N log(1 - (.5^10000)) = log(.1)

N = log(.1)/​log(1-.5^10000)

.5^1000 is about 10^-301, so .5^10000 is about 10^-3010, which Google’s calculator refuses to keep in memory for me, but log(.1)/log(.999…) works out to about 2.3*10^(number of nines), so we can flip that around and say you’d need to encounter on the order of 10^3010 critical points before expecting even one of them to be a true local optimum (IF the directions are independent; they usually aren’t, but that’s a different problem).
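If you don’t trust the logarithm hand-waving, here’s a quick sanity check in Python using the mpmath arbitrary-precision library (a sketch for checking the arithmetic, not part of the original back-of-the-envelope math):

```python
# Check the back-of-the-envelope number with enough precision that
# 1 - 0.5**10000 doesn't just round to 1.
from mpmath import mp, mpf, log, nstr

mp.dps = 3100  # decimal digits of working precision; 2^-10000 is ~5e-3011

p = mpf(2) ** (-10000)            # chance a critical point curves the same way
                                  # in all 10,000 directions
N = log(mpf("0.1")) / log(1 - p)  # critical points needed before the chance of
                                  # having seen *no* true local optimum drops to 10%

print(nstr(N, 5))  # about 4.6e+3010, i.e. on the order of 10^3010
```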

In conclusion, getting stuck in local optima isn’t really a problem for deep learning. Local optima are just so hard to come by that you don’t really expect to see them at all.

Coda

So personally I don’t think that convex optimization is a very interesting part of deep learning. It’s something that’s been studied extensively and implemented in low level C code and everyone has built specialized hardware for doing all the linear algebra, and I don’t feel like it’s under-studied or like it’s going to lead to any exciting breakthroughs. I’m glad people are working on it, but I don’t care that much about thinking about it myself. But I do like the Gauss-Bonnet pizza story and optimization is what’s been happening in the Coursera course and I want to write every day so here you go.