This makes some sense, but not totally. The issue I have is: why would you expect the local gradient estimates to be accurate then? Like, you could imagine the loss landscape looks something like an unsmoothed loss-over-time curve
And imagine that the up-down squiggles are each a pair of facets. Now the gradient estimates will basically be total nonsense compared with the overall smooth structure of the graph.
Ok, so I went off and thought about it for a couple of hours. This is what I’ve come up with so far:
I think that the output space (and by extension the loss) is probably very spiky with gradients pointing all over the place.
The main reason I think this is that ReLU by default gives you a load of 1D cuts through the surface, so the average angle between adjacent facets is going to be roughly 90 degrees. I think with small facets that makes for a very spiky surface (although probably not quite to the extent that the loss graph you showed would suggest).
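Here's a quick toy sketch of what I mean (all the shapes, seeds and numbers are just made up for illustration, it's not from anything we linked): take a tiny one-hidden-layer ReLU net, sweep a single weight over a small range, and watch the gradient next to the activation pattern. Within one facet the gradient varies smoothly; whenever the pattern changes (some unit flips on or off for some sample) you've crossed onto a new facet and the gradient jumps.

```python
# Toy illustration (hypothetical shapes/values): the loss is piecewise quadratic
# in a single weight, and its gradient jumps at facet boundaries, i.e. wherever
# some hidden unit's pre-activation crosses zero for some sample.
import torch

torch.manual_seed(0)

x = torch.randn(64, 4)    # toy inputs
y = torch.randn(64, 1)    # toy targets
w1 = torch.randn(4, 16)   # one hidden layer of 16 ReLU units
w2 = torch.randn(16, 1)

mask = torch.zeros_like(w1)
mask[0, 0] = 1.0          # we only perturb w1[0, 0]

def loss_and_pattern(delta):
    """MSE loss and hidden-unit-0 activation count with w1[0, 0] shifted by delta."""
    pre = x @ (w1 + delta * mask)            # pre-activations, shape (64, 16)
    pred = torch.relu(pre) @ w2
    loss = ((pred - y) ** 2).mean()
    n_active = int((pre[:, 0] > 0).sum())    # samples for which unit 0 is "on"
    return loss, n_active

for i in range(21):
    delta = torch.tensor(-1.0 + 0.1 * i, requires_grad=True)
    loss, n_active = loss_and_pattern(delta)
    grad = torch.autograd.grad(loss, delta)[0]
    print(f"delta={delta.item():+.2f}  dL/ddelta={grad.item():+.4f}  active={n_active}")
# Whenever `active` changes between rows we've stepped onto a new facet and the
# gradient jumps; between those changes it varies smoothly.
```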
I basically expect that this bumpy surface averages out to something vaguely smooth that corresponds more directly to which regions of parameter space actually have high or low loss.
I think that this is where large batch sizes and momentum come in: you’re effectively averaging over all the bumps.
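A crude numerical sketch of that averaging intuition (entirely made-up numbers, just the EMA form of momentum): the momentum buffer is an exponential moving average of the per-batch gradients, so even when each individual gradient is dominated by facet-to-facet noise, the buffer settles towards the smooth average direction.

```python
# Hypothetical toy: "smooth" gradient plus huge facet noise, averaged by momentum.
import numpy as np

rng = np.random.default_rng(0)

true_grad = np.array([1.0, -0.5])   # stand-in for the smooth underlying gradient
beta = 0.99                          # momentum coefficient
buf = np.zeros(2)                    # momentum buffer

for step in range(2000):
    # each per-batch gradient is the smooth direction plus noise that swamps it
    noisy_grad = true_grad + rng.normal(scale=5.0, size=2)
    buf = beta * buf + (1 - beta) * noisy_grad   # EMA form of momentum

print("one noisy gradient:   ", noisy_grad)   # points almost anywhere
print("momentum buffer (EMA):", buf)          # much closer to (1.0, -0.5)
```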
This probably also helps explain why learning-rate warmup is so often used? A few iterations are needed to get the momentum's averaging effect into the right ballpark before we can actually move in the correct direction.
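For concreteness, the kind of thing I have in mind is plain linear warmup (a generic sketch with made-up defaults, not any particular scheduler): hold the step size back until the momentum average has had a few hundred updates to settle.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=500):
    """Linear warmup: ramp the learning rate from ~0 to base_lr over warmup_steps updates."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# warmup_lr(0) == 2e-06, warmup_lr(499) == 0.001, warmup_lr(10_000) == 0.001
```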
I still think the facets are going to be small enough that you're consistently hopping between them, but yeah, this does make me think that neural nets are a lot noisier than I'd previously thought.