The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.

What I’m suggesting is that volume in high dimensions can concentrate on the boundary. To be clear, when I say SGD typically only reaches the boundary, I’m talking about early stopping and the main experimental setup in your paper, where training is stopped upon reaching zero train error.
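To make the concentration claim concrete, here is a minimal sketch under a strong simplifying assumption: I treat the zero-train-error region as a ball, which it certainly is not in a real network. The function name `shell_fraction` is mine, purely for illustration. For a ball in `dim` dimensions, the fraction of volume within relative distance `eps` of the surface is `1 - (1 - eps)**dim`.

```python
def shell_fraction(dim: int, eps: float) -> float:
    """Fraction of a dim-dimensional ball's volume that lies within
    relative distance eps of its surface (idealised: region is a ball)."""
    # Volume scales as radius**dim, so the inner ball of radius (1 - eps)
    # holds a fraction (1 - eps)**dim of the total volume.
    return 1.0 - (1.0 - eps) ** dim


if __name__ == "__main__":
    # Thin 1% shell, increasing dimension.
    for dim in (2, 10, 100, 1000):
        print(f"dim={dim:5d}  shell fraction={shell_fraction(dim, 0.01):.6f}")
```

In this idealised picture, a 1% shell holds about 2% of the volume in two dimensions but almost all of it in a thousand, which is the sense in which a uniformly sampled point is nearly always close to the boundary.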

We have done overtraining, which should allow SGD to penetrate into the region. This doesn’t seem to make much difference for the probabilities we get.

This does seem to invalidate the model. However, something tells me that the difference here is more about degree. Since you use the word ‘should’ I’ll use the wiggle room to propose an argument for what ‘should’ happen.

If SGD is run with early stopping, as described above, then my argument is that this is roughly equivalent to random sampling, via an appeal to concentration of measure in high dimensions.

If SGD is not run with early stopping, it ends up enclosed by the boundary of zero-train-error functions. Because the functions it reaches are most likely in the interior, they are unlikely to be produced by random sampling. Thus, on a log-log plot I’d expect overtraining to ‘tilt’ the correspondence between SGD and random sampling likelihoods downward.

**Falsifiable Hypothesis:** Compare SGD with overtraining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling are more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling become less likely under SGD with overtraining.

You seem to have updated your opinion: overtraining does make a difference, but it’s not ‘huge’. Have you run a significance test for your lines of best fit? The plots as presented suggest the effect is significant.
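One way such a test could look, as a rough sketch rather than a description of your pipeline: fit a least-squares line to log SGD probability against log random-sampling probability and ask whether the slope differs from 1. I am assuming random sampling is on the x-axis, so my hypothesis corresponds to a slope below 1. The function name `loglog_slope_test` is hypothetical and the data below is synthetic, generated with a built-in slope of 0.8 only to show the mechanics.

```python
import math
import random


def loglog_slope_test(p_rand, p_sgd):
    """OLS fit of log10(p_sgd) on log10(p_rand).

    Returns (slope, standard error of slope, t statistic for H0: slope == 1).
    """
    xs = [math.log10(p) for p in p_rand]
    ys = [math.log10(p) for p in p_sgd]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    t_stat = (slope - 1.0) / std_err
    return slope, std_err, t_stat


# Synthetic illustration only: NOT data from the paper.
rng = random.Random(0)
log_p_rand = [-8 + 7 * i / 29 for i in range(30)]
p_rand = [10 ** x for x in log_p_rand]
p_sgd = [10 ** (0.8 * x - 0.5 + rng.gauss(0, 0.1)) for x in log_p_rand]

slope, std_err, t_stat = loglog_slope_test(p_rand, p_sgd)
print(f"slope={slope:.3f}  std_err={std_err:.4f}  t={t_stat:.1f}")
```

The t statistic would be compared against a t distribution with n - 2 degrees of freedom. This treats the points as independent with equal noise, which estimated frequencies of rare functions probably violate, so a bootstrap over training runs might be the safer version.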

Figure C.1.a indicates the tilting phenomenon. Probabilities only go up to one, so tilting down means that the most likely candidates from overtrained SGD are less likely with random sampling. Thus, unlikely random sampling candidates are more likely under SGD. At the tail, the opposite happens. Functions more likely with random sampling become less likely under SGD.
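The normalisation argument can be seen in a toy example. This is my own illustrative model, not something from the paper: I assume the tilt acts like raising the random-sampling probabilities to a power below 1 and renormalising. The function name `temper`, the exponent 0.8, and the four probabilities are all made up.

```python
def temper(probs, exponent):
    """Raise each probability to `exponent`, then renormalise to sum to 1."""
    powered = [p ** exponent for p in probs]
    total = sum(powered)
    return [w / total for w in powered]


# Made-up random-sampling distribution over four functions.
p_rand = [0.9, 0.09, 0.009, 0.001]
# Hypothetical overtrained-SGD distribution: log-log slope of 0.8.
p_sgd = temper(p_rand, 0.8)

for pr, ps in zip(p_rand, p_sgd):
    print(f"p_rand={pr:.3f}  p_sgd={ps:.4f}  ratio={ps / pr:.2f}")
```

Because the total must stay at one, the ratio of the tilted probability to the original falls as the original grows: the rare functions gain mass and the most common one loses it.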

While the optimizer has a larger effect, I think the subtle question is whether the overtraining tilts in the same way each time. Figure 16 indicates yes again. This phenomenon, which you consider to be minor, is what I found most interesting about the paper.