Awesome, I saw that comment—thanks, and I’ll try to reply to it in more detail.
It looks like you’re not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used? From a skim, the caveats you raised are mostly/all caveated in the original post too—though I think you may have missed the (less rigorous but more realistic!) second model at the end, which departs from the simple annealing process to a more involved population process.
I think even on this basis though, it’s going too far to claim that the best we can say is “they both do local optimization and involve randomness”! The steps are systematically pointed up/down the local fitness gradient, for one. And they’re based on a sample-based stochastic realisation for another.
I don’t want you to get the impression I’m asking for too much from this analogy. But the analogy is undeniably there. In fact, in those explainer videos Habryka linked, the particular evolution described is a near-match for my first model (in which, yes, it departs from natural genetic evolution in the same ways).
It looks like you’re not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used?
I’m disputing both. Re: math, the noise in your model isn’t distributed like SGD noise, and unlike SGD the the step size depends on the gradient norm. (I know you did mention the latter issue, but IMO it rules out calling this an “equivalence.”)
I did see your second proposal, but it was a mostly-verbal sketch that I found hard to follow, and which I don’t feel like I can trust without seeing a mathematical presentation.
(FWIW, if we have a population that’s “spread out” over some region of a high-dim NN loss landscape—even if it’s initially a small / infinitesimal region—I expect it to quickly split up into lots of disjoint “tendrils,” something like dye spreading in water. Consider what happens e.g. at saddle points. So the population will rapidly “speciate” and look like an ensemble of GD trajectories instead of just one.
If your model assumes by fiat that this can’t happen, I don’t think it’s relevant to training NNs with SGD.)
Wait, you think that a model which doesn’t speciate isn’t relevant to SGD? I’ll need help following, unless you meant something else. It seems like speciation is one of the places where natural evolutions distinguish themselves from gradient descent, but you seem to also be making this point?
In the second model, we retrieve non-speciation by allowing for crossover/horizontal transfer, and yes, essentially by fiat I rule out speciation (as a consequence of the ‘eventually-universal mixing’ assumption). In real natural selection, even with horizontal transfer, you get speciation, albeit rarely. It’s obviously a fascinating topic, but I think pretty irrelevant to this analogy.
For me, the step-size thing is interesting but essentially a minor detail. Any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs … is relevant to the conclusions we want to draw? (Serious question; my best guess is ‘no’, but I hold that medium-lightly.)
(irrelevant nitpick by my preceding paragraph, but) FWIW vanilla SGD does depend on gradient norm. [ETA: I think I misunderstood exactly what you were saying by ‘step size depends on the gradient norm’, so I think we agree about the facts of SGD. But now think about the space including SGD, RMSProp, etc. The ‘depends on gradient norm’ piece which arises from my evolution model seems entirely at home in that family.]
On the distribution of noise, I’ll happily acknowledge that I didn’t show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.
Awesome, I saw that comment—thanks, and I’ll try to reply to it in more detail.
It looks like you’re not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used? From a skim, the caveats you raised are mostly/all caveated in the original post too—though I think you may have missed the (less rigorous but more realistic!) second model at the end, which departs from the simple annealing process to a more involved population process.
I think even on this basis though, it’s going too far to claim that the best we can say is “they both do local optimization and involve randomness”! The steps are systematically pointed up/down the local fitness gradient, for one. And they’re based on a sample-based stochastic realisation for another.
I don’t want you to get the impression I’m asking for too much from this analogy. But the analogy is undeniably there. In fact, in those explainer videos Habryka linked, the particular evolution described is a near-match for my first model (in which, yes, it departs from natural genetic evolution in the same ways).
I’m disputing both. Re: math, the noise in your model isn’t distributed like SGD noise, and unlike SGD the the step size depends on the gradient norm. (I know you did mention the latter issue, but IMO it rules out calling this an “equivalence.”)
I did see your second proposal, but it was a mostly-verbal sketch that I found hard to follow, and which I don’t feel like I can trust without seeing a mathematical presentation.
(FWIW, if we have a population that’s “spread out” over some region of a high-dim NN loss landscape—even if it’s initially a small / infinitesimal region—I expect it to quickly split up into lots of disjoint “tendrils,” something like dye spreading in water. Consider what happens e.g. at saddle points. So the population will rapidly “speciate” and look like an ensemble of GD trajectories instead of just one.
If your model assumes by fiat that this can’t happen, I don’t think it’s relevant to training NNs with SGD.)
Wait, you think that a model which doesn’t speciate isn’t relevant to SGD? I’ll need help following, unless you meant something else. It seems like speciation is one of the places where natural evolutions distinguish themselves from gradient descent, but you seem to also be making this point?
In the second model, we retrieve non-speciation by allowing for crossover/horizontal transfer, and yes, essentially by fiat I rule out speciation (as a consequence of the ‘eventually-universal mixing’ assumption). In real natural selection, even with horizontal transfer, you get speciation, albeit rarely. It’s obviously a fascinating topic, but I think pretty irrelevant to this analogy.
For me, the step-size thing is interesting but essentially a minor detail. Any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs … is relevant to the conclusions we want to draw? (Serious question; my best guess is ‘no’, but I hold that medium-lightly.)
(irrelevant nitpick by my preceding paragraph, but) FWIW vanilla SGD does depend on gradient norm. [ETA: I think I misunderstood exactly what you were saying by ‘step size depends on the gradient norm’, so I think we agree about the facts of SGD. But now think about the space including SGD, RMSProp, etc. The ‘depends on gradient norm’ piece which arises from my evolution model seems entirely at home in that family.]
On the distribution of noise, I’ll happily acknowledge that I didn’t show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.