I feel a bit confused about gradient descent being described as a selective process, and thus about this binary. Is gradient descent a selective process? It doesn’t seem like it.
All the other examples of selective processes involve… variation and selection: you have a population with variation, the population gets culled, the remaining population has more of some quality, repeat. But gradient descent does not feature this, at least not in a straightforward way. There’s no pool of candidates, no acceptance / rejection, no competition, really.
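By "variation and selection" I mean a loop shaped roughly like this (a toy truncation-selection sketch; the fitness function and all the numbers are made up):

```python
import random

random.seed(0)  # for reproducibility

def fitness(x):
    # Toy fitness: closer to 10 is better.
    return -(x - 10) ** 2

# A population with variation.
population = [random.uniform(-20, 20) for _ in range(100)]

for generation in range(50):
    # Selection: cull the bottom half by fitness.
    population.sort(key=fitness, reverse=True)
    survivors = population[:50]
    # Variation: each survivor leaves two mutated offspring.
    population = [x + random.gauss(0, 0.5) for x in survivors for _ in range(2)]

best = max(population, key=fitness)
```

The point is the explicit pool of candidates and the explicit culling step; a gradient update has neither as a distinct mechanism.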
(This might have consequences, for instance, for how gradient descent can work differently from more selective / evolutionary processes. Evolutionary Strategies At Scale, for instance, finds that Evolution Strategies behaves differently from gradient descent when used to train an LLM. See also.)
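For reference, the core of an Evolution Strategies update, in the spirit of that line of work (a minimal sketch with made-up hyperparameters and a toy objective, not the paper's exact algorithm): the "gradient" is estimated from fitness-weighted random perturbations, with no backprop involved.

```python
import numpy as np

def es_gradient_estimate(f, theta, sigma=0.1, n_samples=200, rng=None):
    """Estimate the gradient of E[f(theta + sigma * eps)] from perturbations."""
    rng = rng if rng is not None else np.random.default_rng(0)
    eps = rng.standard_normal((n_samples, theta.size))
    rewards = np.array([f(theta + sigma * e) for e in eps])
    # Fitness-weighted average of perturbations; baseline subtraction
    # (rewards - mean) reduces variance without changing the expectation.
    return ((rewards - rewards.mean())[:, None] * eps).mean(axis=0) / sigma

# Toy objective to maximize: -||theta - target||^2.
target = np.array([1.0, -2.0, 3.0])
f = lambda th: -np.sum((th - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(300):
    theta += 0.05 * es_gradient_estimate(f, theta, rng=rng)
```

This is population-flavored in a way a backprop step is not: it proposes many perturbed candidates and weights them by how well they do.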
But generally this binary feels pretty fuzzy to me; the MECE-ness of it, or the membership criteria, seems unclear.
I wrote something about this a while back: in short, with a squint gradient descent and natural selection are the same.
From my point of view, one thing that’s particularly relevant is that they’re both operating locally, with little or no foresight, over a high-dimensional design space. You could look at GD as selecting among all the possible local steps, ‘competing’ them based on the heuristic of their local loss gradient (as approximated by the sampled, dataset-derived estimator).
Some key practical differences between varying instantiations of GD/NS will be in the effective ‘proposal’/generating procedures and ‘promotion’/selection heuristics.
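To make that framing concrete, here’s a toy comparison (hypothetical quadratic loss; `select_step` is my stand-in for a blind propose-and-select procedure): gradient descent takes the one step the local gradient picks out, while the selection version proposes random steps of the same size and keeps the best.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    return np.sum(w ** 2)  # toy loss, minimum at 0

def grad_step(w, lr=0.1):
    # Gradient descent: the single step the local gradient picks out.
    return w - lr * 2 * w

def select_step(w, lr=0.1, n_candidates=64):
    # 'Selection': propose random unit-direction steps, keep the best.
    candidates = [w + lr * d / np.linalg.norm(d)
                  for d in rng.standard_normal((n_candidates, w.size))]
    return min(candidates, key=loss)

w_gd = np.full(10, 5.0)
w_sel = w_gd.copy()
for _ in range(100):
    w_gd = grad_step(w_gd)
    w_sel = select_step(w_sel)
```

Both decrease the loss, but the gradient acts like a selection heuristic that already knows which candidate wins, so it doesn’t have to pay for blind proposals in high dimensions.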
This confusion comes about because natural selection has no mechanism to maintain variation. Equivalently, gradient descent can only work with the data provided; in other words, it has no “proposal” step like Gibbs sampling or other MCMC methods. So the idea that gradient descent and natural selection are the same feels intuitive to me.
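To spell out what a “proposal” step is (a minimal Metropolis–Hastings sketch on a toy target; Gibbs sampling has the same generate-then-accept shape): there is an explicit candidate-generating step and an explicit accept/reject step, neither of which appears in a plain gradient update.

```python
import math
import random

random.seed(0)

def log_density(x):
    # Toy target: standard normal, up to a constant.
    return -0.5 * x * x

x, samples = 0.0, []
for _ in range(20000):
    proposal = x + random.gauss(0, 1.0)          # proposal step
    log_accept = log_density(proposal) - log_density(x)
    if random.random() < math.exp(min(0.0, log_accept)):  # accept/reject step
        x = proposal
    samples.append(x)

mean = sum(samples) / len(samples)
```

A gradient update replaces both steps with a single deterministic move, which is exactly the structural difference at issue.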
It is also known that some models in evolutionary game theory recover Fisher’s fundamental theorem of natural selection by treating the replicator equation (a model of natural selection) as a gradient flow; see this arxiv paper. [Might have bungled the explanation on this one, so take with some salt.]
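For reference, the standard forms (my paraphrase of the textbook statements, not necessarily the linked paper’s exact setup): the replicator equation for type frequencies $x_i$ with fitnesses $f_i$ is

$$\dot{x}_i = x_i\big(f_i(x) - \bar{f}(x)\big), \qquad \bar{f}(x) = \sum_j x_j f_j(x).$$

For constant fitnesses, differentiating $\bar{f}$ along this flow gives Fisher’s fundamental theorem: mean fitness increases at a rate equal to the variance in fitness,

$$\frac{d\bar{f}}{dt} = \sum_i x_i \big(f_i - \bar{f}\big)^2 = \operatorname{Var}(f) \ge 0.$$

The gradient-flow connection is that when the fitnesses come from a potential, $f_i = \partial V / \partial x_i$, the replicator dynamics are the gradient flow of $V$ with respect to the Shahshahani metric on the simplex.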
I think it’s possible that gradient descent works by applying a selection pressure to preexisting circuits in the initial randomization, plus some fine-tuning. This would explain why most weights are zero after training, as well as stuff like the lottery ticket hypothesis.
As far as I know this is just false, though?