Why is it so often that analogies are drawn between natural selection and gradient descent in a machine learning context? They are both optimizing over a fitness function, but isn’t there an important difference in what they are optimizing over?
Natural selection is broadly optimizing over the architecture, initial parameters of the architecture, and the learning dynamics (how one updates the parameters of the architecture given data), which led to the architecture of the brain and methods of learning like STDP, in which the parameters of the architecture are the neurons of the brain.
Isn’t gradient descent instead what we pick to be the learning dynamics, where we then pick our architecture (e.g. transformer) and initial parameters (e.g. Xavier initialization), so actually it makes more sense to draw an analogy between gradient descent and the optimizer learnt by natural selection (STDP, etc.), as opposed to natural selection itself?
Though natural selection is a simple optimization process, the optimizer (learning dynamics) learnt by this process could be very complex, and so reasoning like ‘natural selection is simple so maybe the simplicity of gradient descent is sufficient’ is not very strong?