Not sure what you mean here. One of the best explanations of how neural networks get trained uses basically a pure natural selection lens, and I think it gets most predictions right:
CGP Grey “How AIs, like ChatGPT, Learn” https://www.youtube.com/watch?v=R9OHn5ZF4Uo
There is also a follow-up video that explains SGD:
CGP Grey “How AI, Like ChatGPT, *Really* Learns” https://www.youtube.com/watch?v=wvWpdrfoEv0
In general, I think if you use a natural-selection analogy you will get a huge number of things right about how AI works, though I agree not everything (it won't explain the difference between Adam and AdamW, but it will explain the difference between hierarchical bayesian networks, linear regression and modern deep learning).
Note: I just watched the videos. I personally would not recommend the first video as an explanation for a layperson if I wanted them to come away with accurate intuitions about how today's neural networks learn and how we optimize them. What it describes is a very different kind of optimizer, one explicitly patterned after natural selection, such as a genetic algorithm or population-based training, and the follow-up video more or less admits this. I would recommend these videos instead:
3Blue1Brown—Gradient descent, how neural networks learn
Emergent Garden—Watching Neural Networks Learn
WIRED—Computer Scientist Explains Machine Learning in 5 Levels of Difficulty
Except that selection and gradient descent are closely related mathematically: you have to make a bunch of simplifying assumptions, but 'mutate and select' (evolution) is actually equivalent to 'make a small approximate gradient step' (SGD) in the limit of small steps.
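To make the claimed limit concrete, here is a minimal runnable sketch (my own construction on a made-up toy loss, not the linked post's model): among small Gaussian mutations, the ones that survive selection average out to a step along the negative gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loss with a known gradient: L(w) = ||w||^2 / 2, so grad L(w) = w.
def loss(w):
    return 0.5 * np.dot(w, w)

w = rng.normal(size=10)
sigma = 1e-3  # small mutation scale, approximating the small-step limit

# "Mutate": sample many small Gaussian perturbations of the parameters.
mutations = rng.normal(scale=sigma, size=(100_000, w.size))
# "Select": keep only the mutants that improve (decrease) the loss.
cand_losses = 0.5 * np.einsum("ij,ij->i", w + mutations, w + mutations)
fitter = mutations[cand_losses < loss(w)]
# Average the surviving mutations into a single step.
step = fitter.mean(axis=0)

# The mean selected mutation points along -grad L(w) = -w.
cos = step @ (-w) / (np.linalg.norm(step) * np.linalg.norm(w))
print(cos)  # very close to 1
```

The components of the mutation orthogonal to the gradient average out to roughly zero under selection, which is why the mean surviving mutation recovers the gradient direction as the mutation scale shrinks.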
I read the post and left my thoughts in a comment. In short, I don’t think the claimed equivalence in the post is very meaningful.
(Which is not to say the two processes have no relationship whatsoever. But I am skeptical that it’s possible to draw a connection stronger than “they both do local optimization and involve randomness.”)
Awesome, I saw that comment—thanks, and I’ll try to reply to it in more detail.
It looks like you’re not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used? From a skim, the caveats you raised are mostly/all caveated in the original post too—though I think you may have missed the (less rigorous but more realistic!) second model at the end, which departs from the simple annealing process to a more involved population process.
I think even on this basis, though, it's going too far to claim that the best we can say is "they both do local optimization and involve randomness"! The steps are systematically pointed up/down the local fitness gradient, for one. And they're driven by a sample-based stochastic realisation, for another.
I don’t want you to get the impression I’m asking for too much from this analogy. But the analogy is undeniably there. In fact, in those explainer videos Habryka linked, the particular evolution described is a near-match for my first model (in which, yes, it departs from natural genetic evolution in the same ways).
It looks like you're not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used?
I'm disputing both. Re: math, the noise in your model isn't distributed like SGD noise, and unlike SGD the step size depends on the gradient norm. (I know you did mention the latter issue, but IMO it rules out calling this an "equivalence.")
I did see your second proposal, but it was a mostly-verbal sketch that I found hard to follow, and which I don’t feel like I can trust without seeing a mathematical presentation.
(FWIW, if we have a population that’s “spread out” over some region of a high-dim NN loss landscape—even if it’s initially a small / infinitesimal region—I expect it to quickly split up into lots of disjoint “tendrils,” something like dye spreading in water. Consider what happens e.g. at saddle points. So the population will rapidly “speciate” and look like an ensemble of GD trajectories instead of just one.
If your model assumes by fiat that this can’t happen, I don’t think it’s relevant to training NNs with SGD.)
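For what it's worth, the saddle-point picture is easy to see in a toy simulation (my own sketch; the loss f(x, y) = x^2 - y^2 and all parameters are made up): a mutate-and-select population started in a tiny region near the saddle stays close in x but runs off along y, typically in both directions at once, i.e. it splits into two "tendrils".

```python
import numpy as np

rng = np.random.default_rng(0)

def f(p):
    # saddle surface: selection favors small x^2 and large y^2
    return p[:, 0] ** 2 - p[:, 1] ** 2

pop = rng.normal(scale=0.01, size=(200, 2))  # near-point-like initial population
sigma = 0.05  # mutation scale

for _ in range(40):
    # keep the fitter half (lowest f); each survivor leaves two mutated offspring
    survivors = pop[np.argsort(f(pop))[:100]]
    pop = np.repeat(survivors, 2, axis=0) + rng.normal(scale=sigma, size=(200, 2))

x_typ = np.abs(pop[:, 0]).mean()
y_typ = np.abs(pop[:, 1]).mean()
print(x_typ, y_typ)  # typical |y| ends up much larger than typical |x|
```

Mutation keeps x in a narrow band around the saddle while selection drives |y| outward generation after generation, which is the "dye spreading in water" behavior described above.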
Wait, you think that a model which doesn't speciate isn't relevant to SGD? I'll need help following, unless you meant something else. It seems like speciation is one of the places where natural evolution distinguishes itself from gradient descent, but you seem to be making this point too?
In the second model, we retrieve non-speciation by allowing for crossover/horizontal transfer, and yes, essentially by fiat I rule out speciation (as a consequence of the ‘eventually-universal mixing’ assumption). In real natural selection, even with horizontal transfer, you get speciation, albeit rarely. It’s obviously a fascinating topic, but I think pretty irrelevant to this analogy.
For me, the step-size thing is interesting but essentially a minor detail. Any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs … is relevant to the conclusions we want to draw? (Serious question; my best guess is ‘no’, but I hold that medium-lightly.)
(A nitpick my preceding paragraph makes irrelevant, but) FWIW vanilla SGD does depend on gradient norm. [ETA: I think I misunderstood exactly what you meant by 'step size depends on the gradient norm', so I think we agree about the facts of SGD. But now consider the space including SGD, RMSProp, etc. The 'depends on gradient norm' piece which arises from my evolution model seems entirely at home in that family.]
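To illustrate that family point concretely (a sketch I am adding, not from the thread): a vanilla SGD step scales with the gradient norm, while an RMSProp-style per-coordinate normalization largely removes that dependence.

```python
import numpy as np

lr, eps = 0.1, 1e-8
sgd_norms, rms_norms = [], []

for g in (np.array([0.01, 0.0]), np.array([10.0, 0.0])):  # tiny vs huge gradient
    sgd_step = lr * g                       # step length proportional to ||g||
    v = g * g                               # RMSProp-style second-moment estimate
                                            # (single step, no history, for clarity)
    rms_step = lr * g / (np.sqrt(v) + eps)  # per-coordinate normalization
    sgd_norms.append(np.linalg.norm(sgd_step))
    rms_norms.append(np.linalg.norm(rms_step))

print(sgd_norms)  # the two SGD steps differ by a factor of 1000
print(rms_norms)  # both RMSProp-style steps are approximately lr
```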
On the distribution of noise, I’ll happily acknowledge that I didn’t show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.
I agree that they are related. In the context of this discussion, the critical difference between SGD and evolution is somewhat captured by your Assumption 1:
Fixed 'fitness function' or objective function mapping genome to continuous 'fitness score'
Evolution does not directly select/optimize the content of minds. Evolution selects/optimizes genomes based (in part) on how they distally shape what minds learn and what minds do (to the extent that impacts reproduction), with even more indirection caused by selection's heavy dependence on the environment. All of that creates a ton of optimization "slack", such that large-brained human minds with language could steer optimization far faster & more decisively than natural selection could. This is what 1a3orn was pointing to earlier with
evolution does not grow minds, it grows hyperparameters for minds. When you look at the actual process for how we actually start to like ice-cream—namely, we eat it, and then we get a reward, and that's why we like it—then the world looks a lot less hostile, and misalignment a lot less likely.
SGD does not have that slack by default. It acts directly on cognitive content (associations, reflexes, decision-weights), without slack or added indirection. If you control the training dataset/environment, you control what is rewarded and what is penalized, and if you are using SGD, then this lets you directly mold the circuits in the model’s “brain” as desired. That is one of the main alignment-relevant intuitions that gets lost when blurring the evolution/SGD distinction.
Right. And in the context of these explainer videos, the particular evolution described has the properties which make it near-equivalent to SGD, I’d say?
SGD does not have that slack by default. It acts directly on cognitive content (associations, reflexes, decision-weights), without slack or added indirection. If you control the training dataset/environment, you control what is rewarded and what is penalized, and if you are using SGD, then this lets you directly mold the circuits in the model's "brain" as desired.
Hmmm, this strikes me as much too strong (especially 'this lets you directly mold the circuits').
Remember also that with RLHF, we’re learning a reward model which is something like the more-hardcoded bits of brain-stuff, which is in turn providing updates to the actually-acting artefact, which is something like the more-flexibly-learned bits of brain-stuff.
I also think there’s a fair alternative analogy to be drawn like
evolution of genome (including mostly-hard-coded brain-stuff) ~ SGD (perhaps +PBT) of NN weights
within-lifetime-learning of organism ~ in-context something-something of NN
(this is one analogy I commonly drew before RLHF came along.)
So, look, the analogies are loose, but they aren’t baseless.
It won't explain the difference between Adam and AdamW, but it will explain the difference between hierarchical bayesian networks, linear regression and modern deep learning
Source?
CGP Grey’s video is a decent example source. Most of the differences between hierarchical bayesian networks and modern deep learning come across pretty well if you model the latter as a type of genetic algorithm search:
The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don’t know exist.
Training consists of a huge amount of trial and error where you take datapoints, predict something about the result, then search for nearby modifications that do better, then repeat until performance plateaus.
You are ultimately doing a local search, which means you can get stuck at local minima unless you do something like increase your step size or increase the mutation rate.
There are also genuinely deep similarities. Vanilla SGD is perfectly equivalent to a genetic search with an infinitesimally small mutation size and infinite samples per generation (I could make a proof here but won't unless someone is interested in it). Indeed, in one of my ML classes at Berkeley, genetic algorithms were suggested as an obvious generalization of SGD for non-differentiable loss landscapes: you just try some mutations, see which one performs best, and then modify your parameters in that direction.
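A runnable toy version of that classroom recipe (my own sketch; the quadratic loss and all constants are invented for illustration): "try some mutations, move toward the best one" tracks a normalized gradient step on a smooth loss.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.diag([1.0, 3.0, 0.5, 2.0])  # toy quadratic loss: L(w) = w.T A w / 2

def loss(w):
    return 0.5 * w @ A @ w

def genetic_step(w, lr=0.1, pop=512, sigma=1e-3):
    """Try `pop` random mutations, then move in the direction of the best one."""
    m = rng.normal(scale=sigma, size=(pop, w.size))
    best = m[np.argmin([loss(w + mi) for mi in m])]
    return w + lr * best / np.linalg.norm(best)

def sgd_step(w, lr=0.1):
    g = A @ w  # exact gradient of the quadratic
    return w - lr * g / np.linalg.norm(g)  # normalized, to match step lengths

wg, ws = np.ones(4), np.ones(4)
for _ in range(50):
    wg, ws = genetic_step(wg), sgd_step(ws)

# both optimizers end far below the starting loss of 3.25
print(loss(wg), loss(ws))
```

With a small mutation scale, the best-of-population mutation is strongly anti-aligned with the gradient, so the two trajectories descend at similar rates; growing the population and shrinking the mutations tightens the match.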
Vanilla SGD is perfectly equivalent to a genetic search with an infinitesimally small mutation size and infinite samples per generation (I could make a proof here but won't unless someone is interested in it)
Oh, I actually did that a year or so ago.
Three observations:
A. If you think that people's genes would be a lot fitter if people cared about fitness more, then surely there's a good chance that a more efficient version of natural selection would lead to people caring more about fitness.
B. You might, on the other hand, think that the problem is more related to feedback loops. I.e. if you're the smartest monkey, you can spend your time scheming to have all the babies. If there are many smart monkeys, you have to spend a lot of time worrying about what the other monkeys think of you. If this is how you're worried misalignment will arise, then I think "how do deep learning models generalise?" is the wrong tree to bark up.
C. If people did care about fitness, would Yudkowsky not say "instrumental convergence! Reward hacking!"? I'd even be inclined to grant he had a point.