I agree that if you knew nothing about DL you’d be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.
I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that is, you’d be better off deferring to local knowledge about DL than to the analogy.
Or, what’s more to the point—I think you’d better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.
Combining some of your and Habryka’s comments, which seem similar.
The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don’t know exist.
It’s true that the structure of the solution is discovered and complex, but the ontology of the solution for DL (at least in currently used architectures) is quite opinionated towards shallow circuits with relatively few serial ops. This is different from the bias for evolution, which is fine with a mutation that leads to 10^7 serial ops if its metabolic costs are low. So the resemblance seems shallow other than “solutions can be complex.” I think to the degree that you defer to this belief rather than more specific beliefs about the inductive biases of DL you’re probably just wrong.
There’s a mostly unimodal and broad peak for optimal learning rate, just like for optimal mutation rate
As far as I know, the optimal learning rate for most architectures is scheduled and decreases over time, which is not a feature of evolution so far as I am aware. Again, the local knowledge is what you should defer to.
You are ultimately doing a local search, which means you can get stuck at local minima, unless you do something like increase your step size or increase the mutation rate
Is this a prediction that a cyclic learning rate—that goes up and down—will work out better than a decreasing one? If so, that seems false, as far as I know.
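For concreteness, the two kinds of schedule being contrasted here can be sketched as below. This is a minimal illustration, not anything either commenter wrote: the function names, the cosine shape of the decay, the triangular shape of the cycle, and the default `lr_max`/`lr_min` values are all assumptions chosen for the example.

```python
import math

def cosine_decay_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Decreasing schedule: cosine decay from lr_max down to lr_min."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def triangular_cyclic_lr(step, cycle_len, lr_max=1e-3, lr_min=1e-5):
    """Cyclic schedule: linear ramp up, then down, repeating every cycle_len steps."""
    pos = (step % cycle_len) / cycle_len   # position within the cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * pos - 1.0)       # goes 0 -> 1 -> 0 over one cycle
    return lr_min + (lr_max - lr_min) * tri

if __name__ == "__main__":
    total = 1000
    for s in (0, 250, 500, 750, 1000):
        print(s, cosine_decay_lr(s, total), triangular_cyclic_lr(s, cycle_len=200))
```

Which of the two trains a given model better is an empirical question that this sketch does not answer; it only pins down what “decreasing” and “goes up and down” mean.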
Grokking/punctuated equilibrium: in some circumstances applying the same algorithm for 100 timesteps causes much larger changes in model behavior / organism physiology than in other circumstances
As far as I know, grokking is a non-central example of how DL works, and in evolution punctuated equilibrium is a result of the non-i.i.d. nature of the task, which is again a different underlying mechanism from DL. If you apply DL to non-i.i.d. problems then you don’t get grokking, you just get a broken solution. This seems to round off to “Sometimes things change faster than others,” which is certainly true but not predictively useful, or in any event not a prediction that you couldn’t get from other places.
Like, leaving these to the side—I think the ability to post-hoc fit something is questionable evidence that it has useful predictive power. I think the ability to actually predict something else means that it has useful predictive power.
Again, let’s take “the brain” as an example of something to which you could analogize DL.
There are multiple times that people have cited the brain as an inspiration for a feature in current neural nets or RL. CNNs, obviously; the hippocampus and experience replay; randomization for adversarial robustness. You can match up interventions that cause learning deficiencies in brains to similar deficiencies in neural networks. There are verifiable, non-post hoc examples of brains being useful for understanding DL.
As far as I know (you can tell me if there are contrary examples), there are clearly more cases where inspiration from the brain advanced DL or contributed to DL understanding than cases where inspiration from evolution did. (I’m aware of zero of the latter, but there could be some.) Therefore it seems much more reasonable to analogize from the brain to DL, and to defer to it as your model.
I think in many cases it’s a bad idea to analogize from the brain to DL! They’re quite different systems.
But they’re more similar than evolution and DL, and if you’d not trust the brain to guide your analogical a-theoretic low-confidence inferences about DL, then it makes more sense to not trust evolution for the same.
FWIW my take is that the evolution-ML analogy is generally a very excellent analogy, with a bunch of predictive power, but worth using carefully and sparingly. Agreed that sufficient detail on e.g. DL specifics can screen off the usefulness of the analogy, but it’s very unclear whether we have sufficient detail yet. The evolution analogy was originally supposed to point out that selecting a bunch for success on thing-X doesn’t necessarily produce thing-X-wanters (which is obviously true, but apparently not obvious enough to always be accepted without providing an example).
I think you’d better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.
Not sure where to land on that. It seems like both are good analogies? Brains might not be using gradients at all[1], whereas evolution basically is. But brains are definitely doing something like temporal-difference learning, and the overall ‘serial depth’ thing is also weakly in favour of brains ~= DL vs genomes+selection ~= DL.
I’d love to know what you’re referring to by this:
evolution… is fine with a mutation that leads to 10^7 serial ops if its metabolic costs are low.
Also,
Is this a prediction that a cyclic learning rate—that goes up and down—will work out better than a decreasing one? If so, that seems false, as far as I know.
I think the jury is still out on this, but there’s literature on it (probably much more I haven’t fished out). [EDIT: also see this comment which has some other examples]
[1] AFAIK there’s no evidence of this and it would be somewhat surprising to find it playing a major role. Then again, I also wouldn’t be surprised if it turned out that brains are doing something which is secretly sort of equivalent to gradient descent.
I’m genuinely surprised at the “brains might not be doing gradients at all” take; my understanding is they are probably doing something equivalent.
Similarly this kind of paper points in the direction of LLMs doing something like brains. My active expectation is that there will be a lot more papers like this in the future.
But to be clear, my overall view of the similarity of brains to DL is fueled less by these specific papers, which are nice gravy for my view but not its actual foundation, and much more by what I see as the predictive power of hypotheses like this, which are massively more impressive inasmuch as they were made before Transformers had been invented. Given Transformers, the comparison seems overdetermined; I wish I had seen that way back in 2015.
Re. serial ops and priors: I need to pin down the comparison more, given that it’s mostly about the serial depth thing, and I think you already get it. The base idea is that what is “simple” to mutations and what is “simple” to DL are extremely different. Fuzzily: a mutation alters protein-folding instructions, and is indifferent to the “computational costs” of working this out in reality; if you tried to work out the analytic gradient for the mutation (the gradient over mutation → protein folding → different brain → different reward → competitors’ children look yummy → eat ’em) your computer would explode. But DL seeks only a solution that can be computed in a big ensemble of extremely short circuits, learned almost entirely because of the data on which you’ve trained. Ergo DL has very different biases: “complexity” for mutations probably has to do with instruction length, whereas “complexity” for DL is more related to how far you are from whatever biases are ingrained in the data (<-- this is fuzzy), and the shortcut solutions DL learns are always implied by the data.
So when you try to transfer intuitions about the “kind of solution” you get from evolution (which ignores this serial depth cost) to DL (which is enormously about this serial depth cost), the intuition breaks. As far as I can tell that’s why we have this immense search for mesaoptimizers and stuff, which seems like it’s mostly just barking up the wrong tree to me. I dunno; I’d refine this more but I need to actually work.
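One way to make “serial depth” concrete is the toy contrast below. It is an illustration under stated assumptions, not the commenter’s own formalization: `mlp_forward` and `loop_until_small` are hypothetical names, and the shrinking loop merely stands in for any computation whose number of sequential steps is not capped by an architecture.

```python
def relu_layer(weights, h):
    """One layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, h))) for row in weights]

def mlp_forward(x, layers):
    """Feed-forward pass; the serial step count is fixed by the architecture."""
    h, serial_steps = x, 0
    for weights in layers:
        h = relu_layer(weights, h)
        serial_steps += 1
    return h, serial_steps

def loop_until_small(x, shrink=0.9, tol=1e-3):
    """Open-ended loop; the serial step count depends on the input."""
    serial_steps = 0
    while abs(x) > tol:
        x *= shrink
        serial_steps += 1
    return x, serial_steps

if __name__ == "__main__":
    layers = [[[0.1] * 4 for _ in range(4)] for _ in range(3)]
    for scale in (1.0, 1000.0):
        _, net_steps = mlp_forward([scale] * 4, layers)
        _, loop_steps = loop_until_small(scale)
        print(scale, net_steps, loop_steps)
```

The feed-forward pass takes the same number of sequential steps whatever the input, while the loop’s step count grows with the input. The claim in the comment is that gradient-based training of current architectures only ever searches over the first kind of computation.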
Re. cyclic learning rates: Both of us are too nervous about the theory --> practice junction to make a call on how all this transfers to useful algos (although my bet is that it won’t). But if we’re reluctant to infer from this, how much more reluctant should we be to infer from evolution?
Mm, thanks for those resource links! OK, I think we’re mostly on the same page about what particulars can and can’t be said about these analogies at this point. I conclude that both ‘mutation+selection’ and ‘brain’ remain useful, having both is better than having only one, and care needs to be taken in any case!
As I said,
I also wouldn’t be surprised if it turned out that brains are doing something which is secretly sort of equivalent to gradient descent
so I’m looking forward to reading those links.
Runtime optimisation/search and whatnot remain (broadly-construed) a sensible concern from my POV, though I wouldn’t necessarily (at first) look literally inside NN weights to find them. I think more likely some scaffolding is needed, if that makes sense (I think I am somewhat idiosyncratic in this)? I get fuzzy at this point and am still actively (slowly) building my picture of this—perhaps your resource links will provide me fuel here.
Not sure where to land on that. It seems like both are good analogies? Brains might not be using gradients at all[1], whereas evolution basically is.
I mean, does it matter? What if it turns out that gradient descent itself doesn’t affect inductive biases as much as the parameter->function mapping? If implicit regularization (e.g. SGD) isn’t an important part of the generalization story in deep learning, will you down-update on the appropriateness of the evolution/AI analogy?
Is this a prediction that a cyclic learning rate—that goes up and down—will work out better than a decreasing one? If so, that seems false, as far as I know.
https://www.youtube.com/watch?v=GM6XPEQbkS4 (talk) / https://arxiv.org/abs/2307.06324 prove faster convergence with a periodic learning rate, though on a specific ‘nicer’ space than reality, and (from what I remember) the comparison is against a good bound for a constant stepsize of 1.
So it may be one of those papers that applies in theory but not often in practice, but I think it is somewhat indicative.
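To show the flavour of “periodic step sizes can beat a constant one”, here is a toy sketch on a diagonal quadratic. It is not the construction from the linked paper: the function `run_gd`, the curvature values, and the hand-picked pattern `[1.0, 3.0]` are all assumptions made for this example, and the curvatures were chosen so that the occasional long step helps.

```python
def run_gd(curvatures, step_pattern, n_steps, x0=1.0):
    """Gradient descent on f(x) = 0.5 * sum(c_i * x_i**2), cycling through step_pattern.

    Step sizes are expressed in units of 1/L, where L = max(curvatures).
    Returns the final loss.
    """
    L = max(curvatures)
    x = [x0 for _ in curvatures]
    for t in range(n_steps):
        h = step_pattern[t % len(step_pattern)] / L
        # The gradient along coordinate i is c_i * x_i.
        x = [xi - h * c * xi for xi, c in zip(x, curvatures)]
    return 0.5 * sum(c * xi * xi for c, xi in zip(curvatures, x))

if __name__ == "__main__":
    curvatures = [0.01, 0.1, 0.5, 1.0]   # hypothetical, hand-picked
    constant = run_gd(curvatures, [1.0], n_steps=100)
    periodic = run_gd(curvatures, [1.0, 3.0], n_steps=100)
    print("constant step:", constant)
    print("periodic step:", periodic)
```

On this instance the periodic pattern ends with the lower loss, because the long step speeds up the low-curvature direction that dominates the remaining error. With other curvatures (for example one near 0.75 of the maximum) the long step slows that coordinate down, so this says nothing about whether the effect transfers to real training.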
I think the ability to post-hoc fit something is questionable evidence that it has useful predictive power. I think the ability to actually predict something else means that it has useful predictive power.
It’s always trickier to reason about post-hoc, but some of the observations could be valid, non-cherry-picked parallels between evolution and deep learning that predict further parallels.
I think looking at which analogy inspired more DL capabilities advances is not a perfect methodology either. It looks like evolution predicts only general facts whereas the brain also inspires architectural choices. Architectural choices are publishable research whereas general facts are not, so it’s plausible that evolution analogies are decent for prediction and bad for capabilities. Don’t have time to think this through further unless you want to engage.
One more thought on learning rates and mutation rates:
As far as I know, the optimal learning rate for most architectures is scheduled and decreases over time, which is not a feature of evolution so far as I am aware.
This feels consistent with evolution, and I actually feel like someone clever could have predicted it in advance. Mutation rate per nucleotide is generally lower and generation times are longer in more complex organisms; this is evidence that lower genetic divergence rates are optimal, because evolution can tune them through e.g. DNA repair mechanisms. So it stands to reason that if models get more complex during training, their learning rate should go down.
Does anyone know if decreasing learning rate is optimal even when model complexity doesn’t increase over time?