it’s not fair to treat all seconds of life as equally influential/important for learning
I agree and didn’t mean to imply otherwise.
In terms of what we’re discussing here, I think it’s worth noting that there’s a big overlap between “sensitive windows in such-and-such part of the cortex” and “the time period when the data is external not synthetic”.
Any predictions?
I dunno….
O’Reilly (1,2) simulated visual cortex development, and found that their learning algorithm flailed around and didn’t learn anything unless they set it up to learn the “where” pathway first (with the “what” pathway disconnected), and only connected up the “what” pathway after training of the “where” pathway had converged to a good model. (And they say there’s biological evidence for this.) (They didn’t have any retinal waves, just “real” data.)
As that example illustrates, there’s always a risk that a randomly-initialized model won’t converge to a good model upon training, thanks to a bad draw of the random seed. I imagine that there are various “tricks” that reduce the odds of this problem occurring, i.e. that make the loss landscape less bumpy, or something vaguely analogous to that. O’Reilly’s “carefully choreographed (and region-dependent) learning rates” is one such trick. I’m very open-minded to the possibility that “carefully choreographed synthetic data” is another such trick.
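To make the “choreographed, region-dependent learning rates” idea a bit more concrete, here is a minimal sketch of a two-stage schedule in Python. This is not O’Reilly’s code or method. The class name `StagedSchedule`, the parameters `tol` and `patience`, and the “loss has stopped improving” convergence test are all assumptions I made up for illustration. The only thing it captures is the gating: the “what” pathway gets a learning rate of zero until the “where” pathway looks converged.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StagedSchedule:
    """Hypothetical two-stage schedule: 'where' trains first, 'what' is gated."""

    where_lr: float = 0.1   # assumed learning rate for the 'where' pathway
    what_lr: float = 0.1    # assumed learning rate for the 'what' pathway
    tol: float = 1e-3       # assumed threshold: loss change below this counts as flat
    patience: int = 3       # assumed number of consecutive flat steps required
    what_connected: bool = False
    _flat_steps: int = 0
    _last_loss: Optional[float] = None

    def update(self, where_loss: float) -> Dict[str, float]:
        """Record the latest 'where' loss and return per-pathway learning rates."""
        if not self.what_connected:
            is_flat = (
                self._last_loss is not None
                and abs(self._last_loss - where_loss) < self.tol
            )
            # Count consecutive steps with almost no change in the 'where' loss.
            self._flat_steps = self._flat_steps + 1 if is_flat else 0
            self._last_loss = where_loss
            if self._flat_steps >= self.patience:
                # 'Where' looks converged: connect the 'what' pathway for good.
                self.what_connected = True
        return {
            "where": self.where_lr,
            "what": self.what_lr if self.what_connected else 0.0,
        }
```

In a training loop you would call `update(where_loss)` once per step and hand the returned rates to whatever optimizer you are using. Once the “what” pathway is connected it stays connected, even if the “where” loss starts moving again, which is a design choice of this sketch rather than a claim about the original simulations.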
Anyway, I don’t particularly object to the idea “synthetic data is useful, and plausibly if you take an existing organism and remove its synthetic data it would get messed up”. I was objecting instead to the idea “synthetic data is a major reason for the performance gap between brains and deep RL, and thus maybe with the right synthetic data pre-training, deep RL would perform as well as brains”. I think the overwhelming majority of training of human brains involves real data: newborns don’t have object permanence or language or conceptual reasoning or anything like that, and presumably they build all those things out of a diet of actual, not synthetic, data. And even if you think that brains and deep RL both learn by gradient descent, the inference algorithm is clearly different (e.g. brains use analysis-by-synthesis), and the architectures are clearly different (e.g. brains are full of pairs of neurons where each projects to the other, whereas deep neural nets almost never have that). These are two fundamental differences that persist for the entire lifetime / duration of training, unlike synthetic data, which only appears near the start. Also, the ML community has already explored things like deep neural net weight initialization and curriculum learning plenty. I would just be very surprised if massive transformative performance improvements (like a big fraction of the difference between where we are and AGI) could come out of those kinds of investigation, as opposed to coming out of different architectures and learning algorithms and training data.
Thanks!! :-)