I only read the prompt.
But I want to say: that much compute would be useful for meta-learning/NAS/AIGAs, not just scaling up DNNs. I think that would likely be a more productive research direction. And I want to make sure that people are not ONLY imagining bigger DNNs when they imagine having a bunch more compute, but also imagining how it could be used to drive fundamental advances in ML algos, which could plausibly kick off something like recursive self-improvement (even if DNNs are in some sense a dead end).
Something I’m wondering, but don’t have the expertise in meta-learning to say confidently (so, epistemic status: speculation, and I’m curious for critiques): extra OOMs of compute could overcome (at least) one big bottleneck in meta-learning, the expense of computing second-order gradients. My understanding is that most methods just ignore these terms or use crude approximations, like this, because they’re so expensive. But at least this paper found some pretty impressive performance gains from using the second-order terms.
Maybe throwing lots of compute at this aspect of meta-learning would help it cross a threshold of viability, like what happened for deep learning in general around 2012. I think meta-learning is a case where we should expect second-order info to be very relevant to optimizing the loss function in question, not just a way of incorporating the loss function’s curvature. In the first paper I linked, the second-order term accounts for how the base learner’s gradients depend on the meta-learner’s parameters. This seems like an important feature of what their meta-learner is trying/supposed to do, i.e., use the meta-learned update rule to guide the base learner—and the performance gains in the second paper are evidence of this. (Not all meta-learners have this structure, though, and MAML apparently doesn’t get much better when you use Hessians. Hence my lack of confidence in this story.)
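To make the second-order point concrete, here's a toy scalar sketch (my own illustration, not from either linked paper) of a MAML-style setup: the meta-gradient of the post-update validation loss with respect to the initialization contains a Hessian term, and the common first-order approximation just drops it. All the names and the quadratic losses here are made up for illustration.

```python
# Toy MAML-style example in one dimension.
# Train loss: 0.5*(theta - A)^2, validation loss: 0.5*(theta - B)^2.
# The exact meta-gradient w.r.t. the initialization theta0 is
#   L_val'(theta1) * (1 - ALPHA * L_train''(theta0)),
# where the Hessian factor is the "second-order term"; first-order
# methods (e.g. FOMAML-style approximations) drop it.

A, B = 2.0, 3.0   # toy train / validation targets (assumed values)
ALPHA = 0.5       # inner-loop learning rate

def train_grad(theta):   # derivative of 0.5*(theta - A)^2
    return theta - A

def val_loss(theta):
    return 0.5 * (theta - B) ** 2

def val_grad(theta):     # derivative of 0.5*(theta - B)^2
    return theta - B

def inner_update(theta0):
    # One SGD step of the base learner on the train loss.
    return theta0 - ALPHA * train_grad(theta0)

def meta_grad_exact(theta0):
    # Chain rule through the inner update; the train-loss Hessian
    # is 1 for this quadratic, so the correction factor is (1 - ALPHA).
    theta1 = inner_update(theta0)
    return val_grad(theta1) * (1.0 - ALPHA * 1.0)

def meta_grad_first_order(theta0):
    # First-order approximation: ignore how the inner gradient
    # depends on theta0, i.e. drop the Hessian term.
    return val_grad(inner_update(theta0))

theta0 = 0.0
eps = 1e-6
# Finite-difference check of the exact meta-gradient.
numeric = (val_loss(inner_update(theta0 + eps))
           - val_loss(inner_update(theta0 - eps))) / (2 * eps)

print(meta_grad_exact(theta0))        # -1.0, matches the numeric check
print(meta_grad_first_order(theta0))  # -2.0, off by the dropped Hessian factor
print(numeric)
```

Even in this trivial case the first-order estimate is off by a factor of two; in real meta-learners the Hessian-vector products are what make the exact version expensive, which is where extra compute could plausibly matter.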
It weirds me out how little NAS (Neural Architecture Search) in particular (and throwing compute at architecture search in general) is used in industry.
Interesting, could you elaborate? I’d love to have a nice, fleshed-out answer along those lines to add to the five I came up with. :)
Sure, but in what way?
Also I’d be happy to do a quick video chat if that would help (PM me).
Well, I’ve got five tentative answers to Question One in this post. Roughly, they are: Souped-up AlphaStar, Souped-up GPT, Evolution Lite, Engineering Simulation, and Emulation Lite—five different research programs, basically. It sounds like what you are talking about is sufficiently different from these five, and also sufficiently promising/powerful/‘fun’, that it would be a worthy addition to the list. So, to flesh it out, maybe you could say something like: “Here are some examples of meta-learning/NAS/AIGA in practice today. Here’s a sketch of what you could do if you scaled all this up +12 OOMs. Here’s some argument for why this would be really powerful.”
There’s a ton of work in meta-learning, including Neural Architecture Search (NAS). AIGAs (Clune) is a paper that argues for a POV similar to the one I would describe here, so I’d check that out.
I’ll just say “why it would be powerful”: the promise of meta-learning is that—just like learned features outperform engineered features—learned learning algorithms will eventually outperform engineered learning algorithms. Taking the analogy seriously would suggest that the performance gap will be large—a quantitative step-change.
The upper limit we should anchor on is fully automated research. This helps illustrate how powerful this could be, since automating research could easily give many orders of magnitude speed up (e.g. just consider the physical limitation of humans manually inputting information about what experiment to run).
An important underlying question is how much room there is for improvement over current techniques. The idea that current DL techniques are pretty close to perfect (i.e. we’ve uncovered the fundamental principles of efficient learning (associated view: …and maybe DNNs are a good model of the brain)) seems too often implicit in some of the discussions around forecasting and scaling. I think it’s a real possibility, but I think it’s fairly unlikely (~15%, OTTMH). The main evidence for it is that 99% of published improvements don’t seem to make much difference in practice/at-scale.
Assuming that current methods are roughly optimal has two important implications:
- no new fundamental breakthroughs needed for AGI (faster timelines)
- no possible acceleration from fundamental algorithmic breakthroughs (slower timelines)