This post provides a valuable reframing of a common question in futurology: “here’s an effect I’m interested in—what sorts of things could cause it?”
That style of reasoning ends by postulating causes. But causes have a life of their own: they don’t just cause the one effect you’re interested in, through the one causal pathway you were thinking about. They do all kinds of things.
In the case of AI and compute, it’s common to ask:
Here’s a hypothetical AI technology. How much compute would it require?
But once we have an answer to this question, we can always ask:
Here’s how much compute you have. What kind of AI could you build with it?
If you’ve asked the first question, you ought to ask the second one, too.
The first question includes a hidden assumption: that the imagined technology is a reasonable use of the resources it would take to build. This isn’t always true: given those resources, there may be easier ways to accomplish the same thing, or better versions of that thing that are equally feasible. These facts are much easier to see when you fix a given resource level, and ask yourself what kinds of things you could do with it.
This high-level point seems like an important contribution to the AI forecasting conversation. The impetus to ask “what does future compute enable?” rather than “how much compute might TAI require?” influenced my own view of Bio Anchors, an influence that’s visible in the contrarian summary at the start of this post.
I find the specific examples much less convincing than the higher-level point.
For the most part, the examples don’t demonstrate that you could accomplish any particular outcome by applying more compute. Instead, they simply restate the idea that more compute is being used.
They describe inputs, not outcomes. The reader is expected to supply the missing inference: “wow, I guess if we put those big numbers in, we’d probably get magical results out.” But this inference is exactly what the examples ought to be illustrating. We already know we’re putting in +12 OOMs; the question is what we get out, in return.
This is easiest to see with Skunkworks, which amounts to: “using 12 OOMs more compute in engineering simulations, with 6 OOMs allocated to the simulations themselves, and the other 6 to evolutionary search.” Okay—and then what? What outcomes does this unlock?
We could replace the entire Skunkworks example with the sentence “+12 OOMs would be useful for engineering simulations, presumably?” We don’t even need to mention that evolutionary search might be involved, since (as the text notes) evolutionary search is one of the tools subsumed under the category “engineering simulations.”
Amp suffers from the same problem. It includes two sequential phases:
Training a scaled-up, instruction-tuned GPT-3.
Doing an evolutionary search over “prompt programs” for the resulting model.
Each of the two steps takes about 1e34 FLOP, so we don’t get the second step “for free” by spending extra compute that went unused in the first. We’re simply training a big model, and then doing a second big project that takes the same amount of compute as training the model.
We could also do the same evolutionary search project in our world, with GPT-3. Why haven’t we? It would be smaller-scale, of course, just as GPT-3 is smaller scale than “GPT-7” (but GPT-3 was worth doing!).
With GPT-3’s budget of 3.14e23 FLOP, we could do a GPT-3 variant of Amp with, for example (rough arithmetic sketched below):
10000 evaluations or “1 subjective day” per run (vs “3 subjective years”)
population and step count ~1600 (vs ~50000), or two different values for population and step count whose product is 1600^2
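For concreteness, here is a minimal sketch of the arithmetic behind those numbers, assuming the search cost factors as population × steps × evaluations per run × FLOP per evaluation. The per-evaluation costs are placeholder assumptions chosen only so the totals land near the two stated budgets; they aren’t figures from the post.

```python
# Sketch of the budget arithmetic, under the simplifying assumption that
# the cost of the evolutionary search factors as
#   population * steps * evaluations_per_run * FLOP_per_evaluation.
# The FLOP-per-evaluation values are placeholders chosen so the totals
# come out near the stated budgets; they are not numbers from the post.

def search_cost(population, steps, evals_per_run, flop_per_eval):
    """Total FLOP for an evolutionary search under the factored cost model."""
    return population * steps * evals_per_run * flop_per_eval

# Amp(GPT-7)-scale search: ~50,000 population/steps, ~1e8 evaluations per run.
amp_total = search_cost(50_000, 50_000, 1e8, 4e16)    # placeholder 4e16 FLOP/eval
# Scaled-down GPT-3 variant: ~1,600 population/steps, ~1e4 evaluations per run.
gpt3_total = search_cost(1_600, 1_600, 1e4, 1.2e13)   # placeholder 1.2e13 FLOP/eval

print(f"Amp-scale search:   ~{amp_total:.1e} FLOP (vs ~1e34 budget)")
print(f"GPT-3-scale search: ~{gpt3_total:.1e} FLOP (vs ~3.14e23 budget)")
```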
100,000,000 evaluations per run (Amp) sure sounds like a lot, but then, so does 10000 (above). Is 1600 steps “not enough”? Not enough for what? (For that matter, is 50000 steps even “enough” for whatever outcome we are interested in?)
The numbers sound intuitively big, but they have no sense of scale, because we don’t know how they relate to outcomes. What do we get in return for doing 50000 steps instead of 1600, or 1e8 function evaluations instead of 1e5? What capabilities do we expect out of Amp? How does the compute investment cause those capabilities?
The question “What could you do with +12 OOMs of Compute?” is an important one, and this post deserves credit for raising it.
The concrete examples of “fun” are too fun for their own good. They’re focused on sounding cool and big, not on accomplishing anything. Little would be lost if they were replaced with the sentence “we could dramatically scale up LMs, game-playing RL, artificial life, engineering simulations, and brain simulations.”
Answering the question in a less “fun,” more outcomes-focused manner sounds like a valuable exercise, and I’d love to read a post like that.
Thanks for this thoughtful review! Below are my thoughts:
--I agree that this post contributes to the forecasting discussion in the way you mention. However, that’s not the main way I think it contributes. I think the main way it contributes is that it operationalizes a big timelines crux & forcefully draws people’s attention to it. I wrote this post after reading Ajeya’s Bio Anchors report carefully many times, annotating it, etc. and starting several gdocs with various disagreements. I found in doing so that some disagreements didn’t change the bottom line much, while others were huge cruxes. This one was the biggest crux of all, so I discarded the rest and focused on getting this out there. And I didn’t even have the energy/time to really properly argue for my side of the crux—there’s so much more I could say!--so I contented myself with having the conclusion be “here’s the crux, y’all should think and argue about this instead of the other stuff.”
--I agree that there’s a lot more I could have done to argue that OmegaStar, Amp(GPT-7), etc. would be transformative. I could have talked about scaling laws, about how AlphaStar is superhuman at Starcraft and therefore OmegaStar should be superhuman at all games, etc. I could have talked about how Amp(GPT-7) combines the strengths of neural nets and language models with the strengths of traditional software. Instead I just described how they were trained, and left it up to the reader to draw conclusions. This was mainly because of space/time constraints (it’s a long post already; I figured I could always follow up later, or in the comments). I had hoped that people would reply with objections to specific designs, e.g. “OmegaStar won’t work because X” and then I could have a conversation in the comments about it. A secondary reason was infohazard stuff—it was already a bit iffy for me to be sketching AGI designs on the internet, even though I was careful to target +12 OOMs instead of +6; it would have been worse if I had also forcefully argued that the designs would succeed in creating something super powerful. (This is also a partial response to your critique about the numbers being too big, too fun—I could have made much the same point with +6 OOMs instead of +12 (though not with, say, +3 OOMs, those numbers would be too small) but I wanted to put an extra smidgen of distance between the post and ‘here’s a bunch of ideas for how to build AGI soon.’)
Anyhow, so the post you say you would love to read, I too would love to read. I’d love to write it as well. It could be a followup to this one. That said, to be honest I probably don’t have time to devote to making it, so I hope someone else does instead! (Or, equally good IMO, would be someone writing a post explaining why none of the 5 designs I sketched would work. Heck I think I’d like that even more, since it would tell me something I don’t already think I know.)