To check my understanding: Your graph + argument is that we should be fairly uncertain about what the relevant scaling laws will be for AGI, and that the exponent could be anywhere from (say) 0.3 to 1.6. How does this translate into variance in timelines? Well, IIRC Ajeya has 1e35 FLOP as her median for training requirements, and something like 1e16 of that comes from flop-per-subjective-second, and maybe 1e5 from multiple-subjective-seconds-per-feedback-loop/data-point, so that leaves 1e14 for data points / feedback loops? That’s about as many data points as you have parameters, which is consistent with Ajeya’s guess at the scaling laws, where the exponent is 0.8.
So if instead you had an exponent of 0.3, the data requirement would be roughly cut in half on a log scale, to something like 1e7? And if you had an exponent of 1.6, it would be 60%-100% more on a log scale, to something like 1e24?
So, you conclude, there’s such huge variance in what our timelines should be (like, 15 OOMs on the key variable of AGI training requirements) based on such flimsy evidence that we should look for a better way to estimate timelines than this.
Am I understanding correctly? (This is all mental math, maybe I’m doing it wrong?)
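(Spelled out as a quick sanity check in Python — these are the hedged, recalled figures from above, not authoritative bioanchors numbers:)

```python
import math

# Sanity check of the decomposition above, using the recalled figures
# from this exchange (1e35, 1e16, 1e5) -- hedged recollections, not
# authoritative bioanchors numbers.
flop_per_subj_sec = 1e16   # flop per subjective second
sec_per_datapoint = 1e5    # subjective seconds per feedback loop / data point
total_flop        = 1e35   # recalled median for AGI training FLOP

datapoints = total_flop / (flop_per_subj_sec * sec_per_datapoint)
print(f"implied data points: {datapoints:.0e}")  # 1e+14

# Spread in total training compute across the quoted endpoints for the
# data requirement (1e7 vs 1e24 data points / feedback loops):
low  = flop_per_subj_sec * sec_per_datapoint * 1e7    # ~1e28 FLOP
high = flop_per_subj_sec * sec_per_datapoint * 1e24   # ~1e45 FLOP
print(f"spread: ~{math.log10(high / low):.0f} OOMs")  # ~17, i.e. 'like, 15 OOMs'
```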
Not quite. What you said is a reasonable argument, but the graph is noisy enough, and the theoretical arguments convincing enough, that I still assign >50% credence that data (number of feedback loops) should be proportional to parameters (exponent=1).
My argument is that even if the exponent is 1, the coefficient corresponding to horizon length (‘1e5 from multiple-subjective-seconds-per-feedback-loop’, as you said) is hard to estimate.
There are two ways of estimating this factor:
1. Empirically fitting scaling laws for whatever task we care about.
2. Reasoning about the nature of the task and how long the feedback loops are.
Number 1 requires a lot of experimentation, choosing the right training method, hyperparameter tuning, etc. Even OpenAI made some mistakes on those experiments. So probably only a handful of entities can accurately measure this coefficient today, and only for known training methods!
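(As a rough illustration of what approach 1 involves at its simplest — fitting an exponent and coefficient on log-log data — here’s a toy sketch; the measurements below are synthetic placeholders, and real versions need all the experimental care just described:)

```python
import numpy as np

# Toy version of approach 1: fit D = c * N**e from (params, data) pairs.
# These measurements are synthetic placeholders, purely for illustration.
params = np.array([1e6, 1e7, 1e8, 1e9])          # model sizes
data   = np.array([2.1e6, 1.8e7, 2.3e8, 1.9e9])  # data needed to reach target loss

# Least-squares fit of a line in log-log space: slope = exponent e,
# intercept = log10 of the coefficient c.
e, log_c = np.polyfit(np.log10(params), np.log10(data), deg=1)
print(f"fitted exponent e = {e:.2f}, coefficient c = {10**log_c:.1f}")

# The coefficient c plays the role of the horizon-length factor; small
# errors in the fitted exponent or coefficient shift the extrapolated
# training requirement by many OOMs at AGI-scale model sizes.
```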
Number 2, if done naively, probably overestimates training requirements. When someone learns to run a company, a lot of the relevant feedback loops probably happen on timescales much shorter than months or years. But we don’t know how to perform this decomposition of long-horizon tasks into sets of shorter-horizon tasks, how important each of the subtasks is, etc.
We can still use the bioanchors approach: pick a broad distribution over horizon lengths (short, medium, long). My argument is that outperforming bioanchors by making more refined estimates of horizon length seems too hard in practice to be worth the effort, and maybe we should lean towards shorter horizons being more relevant (because so far we have seen a lot of reduction of longer-horizon tasks to shorter-horizon learning problems, e.g. expert iteration or LLM pretraining).
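(A minimal sketch of that bioanchors-style move — the horizon buckets and weights below are made-up illustrations, not Ajeya’s actual numbers:)

```python
import math

# Bioanchors-style move: keep a broad distribution over horizon length
# rather than a point estimate. Horizon values and weights here are
# made-up placeholders, not Ajeya's actual numbers.
flop_per_subj_sec = 1e16   # flop per subjective second (as above)
datapoints        = 1e14   # data points / feedback loops (as above)

horizons = {               # subjective seconds per data point, with weight
    "short":  (1e0, 0.4),
    "medium": (1e3, 0.4),
    "long":   (1e6, 0.2),
}

# Each horizon hypothesis implies a different total training compute;
# the output is a weighted spread (here 1e30 to 1e36 FLOP), not a point.
for name, (sec_per_dp, weight) in horizons.items():
    flop = flop_per_subj_sec * sec_per_dp * datapoints
    print(f"{name:>6} horizon: ~1e{math.log10(flop):.0f} FLOP (weight {weight})")
```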
Assuming I’m understanding correctly: nice argument. I guess I have a bit more confidence in the scaling laws than you, but I definitely still agree that our uncertainty about AGI training compute requirements (with 2023 algorithms) should range over many OOMs.
But what does this have to do with horizon length? I guess the idea is that the proper scaling law shouldn’t be assumed to be a function of data points alone, but rather of data points and the type of task you are training on, and plausibly for longer-horizon tasks you need less data than you’d naively expect (especially with techniques like imitation learning + finetuning, etc.)? Yep, that also seems very plausible to me; it’s a big part of why my timelines are much shorter than Ajeya’s.
OK, I think we are on the same page then. Thanks.