Not quite. What you said is a reasonable argument, but the graph is noisy enough, and the theoretical arguments convincing enough, that I still assign >50% credence that data (number of feedback loops) should be proportional to parameters (exponent=1).
My argument is that even if the exponent is 1, the coefficient corresponding to horizon length (‘1e5 from multiple-subjective-seconds-per-feedback-loop’, as you said) is hard to estimate.
There are two ways of estimating this factor:
1. Empirically fitting scaling laws for whatever task we care about.
2. Reasoning about the nature of the task and how long the feedback loops are.
Number 1 requires a lot of experimentation, choosing the right training method, hyperparameter tuning, etc. Even OpenAI made some mistakes on those experiments. So probably only a handful of entities can accurately measure this coefficient today, and only for known training methods!
Number 2, if done naively, probably overestimates training requirements. When someone learns to run a company, a lot of the relevant feedback loops probably happen on timescales much shorter than months or years. But we don't know how to perform this decomposition of long-horizon tasks into sets of shorter-horizon tasks, how important each of the subtasks is, etc.
We can still use the bioanchors approach: pick a broad distribution over horizon lengths (short, medium, long). My argument is that outperforming bioanchors by making more refined estimates of horizon length seems too hard in practice to be worth the effort, and maybe we should lean towards shorter horizons being more relevant (because so far we have seen a lot of reduction from longer-horizon tasks to shorter-horizon learning problems, e.g. expert iteration or LLM pretraining).
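To make the argument concrete, here is a minimal sketch of why the horizon-length coefficient dominates the uncertainty even when the exponent is exactly 1. All numbers below (the parameter count, the candidate horizon lengths, and the `required_data` functional form) are illustrative assumptions, not fitted values from any actual scaling-law experiment.

```python
# Hypothetical illustration: if data (number of feedback loops) scales as
# coefficient * parameters ** exponent, then even with exponent fixed at 1,
# the horizon-length coefficient spans orders of magnitude.

def required_data(parameters, horizon_coefficient, exponent=1.0):
    """Feedback loops needed, under an assumed power-law scaling form."""
    return horizon_coefficient * parameters ** exponent

params = 1e12  # assumed parameter count, for illustration only

# A broad, bioanchors-style distribution over horizon lengths, expressed as
# subjective seconds per feedback loop. The bucket values are placeholders.
horizons = {
    "short (1 s/loop)": 1.0,
    "medium (1e3 s/loop)": 1e3,
    "long (1e5 s/loop)": 1e5,
}

for name, h in horizons.items():
    print(f"{name}: {required_data(params, h):.1e} feedback loops")
```

With the exponent pinned at 1, the estimates still differ by a factor of 1e5 across the buckets, which is the sense in which empirically measuring or theoretically pinning down the coefficient matters far more than refining the exponent.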
OK, I think we are on the same page then. Thanks.