Revisiting the Horizon Length Hypothesis

Summary: As part of my work at Epoch, I investigated the horizon length hypothesis—the idea that the horizon length of a task is predictive of the training compute needed to learn that task. My current (weak) conclusion is that the horizon length hypothesis can’t be used in practice to estimate the compute requirements for training transformative AI, because of the difficulty of 1) measuring horizon length accurately and 2) accounting for compute-saving techniques like curriculum learning. Since the evidence is weak, we decided to publish a summary of our research so far here for public scrutiny and to gather feedback while we decide whether to work more on the topic.

Introduction

The horizon length hypothesis (HLH) asserts that the amount of compute needed to train an ML system to perform a task is $C \propto H \cdot N^{1+\alpha}$, where $N$ is the model size (number of parameters), $H$ is the horizon length of the associated task, and $\alpha$ is an exponent that is common to all tasks. Equivalently, the number of training samples needed scales as $D \propto H \cdot N^{\alpha}$, with compute roughly proportional to $N \cdot D$.

The horizon length is intuitively the length of the feedback loops involved in learning the task. For example, balancing a pole with your hand has a subsecond feedback loop, whereas running a company has months- or years-long feedback loops. So we should expect the horizon length of the task ‘Running a company’ to be up to 8 orders of magnitude bigger than that of ‘Balancing a pole’.
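
As a concrete illustration of what the HLH formula implies, here is a minimal sketch in Python. It assumes $\alpha = 1$ (the Chinchilla-like value discussed below), ignores the constant of proportionality, and uses made-up model sizes and horizon lengths rather than figures from the report:

```python
def hlh_compute(n_params: float, horizon: float, alpha: float = 1.0) -> float:
    """Relative training compute implied by the HLH: C ∝ H * N^(1 + alpha)."""
    return horizon * n_params ** (1 + alpha)

# Same (made-up) model size for both tasks, so only the horizon differs.
n = 1e9
pole = hlh_compute(n, horizon=0.3)     # sub-second feedback loop (in seconds)
company = hlh_compute(n, horizon=3e7)  # feedback loops of roughly a year (in seconds)

print(f"compute ratio: {company / pole:.0e}")  # ~1e+08, i.e. the ~8 orders of magnitude above
```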

Dependence on the training method

Instead of directly training a model to perform a task, one can use transfer learning from a different task, or prepare a curriculum of progressively harder instances of the task. Using these techniques it’s possible to reduce the amount of task-specific data used in training.

We might therefore be able to save compute by training only on small amounts of data for the long-horizon task, and doing the bulk of the training on short-horizon tasks. So if we naively apply the HLH with the full horizon length of a long-horizon task, we will overestimate its training requirements.

However, the HLH might be valid for fairly direct training methods (no transfer learning, no reward shaping, etc.). If this is true, then it might still be informative for predicting compute requirements. In particular, it seems likely that models will need at least some training at longer horizon lengths. An example of this might be LLMs, where pretraining has a horizon length of 1, but RLHF arguably has a longer horizon.

In this way, training can be decomposed into several phases (or several reward components), each with a defined horizon length. The compute required for each phase could then be estimated with the HLH. I have not tested this approach yet but I think there are relatively simple experiments that could be informative.[1]
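
Here is a minimal sketch of what that decomposition could look like, assuming the HLH holds within each phase with a common exponent $\alpha = 1$ and a shared model size; the phases and horizon lengths below are hypothetical placeholders loosely based on the LLM example above, not estimates from the report:

```python
ALPHA = 1.0
N_PARAMS = 1e9  # illustrative model size, shared across phases

# (phase name, horizon length of that phase, in arbitrary "step" units)
phases = [
    ("pretraining", 1.0),
    ("rlhf", 1e3),
]

total = 0.0
for name, horizon in phases:
    compute = horizon * N_PARAMS ** (1 + ALPHA)  # HLH applied per phase: C ∝ H * N^(1+alpha)
    total += compute
    print(f"{name}: ~{compute:.1e} (relative units)")

print(f"total: ~{total:.1e}")
```

On these made-up numbers the longer-horizon phase dominates the total, which illustrates how sensitive such an estimate would be to the assumed horizon lengths of each phase.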

Evidence for the HLH

We have some theoretical reasons to believe that model size should be scaled proportionally to the amount of data (that is, $D \propto N$, or $\alpha \approx 1$). These come from bounds in statistical learning theory,[2] from some models of scaling laws, and from the fact that for some restricted classes of neural networks and functions it does seem to be true.[3] Separately, in reinforcement learning it is known that tasks with longer time horizons require proportionally more samples.[4] There are also other theoretical analyses that give different results, though.
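
To see one way the $D \propto N$ conclusion can fall out of a model of scaling laws, consider a simple additive loss with the $1/2$ exponents from footnote 3; this functional form is an illustrative assumption, not the only model consistent with the evidence:

$$L(N, D) \;\approx\; A\,N^{-1/2} + B\,D^{-1/2}, \qquad C \propto N \cdot D.$$

Minimizing $L$ at fixed compute gives $D/N = (B/A)^2$, a constant, i.e. $D \propto N$ and an optimal scaling exponent of 1.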

Figure 1: Optimal scaling exponents from the literature, vs the scale of the experiment. All of the largest-scale experiments (>1e19 FLOP) except Chinchilla should be taken with a grain of salt, since the methodology used to estimate them was later shown to be wrong for language.

Meanwhile, the exponents measured in the scaling literature are often very different (Figure 1). I don’t think we should take this as a definitive rebuttal of the HLH, because these measurements are quite noisy. In particular, we have reasons to suspect that the experiments in Henighan et al. 2020 are not truly optimal (since the same methodology turned out not to be optimal for language). The most careful measurement of this exponent at scale was the one for language in Chinchilla, which did give a value of 1.

The fact that a lot of these exponents are probably wrong, even though the experiments were performed by world-class teams, indicates that in practice measuring the horizon length of a task is quite hard and requires careful experimentation. This makes me skeptical about using it for forecasting.

Conclusion

The large savings in training requirements from using transfer learning, curriculum learning, and similar techniques mean that naively applying the HLH won’t yield good estimates of training compute requirements for long-horizon tasks.

It’s unclear whether the HLH is true even for more direct training methods. The empirical evidence is inconclusive, while the theoretical evidence seems weakly positive. If it is true, we might be able to use it to estimate training requirements by decomposing training into different components with well-defined horizon lengths.

One additional complication is that accurately measuring the horizon length of a task empirically seems quite hard. Therefore, we would need to rely on heuristics and informal arguments about the nature of the task to guess the horizon length.

Taking all of this into account, I think that using the HLH to make quantitative predictions of training compute requirements is quite hard. However, the HLH might still provide useful qualitative insights.

You can read the full report here, but note that it’s quite rough and unpolished.

  1. ^

    This would involve taking some concrete task and trying several training methods, or possibly combinations of several reward functions with different horizon lengths.

  2. ^

    Note that the applicability of classical learning theory to deep learning is quite contested; see, for example, this sequence.

  3. ^

    E.g., for perceptrons (single-layer networks) it is known to be true. Also, there are several results that show data and parameter scaling exponents of 1/2 (e.g., for two-layer networks, or for SGD on convex functions). If this turns out to be the best possible scaling in general, it would imply an optimal scaling exponent of 1.

  4. ^

    This is mainly mediated by the variance of the gradient estimates, which increases with the time horizon (see this). Reducing this variance requires using proportionally more samples.