The Data Scaling Hypothesis

This idea is already implicit in most people’s thinking, but I wanted something I could link to.

I propose we use the data scaling hypothesis to refer to the claim:

The quality and quantity of data used to train an AI model is by far the most important determinant of model performance.

It’s a subset of the Scaling Hypothesis, which encompasses other inputs like architecture, parameters, and compute. Though the scaling hypothesis already places substantial weight on data.

By data, I mean any sort of information external to the model. This includes data in the traditional sense (images, text, spreadsheets, etc.) and things like the reward an RL agent gets in an environment or verifiable answers on math questions.

Some pithy ways to rephrase the hypothesis:

  • Data Is All You Need.

  • Data Is The New Oil.

  • Give me a [dataset] large enough and a [compute cluster] on which to place it, and I shall [create AGI].

  • The Bitter Datum: General methods that leverage [more data] are ultimately more effective, and by a large margin.

I don’t intend to offer a full defense here, but I will point out that most other inputs to AI training are downstream of the data set. The format determines the architecture, the number of parameters is a function of the size of the dataset, the compute is a function of the number of parameters and dataset size. The training algorithm becomes less important at scale; from the original scaling hypothesis: ”… intelligence is ‘just’ simple neural units & learning algorithms applied to diverse experiences at a (currently) unreachable scale.” As the data grows, every other input is determined by the dataset itself. The data is the key bottleneck.

The data scaling hypothesis is not automatically true; it carries an implicit assumption that these other inputs will be adapted as the dataset demands. This breaks down when we don’t have an algorithm that can effectively learn from the data. For example, we could generate contrived 3-SAT problems and their solutions ad nauseam, but nobody thinks there’s an architecture that could solve any NP-complete problem given enough of this data.

If true, I think this hypothesis has implications for AI safety and the future of AI scaling, but that’s a topic for another day.

For more detail on this idea and links to people making similar points, see this post. It would be great to gather other references so I can add them to the further reading section.

No comments.