The Data Wall Is Important


Modern AI is trained on a huge fraction of the internet; at the cutting edge, the best models are trained on close to all the high-quality data we’ve got.[1] And data is really important! You can scale up compute, you can make algorithms more efficient, or you can add infrastructure around a model to make it more useful, but on the margin, great datasets are king. And, naively, we’re about to run out of fresh data to use.

It’s rumored that the top firms are looking for ways to get around the data wall. One possible approach is having LLMs create their own data to train on, for which there is kinda-sorta a precedent in, e.g., modern chess AIs learning by playing games against themselves.[2] Another is finding ways to make AI dramatically more sample-efficient with the data we’ve already got: the existence of human brains proves that this is, theoretically, possible.[3]

But all we have, right now, are rumors. I’m not even personally aware of rumors that any lab has cracked the problem: certainly, nobody has come out and said so in public! There’s a lot of insinuation that the data wall is not so formidable, but no hard proof. And if the data wall really is a hard blocker, it could be very difficult to get AI systems much stronger than they are now.

If the data wall stands, what should we make of today’s rumors? There’s certainly an optimistic mood about progress coming from AI company CEOs, and a steady trickle of not-quite-leaks that exciting stuff is going on behind the scenes and that we should stay tuned. But there are at least two competing explanations for all this:

1. Top companies are already using the world’s smartest human minds to crack the data wall, and have all but succeeded.

2. Top companies need to keep releasing impressive stuff to keep the money flowing, so they declare, both internally and externally, that their current hurdles are surmountable.

There’s lots of precedent for number two! You may have heard of startups hard-coding a feature and then scrambling to actually implement it once there’s interest. And race dynamics make this even more likely: if OpenAI projects cool confidence that it’s almost over the data wall, and Anthropic doesn’t, then where will all the investors, customers, and high-profile corporate deals go? There could also be an echo-chamber effect, where one firm acting like the data wall’s not a big deal makes other firms take its word for it.

I don’t know what a world with a strong data wall looks like in five years. I bet it still looks pretty different from today! Just improving GPT-4-level models around the edges, giving them better tools and scaffolding, should be enough to spur massive economic activity and, in the absence of government intervention, job market changes. We can’t unscramble the egg. But the “just trust the straight line on the graph” argument ignores that one of the inputs determining that line is running out. There’s a world where the trend is stronger than that particular constraint, and a new treasure trove of data appears in time. But there’s also a world where it isn’t, and we’re near the inflection point of an S-curve.

Rumors and projected confidence can’t tell us which world we’re in.

1. For good analysis of this, search for the heading “The data wall” here.

2. But don’t take this parallel too far! Chess AI (or AI playing any other game) has a signal of “victory” that it can seek out: it can preferentially choose moves that systematically lead to the “my side won the game” outcome. But the core of an LLM is a text predictor: “winning” for it is correctly guessing what comes next in human-created text. What does self-play look like there? Merely making up fake human-created text has the obvious issue of amplifying any weaknesses the AI has. If an AI thought, for example, that humans said “fiddle dee dee” all the time for no reason, then it would put “fiddle dee dee” in lots of its synthetic data, AIs trained on that dataset would “learn” the same false assumption, and “fiddle dee dee” prevalence would go way up in LLM outputs. The same goes for every other failure mode, leading to wacky feedback loops that might make self-play models worse instead of better. (For a toy illustration of this feedback loop, see the sketch after these footnotes.)

3. Shout out to Steven Byrnes, my favorite researcher in this direction.
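
To make footnote 2’s feedback loop concrete, here is a minimal sketch in Python. It is my own toy construction, not anything from the post or from a real lab: a unigram frequency table stands in for an LLM, the vocabulary and the 1% starting rate of “fiddle dee dee” are invented, and the extra 200 copies added each round are a stand-in assumption for the model’s systematic over-generation of the phrase.

```python
import random

# Toy vocabulary; "fiddle dee dee" is the spurious phrase from footnote 2.
VOCAB = ["the", "cat", "sat", "fiddle dee dee"]

def train(corpus):
    """'Train' a unigram model: estimate token frequencies from the corpus."""
    counts = {tok: corpus.count(tok) for tok in VOCAB}
    total = sum(counts.values())
    return {tok: counts[tok] / total for tok in VOCAB}

def generate(model, n_tokens):
    """Sample a synthetic corpus from the model's learned distribution."""
    tokens, weights = zip(*model.items())
    return random.choices(tokens, weights=weights, k=n_tokens)

# Human-written data: the spurious phrase is rare (about 1% of tokens).
human_corpus = ["fiddle dee dee"] * 10 + ["the", "cat", "sat"] * 330

model = train(human_corpus)
print(f"human data: P('fiddle dee dee') = {model['fiddle dee dee']:.3f}")

for gen in range(1, 6):
    # Each generation trains only on text the previous generation produced,
    # plus a small systematic over-generation of the spurious phrase (the
    # model's "weakness"), which the next model then learns as real usage.
    synthetic = generate(model, 10_000) + ["fiddle dee dee"] * 200
    model = train(synthetic)
    print(f"gen {gen}: P('fiddle dee dee') = {model['fiddle dee dee']:.3f}")
```

Run it and the learned probability of the phrase climbs every generation, because each model’s bias gets baked into the very data the next model trains on. Real LLM training is vastly more complicated, but this is the shape of the worry.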