[Question] In the age of modern AI (LLMs and beyond), is data still the new oil?

In the age of big data and machine learning, it was common to say that data was the new oil. Every big company spent a lot of money setting up Hadoop clusters and hiring data scientists to discover new things from the data they had. In hindsight, it’s easy to see that big data failed to live up to its promise.

Today, when companies train their LLMs, they can use datasets that are publicly available: Common Crawl made up 82% of the raw tokens used to train GPT-3, and its latest version is 428TB. The new phrase in town is “high-value tokens”: in this world, adding some data generated by humans exclusively for the dataset (as is done for RLHF) helps far more than adding lots of unrelated data.

But I see tech executives talking a lot about how important data is, which runs contrary to my understanding of AI. Here’s Larry Ellison on the latest Oracle earnings call, replying to a question about whether the data gravity of his systems of record would change with the advent of AI:

Well, you know, you can’t build any of these AI models without enormous amounts of training data. So, if anything, what—what AI—generative AI has shown that the big issue about training one of these models is just getting—that this vast amount of data ingested into your GPU supercluster, it is a huge data problem in a sense you need so much data to, you know, train OpenAI, to train ChatGPT 3.5. They read the entire public internet. They read all of it, Wikipedia, they read everything.

They ingested everything. And to specialize—then you take something like ChatGPT 4.0 and you want to specialize it, you need specialized training data from electronic health records. Does it help doctors diagnose and treat cancer, let’s say. And we have partners imaging, for example, that is ingesting huge amounts of image data to train their AI models.

We have Ronin, another partner of ours in AI, ingesting huge amounts of electronic health records to train their models. AI doesn’t work without getting access to and ingesting enormous amounts of data. So, in terms of a shift away from data or a change in gravity to AI, AI is utterly dependent upon vast amounts of training data. Trillions of elements went into building ChatGPT 3.5, multiple times that for ChatGPT 4.0 because you had to deal with all the image data and ingest all of that to train—to train image recognition.

So, we think this is very good for our database business. And Oracle’s new vector database will contain highly specialized training data like electronic health records while keeping that data anonymized and private yet still training the specialized models that can—that can help doctors improve their diagnostic capability and their treatment prescriptions for cancer and heart disease and all sorts of other diseases. So, we think it’s a boon to our business, and we are now getting into the deep water of the information age. Nothing has changed about that.

The demands on data are getting stronger and more important.

This view is common among other companies’ executives too (obviously, Ellison is biased toward answers that put Oracle in a good spot).

But what Ellison is saying runs contrary to my intuition about AI. Companies like Nuance or Cerner may need data to train their medical LLMs, but a hospital, or even an insurer like UnitedHealth or a pharma company like Pfizer, has no edge whatsoever from having more data. Most of these tokens will be bought, generated by humans (you can literally hire 500 doctors to generate them), or taken from what is publicly available.

In this world, some big companies will spend a lot to generate data (it’s speculated that Google already has a data budget in the billions): they’ll buy it from other companies and pay human beings to generate it (which makes total sense when there’s a real possibility the world will spend $100B on GPUs in 2024). But what you want is high-quality data that isn’t repetitive.

Does Less Wrong agree that data is less valuable in the new world of AI?
