How much time do we have with transformers?
As I understand the current discourse about scaling laws, we have already trained models on the entire (textual) internet and still haven’t found superintelligence. Labs have started to use video and audio data, but these sources are less meaning-dense (encoding a picture of a room takes as much data as two books’ worth of characters), so we could be at the edge of scaling unless a new, more data-efficient architecture arises.
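As a rough gut check on that picture-versus-books comparison, here it is as two lines of Python; the sizes are ballpark assumptions of mine, not measurements:

    # Rough gut check: a compressed photo vs. plain-text books.
    # Assumed sizes (ballpark only):
    photo_bytes = 1_000_000           # ~1 MB JPEG of a room
    book_bytes = 500_000              # ~0.5 MB of characters per book
    print(photo_bytes / book_bytes)   # ~2 books per photo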
Let’s assume humans can generate 40 bits/sec of meaningful data. Estimates of the information density of natural spoken languages converge around this number cross-culturally, so it could be a natural processing limit of the brain. A quarter to a third of the day is spent sleeping, and not every interaction is captured (yet; dystopian surveillance will increase that, but for now), so let’s count half of it over an average day: 20 bit/sec * 86400 sec/day ≈ 210 KB/day/human. A population of 8 billion means roughly 1,700 TB/day. A post I found with DuckDuckGo says 402M TB/day is created on the net, which would be a meaning-to-data efficiency of about 0.0004% (a few parts per million); given that a large share of it is video, at megabits per second of data against tens of bits per second of meaning, that is believable. GPT-2 was released in 2019, so let’s take roughly five years of accumulation: about 3 EB of meaning, and at 16 bits (2 bytes) per token, roughly 1.6×10^18 tokens over the past five years.
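Here is the same chain in a few lines of Python, so the arithmetic can be checked; the inputs are just the assumptions stated above:

    # Recomputing the estimate from its stated inputs.
    # Assumptions: 20 bit/s of captured meaning per person, 8 billion people,
    # ~5 years of accumulation, 16 bits (2 bytes) per token.
    bits_per_sec = 20
    seconds_per_day = 86_400
    people = 8e9
    days = 5 * 365
    internet_bytes_per_day = 402e6 * 1e12                     # the "402M TB/day" figure

    bytes_per_human_day = bits_per_sec * seconds_per_day / 8  # ~216 KB
    bytes_per_day = bytes_per_human_day * people              # ~1.7e15 B (~1,700 TB)
    meaning_ratio = bytes_per_day / internet_bytes_per_day    # ~4e-6, i.e. ~0.0004%
    total_bytes = bytes_per_day * days                        # ~3.2e18 B (~3 EB)
    tokens = total_bytes / 2                                  # ~1.6e18 tokens
    print(bytes_per_human_day, bytes_per_day, meaning_ratio, total_bytes, tokens)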
Plugging this into the Chinchilla scaling law, the data-dependent term would contribute something on the order of 0.0005 to the loss floor, insignificant next to the irreducible loss of the ideal generative model, around 1.87. At the compute-optimal ratio D/N = 20, that much data would support about 8×10^16 (80P) parameters, tens of thousands of times larger than current frontier models if those are on the order of 10^12 parameters. Assuming a 3.7x increase per year, that point, and with it the end of transformer scaling, would be reached in roughly 8–9 years. But this is a back-of-the-napkin calculation; I think of it more as an upper bound.
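And the scaling-law step as code, a sketch rather than a definitive calculation: the constants are the fit reported in the original Hoffmann et al. (2022) Chinchilla paper, which gives a somewhat different absolute loss floor than the ~1.87 fit behind the numbers above, and the current-model size and yearly growth factor are assumptions, not measurements:

    import math

    # Chinchilla-style loss: L(N, D) = E + A / N**alpha + B / D**beta.
    # Constants: the fit reported by Hoffmann et al. (2022). A different
    # re-fit of the same law (loss floor near 1.87) underlies the numbers
    # in the text; the shape of the argument is the same.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    D = 1.6e18                  # tokens of "meaning" from the estimate above
    data_term = B / D**beta     # ~0.003 with this fit; tiny either way
    N = D / 20                  # compute-optimal params at D/N = 20, ~8e16

    current_params = 1e12       # assumed order of today's frontier models (rumour-level)
    growth = 3.7                # assumed yearly growth factor
    years = math.log(N / current_params) / math.log(growth)   # ~8.6 years
    print(data_term, N, years)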