GPT-3 has about 2e11 parameters and uses about 4 flops per parameter per token, so about 1e12 flops per token (2e11 × 4 = 8e11, rounded up).
If a human writes at 1 token per second, then you should be comparing 1e12 flops to the cost per second. I think you are implicitly comparing to the cost for a ~1000 token context?
I think 1e14 to 1e15 flops is a plausible estimate for the productive computation done by a human brain in a second, which is about 2-3 orders of magnitude beyond GPT-3.
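The arithmetic behind these figures can be checked with a rough sketch like the one below. It only uses the order-of-magnitude numbers quoted in this comment; the variable names and the helper function are hypothetical and purely illustrative, and the 1000-token line is just one reading of the "implicit comparison" guessed at above.

```python
import math

# Figures quoted in the comment (rough, order-of-magnitude values only).
GPT3_PARAMS = 2e11              # "about 2e11 parameters"
FLOPS_PER_PARAM_PER_TOKEN = 4   # "about 4 flops per parameter per token"
TOKENS_PER_SECOND = 1           # assumed human writing speed
BRAIN_FLOPS_LOW = 1e14          # low end of the brain estimate, per second
BRAIN_FLOPS_HIGH = 1e15         # high end of the brain estimate, per second
CONTEXT_TOKENS = 1000           # the "~1000 token context" guessed at above


def orders_of_magnitude(a: float, b: float) -> float:
    """Return log10(a / b): how many powers of ten a exceeds b by."""
    return math.log10(a / b)


# Per-token cost, then per-second cost at one token per second.
flops_per_token = GPT3_PARAMS * FLOPS_PER_PARAM_PER_TOKEN
gpt3_flops_per_second = flops_per_token * TOKENS_PER_SECOND

# Gap between the brain estimate and GPT-3 at human writing speed.
gap_low = orders_of_magnitude(BRAIN_FLOPS_LOW, gpt3_flops_per_second)
gap_high = orders_of_magnitude(BRAIN_FLOPS_HIGH, gpt3_flops_per_second)

# Cost of a full ~1000-token context, for comparison with the per-second figure.
context_flops = flops_per_token * CONTEXT_TOKENS

print(f"flops per token:        {flops_per_token:.1e}")
print(f"gap to brain (low):     {gap_low:.1f} orders of magnitude")
print(f"gap to brain (high):    {gap_high:.1f} orders of magnitude")
print(f"flops per 1000 tokens:  {context_flops:.1e}")
```

Under these assumptions the gap comes out at roughly 2 to 3 orders of magnitude, and a 1000-token context costs about as much as the high end of the one-second brain estimate, which may be where the discrepancy in the post comes from.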
I think this is not really a coincidence. GPT-3 is notable because it’s starting to exhibit human-like abilities. It’s not super surprising that this should happen around human levels of compute, and I would personally expect the trend to continue as we scale up towards human-level compute and continue improving deep learning efficiency. (I gave this about 50% probability in 2017 before seeing GPT-2, but I’ve updated significantly in favor over the last 6 years.)
More generally, I think the numbers in your post are wrong and the discussion is somewhat confused: 1e15 to 1e30 is not a narrow interval; I don’t think you should compare training costs to inference costs; 1e30 is not the training cost of GPT-3; you should probably compare to brain compute estimates like this one rather than brain emulation estimates...
But I think it’s reasonable to step back and say that compared to what you might have expected, biological anchors have been a pretty good guide to ML progress. They are losing usefulness now since at best they have like 10 years of resolution and eyeballing is getting easier and easier as we approach transformative AI. But I still find them helpful as an additional independent check to go along with eyeballing, economic extrapolations, etc. (And until recently I think they were probably the most common way people arrived at in-retrospect-reasonable-looking timeline estimates.)
Hi. Can you provide a citable reference for the “4 flops per parameter per token”? It’s for a research paper in the foundations of quantum physics. Thanks. (Howard Wiseman.)