According to the Chinchilla paper (“Training Compute-Optimal Large Language Models” by Hoffmann et al.), “For every doubling of model size the number of training tokens should also be doubled.”
So if this rule were followed, 143 times more data would also be needed, resulting in a 143 × 143 = 20,449-fold increase in the compute needed.
Chinchilla probably cost around 1-5 million usd in compute to train, so a 10 trillion parameter version would cost around 20.4 billion to 102 billion usd.
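Under the stated assumptions, the arithmetic can be sketched as follows (the 70B/10T parameter counts and the $1-5M Chinchilla cost range come from the text; compute is assumed proportional to parameters × tokens):

```python
# Back-of-the-envelope Chinchilla scaling, assuming compute ∝ params × tokens.
chinchilla_params = 70e9      # Chinchilla has 70B parameters
target_params = 10e12         # hypothetical 10T-parameter model

scale = round(target_params / chinchilla_params)   # ≈ 143x more parameters
compute_multiplier = scale * scale                 # data scales with size: 143 × 143

# Assumed Chinchilla training cost range from the text: $1M-$5M.
cost_low = 1e6 * compute_multiplier
cost_high = 5e6 * compute_multiplier
print(f"{scale}x params -> {compute_multiplier}x compute")
print(f"rough cost: ${cost_low / 1e9:.1f}B to ${cost_high / 1e9:.1f}B")
```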
However, this is not a viable path, because there is not enough text data available.
A more realistic scenario is to perhaps double the data Chinchilla was trained on (which might not even be easy to do), and then scale the model to 143x the size, for a cost of about 286 million to 1.43 billion usd.
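The doubled-data scenario works out the same way (a hypothetical calculation, reusing the $1-5M Chinchilla cost assumption from above):

```python
# "Realistic" scenario: 143x the parameters but only 2x the data.
# Compute ∝ params × tokens, so the multiplier is 143 × 2 = 286.
param_scale = 143
data_scale = 2
compute_multiplier = param_scale * data_scale   # 286x Chinchilla's compute

chinchilla_cost_low, chinchilla_cost_high = 1e6, 5e6   # assumed $1M-$5M range
low = chinchilla_cost_low * compute_multiplier         # ≈ $286M
high = chinchilla_cost_high * compute_multiplier       # ≈ $1.43B
print(f"{compute_multiplier}x compute -> ${low / 1e6:.0f}M to ${high / 1e9:.2f}B")
```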
Keep in mind that GPT-3 was trained on about a quarter of the data that Chinchilla was trained on. So a 10 trillion parameter GPT-3-style model might cost around 36 million to 179 million usd (143 × 1/4 ≈ 36 times Chinchilla's compute).
I predict that when something with the anticipated capabilities of a 10T Chinchilla is developed, it will cost at most a small multiple of what GPT-3 cost. It's not just falling hardware costs: there are also algorithmic improvements and completely new methods.
It won't cost much less than GPT-3 did, because more compute reliably improves results, so the cost is set by whatever the budget allows.
I believe that, given a few years, a company wanting to make a 10 trillion parameter GPT-3 could probably do it for less than these estimates, since tens to hundreds of millions of usd isn't that much compute, and for that money specialized hardware produced in bulk could bring costs down further.
[Chinchilla 10T would have a 143x increase in parameters and] 143 times more data would also be needed, resulting in a 143*143= 20449 increase of compute needed.
Would anybody be able to explain this calculation a bit? It implies that compute requirements scale linearly with the number of parameters. Is that true for transformers?
My understanding is that making the transformer deeper would increase compute linearly with parameters, but a wider model would require more than linear compute in the width, because it increases the number of connections between nodes at each layer.
The formula assumes a compute cost linear in the number of parameters, not in the network width. Fully-connected layers have a parameter count quadratic in the width (one parameter for each connection between a pair of neurons), so compute is still linear in the total parameter count (and this is true for non-transformers as much as transformers).
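A toy illustration of this point, using plain fully-connected layers (biases and attention ignored; the widths and depths are made up): doubling depth doubles the parameter count, while doubling width quadruples it, but in both cases forward-pass compute per token stays roughly proportional to the total parameter count (about 2 FLOPs per parameter: one multiply and one add per weight).

```python
def dense_params(width: int, depth: int) -> int:
    """Parameter count of a stack of `depth` fully-connected
    width x width layers (biases ignored for simplicity)."""
    return depth * width * width

base = dense_params(width=1024, depth=10)    # baseline
deeper = dense_params(width=1024, depth=20)  # 2x depth -> 2x params
wider = dense_params(width=2048, depth=10)   # 2x width -> 4x params

assert deeper == 2 * base
assert wider == 4 * base

# Forward compute per token is ~2 FLOPs per parameter regardless of whether
# the parameters come from extra depth or extra width.
def flops_per_token(params: int) -> int:
    return 2 * params

print(flops_per_token(base), flops_per_token(deeper), flops_per_token(wider))
```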
Chinchilla is 70 billion parameters. 10 trillion / 70 billion ≈ 143.
Ah right. Thank you!