Estimating training compute of Deep Learning models
by Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, and Anson Ho
You can find the complete article here. We provide a short summary below.
In short: To estimate the compute used to train a Deep Learning model we can either: 1) directly count the number of operations needed or 2) estimate it from GPU time.
Method 1: Counting operations in the model
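$$\text{training compute} = \underbrace{2 \times \#\text{ of connections}}_{\text{operations per forward pass}} \times \underbrace{3}_{\text{backward-forward adjustment}} \times \underbrace{\#\text{ training examples} \times \#\text{ epochs}}_{\text{number of passes}}$$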
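As a rough illustration, method 1 can be written as a small Python function. This is only a sketch: the function and parameter names are hypothetical, and the example network size, dataset size, and epoch count are made-up values rather than figures from the article.

```python
def training_compute_from_operations(connections, training_examples, epochs):
    """Estimate training compute (in operations) by counting operations (method 1)."""
    # Roughly one multiply and one add per connection in a forward pass.
    forward_ops_per_example = 2 * connections
    # The backward pass costs about twice the forward pass, so a full
    # forward-plus-backward pass is about 3x the forward pass alone.
    backward_forward_adjustment = 3
    # Every training example is seen once per epoch.
    number_of_passes = training_examples * epochs
    return forward_ops_per_example * backward_forward_adjustment * number_of_passes


# Hypothetical model: 1 million connections, 100,000 examples, 10 epochs.
estimate = training_compute_from_operations(
    connections=1_000_000,
    training_examples=100_000,
    epochs=10,
)
print(estimate)  # prints 6000000000000, i.e. 6e12 operations
```

The factor of 3 hard-codes the 2:1 backward-to-forward ratio, so the sketch applies only where that ratio is a reasonable approximation.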
Method 2: GPU time
$$\text{training compute} = \text{training time} \times \#\text{ cores} \times \text{peak FLOP/s} \times \text{utilization rate}$$
We are uncertain about what utilization rate is best, but our recommendation is 30% for Large Language Models and 40% for other models.
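Method 2 can be sketched the same way. The function name, parameter names, and constant names below are hypothetical, and the hardware figures in the example (8 cores, one day of training, a peak of 1e14 FLOP/s per core) are illustrative assumptions, not measurements. The two default utilization rates are the recommendations stated above.

```python
import math

# Recommended utilization rates from the text above.
LLM_UTILIZATION = 0.30
OTHER_UTILIZATION = 0.40


def training_compute_from_gpu_time(training_time_seconds, num_cores,
                                   peak_flop_per_second, utilization_rate):
    """Estimate training compute (in FLOP) from hardware time (method 2)."""
    if not 0 < utilization_rate <= 1:
        raise ValueError("utilization_rate must be in the interval (0, 1]")
    return (training_time_seconds * num_cores
            * peak_flop_per_second * utilization_rate)


# Hypothetical run: 8 cores for one day, each with a peak of 1e14 FLOP/s,
# using the recommended utilization rate for Large Language Models.
one_day_in_seconds = 24 * 60 * 60
estimate = training_compute_from_gpu_time(
    training_time_seconds=one_day_in_seconds,
    num_cores=8,
    peak_flop_per_second=1e14,
    utilization_rate=LLM_UTILIZATION,
)
print(f"{estimate:.3e} FLOP")
assert math.isclose(estimate, 2.0736e19, rel_tol=1e-9)
```

Because the utilization rate is the most uncertain input, it can help to run the estimate at both 30% and 40% and report the range.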
You can read more about method 1 here and about method 2 here.
Other parts of the article that may be of interest:
We argue that the ratio of operations in the backward pass to operations in the forward pass of a neural network is often close to 2:1. More.
We discuss how the formula of method 1 changes for recurrent models. More.
We argue that dropout does not affect the number of operations per forward and backward pass. More.
We have compiled a table of parameter and operation counts for common neural network layers. More.
We give a detailed example of method 1. More.
We discuss commonly used number representation formats in ML. More.
We share an estimate of the average performance of GPU cards each year. More.
We share some GPU utilization rates reported in real experiments. More.
We give a detailed example of method 2. More.
We compare both methods and conclude they result in similar estimates. More.
We discuss the use of profilers to measure compute. More.
Complete Article
You can find the article here.