Estimating training compute of Deep Learning models

by Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, and Anson Ho

You can find the complete article here. We provide a short summary below.

In short: To estimate the compute used to train a Deep Learning model we can either: 1) directly count the number of operations needed or 2) estimate it from GPU time.

Method 1: Counting operations in the model
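
As a rough illustration of method 1, the sketch below multiplies the operations in one forward pass by a factor that accounts for the backward pass, then by the number of training examples and epochs. It relies on the roughly 2:1 backward-to-forward ratio argued for in the article, and it assumes about 2 operations (one multiply and one add) per connection in a forward pass. The function names and the example numbers are hypothetical and do not describe any real model.

```python
def forward_ops_from_connections(num_connections):
    """Approximate operations in one forward pass for one example.

    Assumption: each connection costs one multiply and one add (2 ops).
    """
    return 2 * num_connections


def training_compute_from_operations(forward_ops_per_example,
                                     num_examples,
                                     num_epochs,
                                     backward_to_forward_ratio=2.0):
    """Estimate total training operations by counting (method 1).

    A backward pass is assumed to cost `backward_to_forward_ratio` times
    a forward pass, so one training step on one example costs
    (1 + ratio) forward passes, i.e. 3 with the default ratio.
    """
    ops_per_example = forward_ops_per_example * (1 + backward_to_forward_ratio)
    return ops_per_example * num_examples * num_epochs


# Hypothetical model: 1e6 connections, 1e5 training examples, 10 epochs.
forward_ops = forward_ops_from_connections(1_000_000)
method1_estimate = training_compute_from_operations(forward_ops, 100_000, 10)
print(f"Estimated training compute: {method1_estimate:.2e} operations")
```

The ratio is left as a parameter because the 2:1 figure is described as holding often, not always.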

Method 2: GPU time

We are uncertain about what utilization rate is best, but our recommendation is 30% for Large Language Models and 40% for other models.
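
A minimal sketch of method 2 follows, assuming the estimate is the product of training time, number of GPUs, peak hardware performance, and utilization rate. The two default utilization values are the recommendations above. The function name, the hardware figures, and the training duration are hypothetical placeholders.

```python
# Recommended utilization rates from the summary above.
UTILIZATION_LLM = 0.30    # Large Language Models
UTILIZATION_OTHER = 0.40  # other models


def training_compute_from_gpu_time(training_time_s,
                                   num_gpus,
                                   peak_flop_per_s,
                                   utilization):
    """Estimate training compute in FLOP from GPU time (method 2).

    training_time_s: wall-clock training time in seconds
    num_gpus:        number of GPUs (or other accelerators) used
    peak_flop_per_s: peak performance of one GPU in FLOP/s
    utilization:     fraction of peak actually achieved, in (0, 1]
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the interval (0, 1]")
    return training_time_s * num_gpus * peak_flop_per_s * utilization


# Hypothetical run: 8 GPUs for 10 days, 1e14 FLOP/s peak per GPU.
seconds = 10 * 24 * 60 * 60
method2_estimate = training_compute_from_gpu_time(
    seconds, 8, 1e14, UTILIZATION_LLM
)
print(f"Estimated training compute: {method2_estimate:.2e} FLOP")
```

Because the utilization rate is the most uncertain input, it can help to run the estimate with both recommended values and report the range.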

You can read more about method 1 here and about method 2 here.

Other parts of the article that may be of interest include:

  • We argue that the ratio of operations of backward and forward pass of neural networks is often close to 2:1. More.

  • We discuss how the formula of method 1 changes for recurrent models. More.

  • We argue that dropout does not affect the number of operations per forward and backward pass. More.

  • We have compiled a table with parameter and operation counts for common neural network layers. More.

  • We give a detailed example of method 1. More.

  • We discuss commonly used number representation formats in ML. More.

  • We share an estimate of the average performance of GPU cards each year. More.

  • We share some GPU utilization figures reported in real experiments. More.

  • We give a detailed example of method 2. More.

  • We compare both methods and conclude they result in similar estimates. More.

  • We discuss the use of profilers to measure compute. More.
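
One simple way to put the two methods side by side is to ask what utilization rate would make the GPU-time estimate match the operation count. The sketch below assumes the same GPU-time formula (training time × number of GPUs × peak performance × utilization) and rearranges it. It is an illustrative consistency check, not necessarily the comparison carried out in the article, and all names and numbers are hypothetical.

```python
def implied_utilization(counted_compute_flop,
                        training_time_s,
                        num_gpus,
                        peak_flop_per_s):
    """Utilization rate at which the GPU-time estimate equals the count.

    counted_compute_flop: training compute from counting operations
    The remaining arguments describe the hardware and the training run.
    """
    peak_total = training_time_s * num_gpus * peak_flop_per_s
    if peak_total <= 0:
        raise ValueError("time, GPU count and peak performance must be positive")
    return counted_compute_flop / peak_total


# Hypothetical run: counted compute of 2.0736e20 FLOP,
# trained on 8 GPUs for 10 days at 1e14 FLOP/s peak each.
u = implied_utilization(2.0736e20, 10 * 24 * 60 * 60, 8, 1e14)
print(f"Implied utilization: {u:.0%}")
```

An implied utilization far outside the 30% to 40% range recommended above, or greater than 100%, suggests that one of the inputs to either method deserves a second look.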

Complete Article

You can find the article here.