This is probably the decision I am least confident in. Figuring out how to do the accounting on this issue is challenging, and depends a lot on what one wants to use the “cost” of a training run to reason about. Some questions I had in mind when thinking about cost:
If a lone actor wants to train a frontier model, without loans or financial assistance from others, how much capital might they need?
How much money should I expect an AI lab to have spent on training a new frontier model, especially one that is a significant advancement over all prior models (as GPT-4 was)?
What is the largest frontier model it is feasible for any entity to create?
When a company trains a frontier model, how much are they “betting” on the future profitability of AI?
The simple initial way I use to compute cost, then, is to investigate empirical evidence of companies’ expenditures and investment.
Now, these numbers aren’t the same ones a company might care about—they represent expenses without accounting for likely revenue. The argument I find most tempting is that one should look at depreciation cost instead of capital expenditure, effectively subtracting the expected resale value of the hardware from the initial expenditure to purchase the hardware. I have two main reasons for not using this:
Computing depreciation cost is really hard, especially in this rapidly changing environment.
The resale value of an ML GPU is likely closely tied to the profitability of training a model—if it turns out that using frontier models for inference isn’t very profitable, then I’d expect the value of ML GPUs to decrease. Conversely, if inference is very profitable, then the resale value would increase. I think A100s, for example, have had their price substantially impacted by increased interest in AI; it’s not implausible to me that the resale value of an A100 is actually higher than OpenAI’s initial cost.
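To make the depreciation-vs-capex distinction concrete, here is a minimal sketch. All dollar figures are illustrative assumptions, not actual purchase or resale prices:

```python
def depreciation_cost(purchase_price: float, resale_price: float) -> float:
    """Hardware cost after subtracting its expected resale value."""
    return purchase_price - resale_price

# Assumed purchase price per GPU (hypothetical round number, USD).
PURCHASE = 10_000.0

# If inference turns out to be unprofitable, resale value drops and most
# of the capital expenditure becomes a real cost.
print(depreciation_cost(PURCHASE, 4_000.0))   # 6000.0

# If demand surges, resale can exceed the purchase price, and the
# depreciation-adjusted "cost" of the hardware is actually negative.
print(depreciation_cost(PURCHASE, 12_000.0))  # -2000.0
```

The second case is exactly why this accounting is so sensitive: the same training run can look cheap or expensive depending on a resale value that is itself a bet on AI profitability.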
Having said all of this, I’m still not confident I made the right call here.
Also, I am relatively confident GPT-4 was trained only with A100s, and did not use any V100s as the Colab notebook you linked speculates. I expect that GPT-3, GPT-4, and GPT-5 will each have been trained with a different generation of GPUs.
Speaking as someone who has had to manage multi-million dollar cloud budgets (though not in an AI / ML context), I agree that this is hard.
As you note, there are many ways to think about the cost of a given number of GPU-hours. No one approach is “correct”, as it depends heavily on circumstances. But we can narrow it down a bit: I would suggest that the cost is always substantially higher than the theoretical optimum one might get by taking the raw GPU cost and applying a depreciation factor.
As soon as you try to start optimizing costs – say, by reselling your GPUs after training is complete, or reusing training GPUs for inference – you run into enormous challenges. For example:
When is training “complete”? Maybe you discover a problem and need to re-run part of the training process.
You may expect to train another large model in N months, but if you sell your training GPUs, you can’t necessarily be confident (in the current market) of being able to buy new ones on demand.
If you plan to reuse GPUs for inference once training is done… well, it’s unlikely that the day after training is complete, your inference workload immediately soaks up all of those GPUs. Production (inference) workloads are almost always quite variable, and 100% hardware utilization is an unattainable goal.
The actual process of buying and selling hardware entails all sorts of overhead costs, from physically racking and un-racking the hardware, to finding a supplier / buyer, etc.
The closest you can come to the theoretical optimum is if you are willing to scale your workload to the available hardware, i.e. you buy a bunch of GPUs (or lease them at a three-year-commitment rate) and then scale your training runs to precisely utilize the GPUs you bought. In theory, you are then getting your GPU-hours at the naive “hardware cost divided by depreciation period” rate. However, you are now allowing your hardware capacity to dictate your R&D schedule, which is its own implicit cost – you may be paying an opportunity cost by training more slowly than you’d like, or you may be performing unnecessary training runs just to soak up the hardware.
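The gap between the naive rate and reality can be sketched numerically. The per-GPU cost and utilization figure below are assumptions chosen for illustration, not vendor pricing:

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def naive_rate(hw_cost: float, depreciation_years: float) -> float:
    """Theoretical-optimum $/GPU-hour: hardware cost divided by the
    depreciation period, assuming 100% utilization throughout."""
    return hw_cost / (depreciation_years * HOURS_PER_YEAR)

def effective_rate(hw_cost: float, depreciation_years: float,
                   utilization: float) -> float:
    """Cost per *useful* GPU-hour when only a fraction of hours do work
    (idle time between training runs, variable inference load, etc.)."""
    return naive_rate(hw_cost, depreciation_years) / utilization

COST = 25_000.0  # assumed all-in cost per GPU (hardware + racking), USD

print(round(naive_rate(COST, 3), 2))           # 0.95
print(round(effective_rate(COST, 3, 0.6), 2))  # 1.59
```

At an assumed 60% average utilization, the effective cost per useful GPU-hour is already more than 1.5x the naive three-year-depreciation figure, before accounting for any of the buying/selling overheads listed above.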