[Question] Is OpenAI losing money on each request?

While working on another post, I decided to follow up on some details by doing some naive modeling of OpenAI’s LLM API revenue stream. The naive approach seems inadequate, because it implies OpenAI would need many years to break even on the cost of GPUs alone.

  • OpenAI charges the following rates (from the OpenAI pricing page):

    • GPT-3.5 Turbo: input $0.001/​1k tokens, output $0.002/​1k tokens.

    • GPT-4 (non-Turbo): input $0.03/​1k tokens, output $0.06/​1k tokens.

  • How quickly do the GPTs generate tokens? The data comes from people running informal tests on Reddit, in the local LLaMA subreddit of all places. The post is 4 months old, so they were testing 3.5 Turbo and 4 non-Turbo (4 Turbo launched earlier this month).

    • GPT-3.5 Turbo: ~100 tokens/​s

    • GPT-4 (non-Turbo): ~12-13 tokens/​s

    • This is the weakest part of the analysis: it’s just some people doing tests with a stopwatch. If you have a better source, please let me know.

  • With this data, we can calculate revenue for a single model running at 100% utilization.

    • Template: (<? tokens/​s>) · (<? $/​1k tokens>) · (31,557,600s/​year) = $/​year

    • GPT-3.5 Turbo: $6,311.52/​year

    • GPT-4 (non-Turbo): $18,934.56/year (this figure corresponds to 10 tokens/s; at 12.5 tokens/s it would be about $23,668/year)

    • These figures count output tokens only; I’m unsure how to model the input tokens.

  • How much does it cost to run a single model?

    • The last GPT model we have solid numbers for is GPT-3, the largest version of which has 175B parameters. Wikipedia claims it requires 800GB to store, which more or less fits the straightforward 32 bits/parameter · 175B parameters calculation (700GB).

    • 800GB is an order of magnitude larger than the largest GPU memory size, so multiple GPUs are necessary to run a GPT-3 model.

    • People appear to be quite confident that the later models are even larger: there is a Manifold market that is 88% on GPT-4 having over 1 trillion parameters. I will use GPT-3’s numbers as a placeholder for now, since it is still illustrative.

    • Which GPU might be used? Using price points for recent high-end GPUs:

      • H100: has 80GB, costs $30k (HPCwire, news article).

      • A100: has 80GB, costs ~$18k (CDW).

      • As a point of comparison, the consumer side RTX 4070 has 12GB but retails for around $600.

    • Therefore the initial capital outlay to fully load the model across multiple GPUs is:

      • H100: 10 · $30k = $300k

      • A100: 10 · $18k = $180k

      • RTX 4070: 67 · $600 = $40.2k

  • Therefore breaking even on just the GPU capital outlay can take roughly 2 to 48 years, depending on which chips are used and whether the model earns GPT-3.5 or GPT-4 rates. (About 2 years for getting GPT-4 rates from 67 RTX 4070s, and about 48 years for getting GPT-3.5 rates from 10 H100s.)

  • However, the field of AI is moving quickly:

    • Nvidia plans to ramp up production for AI (August 2023, Reuters). Along with the recent spate of 3 new data center GPU architectures in the last 3 years (Wikipedia), it seems likely that Nvidia will continue producing new chip generations.

    • OpenAI is working on GPT-5 (November 2023, Tom’s Guide; original interview is behind the Financial Times paywall). Presumably the new model will use even more resources.

    • With new chips and new models quickly approaching, the lifetime of these current GPUs seems pretty short. Suppose it takes 8 years to recoup costs, but the GPU’s computing power becomes irrelevant within 4: half the cost of the GPU is effectively lost.

  • OpenAI’s prices seem too low to recoup even part of their capital costs in a reasonable time given the volatile nature of the AI industry. Surely I’m missing something obvious?
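The arithmetic above can be reproduced with a short script. This is only a sketch under the post's own assumptions: output tokens only, 100% utilization, 800GB of weights (the GPT-3 placeholder), and retail GPU prices. The function and variable names are made up for illustration, and GPT-4 throughput is set to 10 tokens/s because that is the value that reproduces the $18,934.56 figure.

```python
import math

SECONDS_PER_YEAR = 31_557_600  # 365.25 days, as in the template above
MODEL_SIZE_GB = 800            # GPT-3 placeholder size used in the post

# (output tokens/s, output $ per 1k tokens); throughput is the rough Reddit figure
MODELS = {
    "GPT-3.5 Turbo": (100, 0.002),
    "GPT-4 (non-Turbo)": (10, 0.06),  # 10 tokens/s reproduces $18,934.56/year
}

# (memory in GB, unit price in $), from the price points listed above
GPUS = {
    "H100": (80, 30_000),
    "A100": (80, 18_000),
    "RTX 4070": (12, 600),
}


def annual_revenue(tokens_per_s, usd_per_1k_tokens):
    """Revenue per year for one model instance running flat out."""
    return tokens_per_s * usd_per_1k_tokens / 1000 * SECONDS_PER_YEAR


def gpu_outlay(mem_gb, unit_price, model_gb=MODEL_SIZE_GB):
    """Number of GPUs needed to hold the weights, and their total price."""
    count = math.ceil(model_gb / mem_gb)
    return count, count * unit_price


def break_even_years(outlay, revenue_per_year):
    """Years of revenue needed to pay back the GPU purchase alone."""
    return outlay / revenue_per_year


for model, (tps, price) in MODELS.items():
    revenue = annual_revenue(tps, price)
    for gpu, (mem_gb, unit_price) in GPUS.items():
        count, outlay = gpu_outlay(mem_gb, unit_price)
        years = break_even_years(outlay, revenue)
        print(f"{model} on {count} x {gpu}: ${outlay:,} outlay, "
              f"{years:.1f} years to break even")
```

The two extremes quoted above fall out of the same loop: GPT-3.5 rates on H100s at the slow end, and GPT-4 rates on RTX 4070s at the fast end.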

Other Factors

  • Other factors I didn’t include in the model above, which may make cost/​revenue increase/​decrease:

    • COST+: the model not only has to pay for itself, it needs to pay for training costs/​the data center/​electricity/​staff/​buildings for staff/​free tier queries.

    • COST+: these models aren’t going to be used at 100% utilization; user numbers will ebb and flow over the course of a day, and GPUs are physical objects with failure rates.

    • COST+: the actual models are likely larger than GPT-3, so the GPU costs would be even larger.

    • REV+: the revenue from input tokens isn’t included; perhaps (pure speculation) this would push revenue higher by 2x?

    • REV+: perhaps the token generation rates we can see are misleading. If a model is time shared aggressively, it may be serving 100 tokens/s to many users at once. This seems somewhat weird to me. Why would you, for example, produce 10 tokens for user A, then 10 for user B, and so on? Wouldn’t you need to constantly re-pay setup costs for each user? Would the gains in fairness/uniformity of response times really be worth it? But maybe I’m overestimating how difficult this would be to engineer.

    • COST-: the 800GB model size comes from treating each parameter as a full 32 bit float, but the ML field has been trending towards less precision. If all the parameters are 16 bits, then we cut memory requirements (and therefore GPU costs) in half. 8 bit quantization is also possible, but my understanding is that a full 8 bit model can be unstable (“The main problem with using 8-bit precision is that transformers can get very unstable with so few bits”, buried in this post about DL GPUs), so it seems unlikely that the GPTs are able to fully cut their size to 1/4th.

    • COST-: the value of a GPU doesn’t depreciate to $0 as soon as a new GPT version/​Nvidia architecture comes out. As a speculative example, smaller AI shops may be willing to snap up cheaper A100s in 4 years when OpenAI is no longer using them, recouping some costs. If the chip shortage is still ongoing the GPUs may even keep most of their value.

      • However, relying on this seems like a mistake; why would you eat this depreciation risk if you didn’t need to?

    • COST+: surely GPT-3+, the poster children of generative AI, are not scraping together lower end RTX 4070s to do inference? The cost/GB numbers are good, but I completely ignored basically every other performance metric you may want from a GPU.

    • REV+: perhaps the actual token generation rates aren’t so slow, and fixed costs like network transit time dominate and make token rates look much slower?

      • However, eyeballing the raw data from the Reddit thread, any fixed cost isn’t obvious: generating 100 tokens and generating 700 tokens both land in the same 11-14 tokens/s range.

    • COST-: perhaps I’ve simply misunderstood how these large models are run. Instead of running many GPUs in parallel, perhaps each model only runs on one GPU. The H100 model with the smallest GPU memory bandwidth can load its full 80GB memory 25 times/second (2TB/s, see the H100 PCIe under Product Specifications), enough to theoretically pipe an entire 800GB model through the GPU in about 0.4 seconds.

      • However, even if it works this seems pretty wasteful to me: why not instead run the GPUs in parallel and not pay the memory loading costs over and over? Wouldn’t that lead to much better latency and better hardware utilization?

    • I used retail pricing for everything, while OpenAI is likely sourcing their GPUs cheaper directly from the manufacturer.

      • However, I would also assume that OpenAI offers B2B pricing to large customers as well, so the savings in cost may well be balanced out by lower revenue.

    • OpenAI is still effectively a startup, so it might be fine simply losing lots of money.
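To see how much some of these factors could move the break-even time, here is a rough sensitivity sketch. Everything in it is an assumption layered on the naive model: the parameter names are hypothetical, the 800GB figure is scaled linearly with bits per parameter, and the utilization, input-token multiplier, and number of concurrent streams are illustrative knobs rather than anything known about OpenAI's setup.

```python
import math

SECONDS_PER_YEAR = 31_557_600  # 365.25 days


def break_even_years(
    tokens_per_s,            # output tokens/s seen by a single user
    usd_per_1k_output,       # output price in $ per 1k tokens
    gpu_mem_gb,              # memory per GPU in GB
    gpu_price,               # price per GPU in $
    fp32_model_gb=800,       # placeholder model size at 32 bits/parameter
    bits_per_param=32,       # COST-: lower precision shrinks the model
    utilization=1.0,         # COST+: fraction of the year spent serving
    revenue_multiplier=1.0,  # REV+: e.g. 2.0 if input tokens double revenue
    concurrent_streams=1,    # REV+: users served at once (pure assumption)
):
    """Years to pay back the GPUs alone under the adjusted assumptions."""
    model_gb = fp32_model_gb * bits_per_param / 32
    gpu_count = math.ceil(model_gb / gpu_mem_gb)
    outlay = gpu_count * gpu_price
    revenue_per_year = (
        tokens_per_s * concurrent_streams * usd_per_1k_output / 1000
        * SECONDS_PER_YEAR * utilization * revenue_multiplier
    )
    return outlay / revenue_per_year


# GPT-3.5 Turbo rates on H100s, varying one assumption at a time.
scenarios = {
    "naive baseline": {},
    "16-bit weights": {"bits_per_param": 16},
    "input tokens double revenue": {"revenue_multiplier": 2.0},
    "50% utilization": {"utilization": 0.5},
    "8 concurrent streams": {"concurrent_streams": 8},
}
for name, overrides in scenarios.items():
    years = break_even_years(100, 0.002, 80, 30_000, **overrides)
    print(f"{name}: {years:.1f} years")
```

Because the factors multiply, the speculative ones dominate: a guess about how many users share one model instance changes the answer far more than the choice between 32 and 16 bit weights.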
