This work was done as part of the MIRI Technical Governance Team. It reflects my views and may not reflect those of the organization.
Summary
I performed some quick analysis of the pricing offered by different LLM providers using public data from ArtificialAnalysis. These are the main results:
Pricing for the same model differs substantially across providers, often with a price range of 10x.
For a given provider (such as AWS Bedrock), there is notable variation in price at a given model size. Model size is still a good predictor of price.
I estimate that many proprietary models are sold at a substantial mark-up. This is not surprising, given development costs and limited competition to serve the model.
Mixture-of-Experts (MoE) models, which activate only a subset of parameters during inference, are often billed similarly to a dense model sized between the active and total parameter counts—typically leaning toward the higher, total parameter count.
I looked into this topic as part of a shallow dive on inference costs. I’m interested in how inference costs have changed over time and what these trends might imply about a potential intelligence explosion. However, this analysis shows a large variation in sticker prices, even to serve the same model, so sticker prices are likely not a very precise tool for studying inference cost trends. This project was done in a few days and should be seen as “a shallow dive with some interesting findings” rather than a thorough report.
Data about real-world LLM inference economics is hard to come by, and the results here do not paint a complete picture. I’m publishing this partly in the spirit of Cunningham’s Law: if you have firsthand knowledge about inference economics, I would love to hear it (including privately)!
Pricing differs substantially across providers when serving the same model
Open-weight LLMs are a commodity: whether Microsoft Azure is hosting the model or Together.AI is hosting the model, it’s the exact same model. Therefore, we might naively expect the price for a particular model to be about the same across different providers, much as gas prices are similar across different gas stations. But we actually see something very different. For a given model, the most expensive provider often charges 10x or more what the cheapest provider charges.
ArtificialAnalysis lists inference prices charged by various companies, allowing easy comparison across providers. Initially, we’ll look at models that are fully open-weight (anybody can serve them without a license). This image from ArtificialAnalysis shows how different providers stack up for serving the Llama-3.1-70B Instruct model:[1]
The x-axis is the price. The cheapest provider is offering this model for a price of $0.20 per million tokens, the median provider is at ~$0.70, and the most expensive provider is charging $2.90.
This amount of price variation appears quite common. Here’s the analogous chart for the Llama-3.1-405B Instruct model:
The cheapest provider is offering this model for a price of $0.90 per million tokens, the median provider is at ~$3.50, and the most expensive provider is charging $9.50. There is somewhat less spread for models with fewer providers. For instance, the Mixtral-8x22B-instruct model has five providers whose prices range from $0.60 to $3.00.
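To make the spread concrete, the prices quoted above can be turned into max/min ratios. This is a minimal sketch that uses only the figures stated in the text; the dictionary layout and the `spread` helper are illustrative names, not part of the original analysis.

```python
# Cheapest and most expensive prices (USD per million tokens) quoted in the text.
quoted_prices = {
    "Llama-3.1-70B Instruct": (0.20, 2.90),
    "Llama-3.1-405B Instruct": (0.90, 9.50),
    "Mixtral-8x22B-instruct": (0.60, 3.00),
}


def spread(cheapest: float, priciest: float) -> float:
    """Ratio of the most expensive price to the cheapest price."""
    return priciest / cheapest


for model, (low, high) in quoted_prices.items():
    print(f"{model}: {spread(low, high):.1f}x")
```

The two Llama models come out at roughly 14.5x and 10.6x, while Mixtral, with only five providers, is at about 5x.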
Overall, LLMs do not appear to be priced like other commodities. There are huge differences in the price offered by different providers and this appears to be true for basically all open-weight models.
Pricing differs for models in the same size class, even for a particular provider
One natural question to ask is, how well does model size predict price? Intuitively, the size of the model should be one of the main determinants of cost, and we might expect similarly sized models to be priced similarly. First, let’s look at fully open-weight dense models on AWS Bedrock.
Note that this data is thrown off by Llama 3 70B, a model priced at $2.86. By contrast, Llama 3.1 70B is priced at only $0.72. But, overall, what we’re seeing here is that model size is fairly predictive of price, and there is some variation between models of the same size (even excluding this outlier).
Let’s also look at prices on Deep Infra, a provider that tends to have relatively low prices.
Without any odd outliers, we see a much nicer fit, with an R^2 value of 0.904. For those curious, a quadratic fits this data somewhat better, with an R^2 of 0.957. Some of the models around the 70B size class are: Llama 3.3 70B (Turbo FP8) ($0.20), Llama 3.3 70B ($0.27), Llama 3.1 70B ($0.27), Qwen2.5 72B ($0.27), and Qwen2 72B ($0.36).
So on Deep Infra we see the same trend: model size does predict price pretty well, but there is also variation in price at a given model size.
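The within-class variation can be quantified from the prices listed above. The sketch below measures how far each roughly-70B model on Deep Infra sits from the class median, and repeats the AWS Bedrock comparison between Llama 3 70B and Llama 3.1 70B. Only prices quoted in the text are used; the variable names are illustrative.

```python
from statistics import median

# Deep Infra prices (USD per million tokens) for ~70B models, as quoted in the text.
deep_infra_70b = {
    "Llama 3.3 70B (Turbo FP8)": 0.20,
    "Llama 3.3 70B": 0.27,
    "Llama 3.1 70B": 0.27,
    "Qwen2.5 72B": 0.27,
    "Qwen2 72B": 0.36,
}

class_median = median(deep_infra_70b.values())

# Percentage deviation of each model's price from the class median.
deviation_pct = {
    name: 100 * (price - class_median) / class_median
    for name, price in deep_infra_70b.items()
}

for name, pct in deviation_pct.items():
    print(f"{name}: {pct:+.0f}% vs. median ${class_median:.2f}")

# The AWS Bedrock outlier mentioned earlier: Llama 3 70B vs. Llama 3.1 70B.
aws_ratio = 2.86 / 0.72
print(f"AWS Llama 3 70B / Llama 3.1 70B: {aws_ratio:.1f}x")
```

On these numbers, the Deep Infra models sit within about a third of the class median, whereas the AWS pair differs by almost 4x.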
Estimating marginal costs of inference
I’m interested in understanding the trends in LLM inference costs over time because this could help predict dynamics of an intelligence explosion or AI capabilities diffusion. Therefore, my original motivation for looking at inference prices was the hope that these would be predictive of underlying costs. Unfortunately, providers can mark up their prices to make a profit, so prices might not be useful for predicting costs.
However, simple microeconomics comes to the rescue! In a competitive market with many competing firms, we should expect prices to approach marginal costs. If prices are substantially above marginal costs, money can be made by new firms undercutting the competition. We can estimate marginal cost based on the minimum price for a given model across all providers. This approach makes the assumption that the cheapest providers are still breaking even, an assumption that may not hold.[2]
Let’s look at some of the dense, open-weight models that have many providers, and compare the size of the model to the lowest price from any provider. We’ll look at the models with the most competition (providers) and add a few other models for size variation. Across 9 models we have a mean of 8.7 providers.
Here’s that same graph, but zoomed in to only show the smaller models:
Model size is again a good predictor of price, even better than when we looked at a single provider above. And look at how cheap the models are! Eight billion parameter models are being served at $0.03, 70B models at $0.20, and the Llama 405B model at $0.90.
If we assume these minimum prices approximate the marginal cost of serving a model (again, a somewhat dubious assumption), we can predict costs for larger models. The best-fit line for these prices implies a fixed cost of $0.03 per million tokens and a variable cost of about $0.02 per million tokens for every ten billion parameters. This would predict that a 10 trillion parameter dense model would cost about $22 per million tokens. Alternatively, by using the best-fit line from AWS’s prices, we get that AWS might offer a 10 trillion parameter model for about $60 per million tokens.
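As a rough check on this extrapolation, one can fit a least-squares line to the three minimum prices quoted above (8B at $0.03, 70B at $0.20, 405B at $0.90). This is only a sketch: the fit in the post uses nine models, most of which are not listed in the text, so the coefficients here are an approximation rather than the original ones. `fit_line` and `predicted_price` are hypothetical helper names.

```python
# Minimum prices quoted in the text: (parameters in billions, USD per million tokens).
# Assumption: the post's own fit uses 9 models; only these three are stated.
points = [(8, 0.03), (70, 0.20), (405, 0.90)]


def fit_line(data):
    """Ordinary least-squares fit y = intercept + slope * x; returns (intercept, slope)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(points)


def predicted_price(params_billions: float) -> float:
    """Extrapolated price (USD per million tokens) for a dense model of the given size."""
    return intercept + slope * params_billions


print(f"fixed cost: ${intercept:.3f} per million tokens")
print(f"variable cost: ${slope * 10:.4f} per million tokens per 10B parameters")
print(f"10T-parameter dense model: ${predicted_price(10_000):.2f} per million tokens")
```

Even with just three points, the fit lands near the $0.03 fixed cost and the roughly $0.02 per ten billion parameters reported above, and the 10 trillion parameter extrapolation comes out close to $22.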
Proprietary models are probably marked up significantly
Can we also use these trends to predict the size of proprietary models? Not exactly. Proprietary models don’t have a nice dynamic of price competition; instead providers can charge much higher rates. But these minimum prices are still useful because they basically tell us “this is the largest model size that the market knows how to serve at some price without losing money”, at least if we’re assuming nobody is losing money. This turns out to not be a very interesting analysis, so I’ll be brief. Here are the expected maximum dense parameter counts for some proprietary models based on their price:
Claude 3.5 Sonnet: $6.00, 2.8 trillion parameters
Claude 3 Haiku: $0.50, 217 billion parameters
GPT-4o (2024-08-06): $4.38, 2 trillion parameters
Gemini 1.0 Pro (AI Studio): $0.75, 332 billion parameters
Based on model evaluations and vibes, I expect these are higher than the actual parameter counts, except for perhaps Gemini.
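These parameter estimates come from inverting the price trend line. The sketch below shows the arithmetic under one loudly stated assumption: the slope is back-solved from a single quoted pair (Claude 3 Haiku at $0.50 and 217 billion parameters) together with the $0.03 fixed cost, because the text only gives the slope rounded to $0.02. The other three figures then serve as a consistency check.

```python
INTERCEPT = 0.03  # USD per million tokens, the fixed cost stated in the text

# Assumption: recover the unrounded slope from one quoted pair
# (Claude 3 Haiku: $0.50 <-> 217B parameters) instead of the rounded $0.02.
SLOPE = (0.50 - INTERCEPT) / 217  # USD per million tokens per billion parameters


def max_dense_params_billions(price: float) -> float:
    """Largest dense model (in billions of parameters) the trend line prices at `price`."""
    return (price - INTERCEPT) / SLOPE


# Prices and parameter estimates as quoted in the text.
quoted = {
    "Claude 3.5 Sonnet": (6.00, 2800),
    "GPT-4o (2024-08-06)": (4.38, 2000),
    "Gemini 1.0 Pro (AI Studio)": (0.75, 332),
}

for model, (price, quoted_billions) in quoted.items():
    estimate = max_dense_params_billions(price)
    print(f"{model}: ~{estimate:,.0f}B (text: {quoted_billions:,}B)")
```

With that single calibration point, the remaining three estimates agree with the quoted figures to within rounding.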
Let’s look at the original GPT-4 as a case study. The model was priced at $37.50, and it is rumored to be a Mixture-of-Experts model. According to this analysis of MoE inference costs, the model should cost about as much as a 950 billion parameter dense model. We’ll assume that the cost of serving such a model today is about the same as it was in early 2023 when GPT-4 was first available (a very dubious assumption). Then we can compare the expected price to serve the model today (under various conditions) against the price charged in March 2023. Using the minimum-provider-price trend, we get that the model was priced at about 18x marginal cost; using Azure’s price trend, we get that the model was priced at about 2x marginal cost; and using TogetherAI’s price trend, we get that the model was priced at about 4.5x marginal cost. There’s substantial uncertainty in these numbers, and we know of many ways that it has gotten cheaper to serve models over time (e.g., FlashAttention), so today’s trend lines likely understate what it cost to serve GPT-4 in early 2023. As a rough overall estimate, I think the original GPT-4 was likely served at somewhere between 1x and 10x marginal cost.
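The markup multiples above can be read in reverse: dividing GPT-4’s price by each multiple gives the marginal cost that each trend line implies for a 950 billion parameter dense-equivalent model. A minimal sketch using only the numbers quoted in the text (the dictionary keys are just labels):

```python
GPT4_PRICE = 37.50  # USD per million tokens (blended), as stated in the text

# Markup multiples quoted in the text for a ~950B dense-equivalent model.
quoted_markups = {
    "minimum-provider-price trend": 18.0,
    "Azure price trend": 2.0,
    "TogetherAI price trend": 4.5,
}

# Implied marginal cost = sticker price / markup multiple.
implied_cost = {
    trend: GPT4_PRICE / multiple for trend, multiple in quoted_markups.items()
}

for trend, cost in implied_cost.items():
    print(f"{trend}: ~${cost:.2f} per million tokens")
```

This gives roughly $2.08, $18.75, and $8.33 per million tokens respectively, a 9x range in the cost estimate depending on which provider trend is used.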
MoE models are generally priced between their active and total parameter count
How do the prices for Mixture-of-Experts (MoE) models compare to standard dense models? Fortunately, there are a few MoE models that have multiple providers, so we can apply similar reasoning based on the lowest cost from any of these providers. Let’s look at the data from above but adding in MoE models. Each MoE model will get two data points, one for its active (per-token) parameter count and one for its total parameter count. We’re basically asking, “for the price of an MoE model, what would be the dense-model parameter equivalent, and how does that compare to the active and total parameter counts?”
A couple of quick notes on the data here. The DBRX model has only two providers, and it requires a license for large companies to serve, so it could be overpriced. The DeepSeek-V3 model also has a provider offering it for cheaper ($0.25) at FP8 quantization, which we exclude.
The trendline for dense models falls between the active and total parameter count for 3 of the 5 MoE models. The other two MoE models are over-priced compared to the dense trends, even when looking at their total parameter count. Assuming this price reflects marginal cost, this data would indicate that MoE models have a similar cost to a dense model somewhere between their active and total parameter count, perhaps closer to the total parameter count. However, the data is quite noisy and there are fewer providers for these MoE models than the dense models (means of 6 and 8.7 providers, respectively).
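The comparison described above amounts to converting an MoE model’s minimum price into a dense-equivalent parameter count and checking where it falls relative to the active and total counts. The sketch below illustrates this with Mixtral-8x22B. Several things here are assumptions rather than figures from this section: the trend coefficients ($0.03 intercept and about $0.0217 per ten billion parameters, in line with the rounded values given earlier), the parameter counts (39B active, 141B total, from Mistral’s public model description), and the use of the $0.60 minimum price quoted earlier.

```python
def dense_equivalent_billions(price: float,
                              intercept: float = 0.03,
                              slope_per_billion: float = 0.00217) -> float:
    """Dense parameter count (billions) the assumed trend line prices at `price`."""
    return (price - intercept) / slope_per_billion


def position(price: float, active_b: float, total_b: float) -> str:
    """Where the dense-equivalent size falls relative to active/total parameters."""
    equivalent = dense_equivalent_billions(price)
    if equivalent < active_b:
        return "below active"
    if equivalent <= total_b:
        return "between active and total"
    return "above total"


# Assumed inputs: $0.60 minimum price, 39B active and 141B total parameters.
mixtral_equivalent = dense_equivalent_billions(0.60)
mixtral_position = position(0.60, active_b=39, total_b=141)
print(f"Mixtral-8x22B: ~{mixtral_equivalent:.0f}B dense-equivalent ({mixtral_position})")
```

Under these assumed coefficients, Mixtral-8x22B would be priced like a dense model larger than its total parameter count; a different trend line or minimum price could change that classification.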
We can also replicate this analysis for a single model provider, and we get similar results. On Nebius, Deep Infra, and Together.AI the trend line for dense model prices indicates that MoE models have a dense-model-equivalent cost somewhere between their active and total parameter count, or a bit higher than total parameter count. Here’s the graph for Nebius:
Input and output token prices
The ratio between prices for input and output tokens varies across providers and models, generally falling between 1x and 4x (i.e., output tokens are at least as expensive as input tokens, but less than 4 times as expensive). There are a few providers that price input and output tokens equally, including relatively cheap providers. This price equivalence is somewhat surprising given that the models most people use (OpenAI, Anthropic, and Google models) price output tokens at 3-5x the price of input tokens. It’s generally believed that input tokens should be cheaper to process than output tokens because inputs can be processed in parallel with good efficiency.
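The single prices used throughout this post are a 3:1 input:output blend, so it can help to see how separate input and output prices combine. The sketch below is illustrative: the per-token list prices are not given in the post and are filled in from the providers’ public pricing as I recall it, so treat them as assumptions. `blended_price` is a hypothetical helper name.

```python
def blended_price(input_price: float, output_price: float,
                  input_share: int = 3, output_share: int = 1) -> float:
    """Weighted average price per million tokens for an input:output token mix."""
    total = input_share + output_share
    return (input_share * input_price + output_share * output_price) / total


# Assumed list prices (USD per million input tokens, per million output tokens).
# These are NOT from the post; they are recalled from public pricing pages.
list_prices = {
    "GPT-4 (original, 8K context)": (30.00, 60.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "GPT-4o (2024-08-06)": (2.50, 10.00),
}

for model, (inp, out) in list_prices.items():
    ratio = out / inp
    print(f"{model}: blended ${blended_price(inp, out):.3f}, output/input {ratio:.0f}x")
```

If those list prices are right, the 3:1 blend reproduces the $37.50, $6.00, $0.50, and roughly $4.38 figures used earlier, and the output/input ratios for the three newer models fall in the 3-5x band mentioned above.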
Hypotheses about price variance
The large price range for serving a particular model is deeply confusing. Imagine you went to a gas station where the price was $4.00, looked across the street, and saw another gas station charging $40.00. That’s basically the situation we currently see with LLM inference. Open-weight LLMs are a commodity: it’s the same thing being sold by Azure as by Together.AI! I discussed this situation with a few people and currently have a few hypotheses for why prices might differ so much for providing inference.
First, let’s talk about the demand side, specifically the fact that inference on open-weight models is not exactly a commodity. There are a few key measures a customer might care about that could differ for a particular model:
Price (input, output, total based on use case).
Rate limits and availability.
Uptime.
Speed (time to first token, output speed or time per output token, and total response time, which is a combination of these). On a brief look at the speed vs. cost relationship, there are some models where cheaper providers are slower, as expected, but there are others where this is not the case.
Context length.
Is it actually the same model? It’s possible that some providers are slightly modifying a model that they are serving for inference, for example by quantizing some of the computation, so that the model differs slightly across providers. ArtificialAnalysis indicates that some providers are doing this (and I avoided counting those models in the minimum-price analysis), but others could be doing this secretly and users may not know.
Outside of considerations about the model being served, there might be other reasons providers differ, from a customer’s perspective:
Cheaper LLM providers might be unreliable or otherwise worse to do business with.
Existing corporate deals might make it easier to use pricier LLM providers such as Azure and AWS. If you are an employee and your company already uses AWS for many services, this could be the most straightforward option.
Maybe the switching costs for providers are high and customers therefore get locked into using relatively expensive providers. This strikes me as somewhat unlikely given that switching is often as simple as replacing a few lines of code.
Maybe customers don’t bother to shop around for other providers. Shopping around can be slightly annoying (some providers make it difficult to find the relevant information). Inference expenses might also be low, in absolute terms, for some customers, such that shopping around isn’t worthwhile (though this doesn’t seem like it should apply at a macro-level).
On the supply side, the price is going to be affected by how efficiently one can serve the model and how much providers want to profit (or how much they’re willing to lose). I expect various factors to affect provider costs:
Hardware infrastructure. Different AI chips face different total cost of ownership and different opportunity costs. There is also substantial variation in prices for particular hardware (e.g., the price to rent H100 GPUs). Other AI hardware, such as interconnect, could also affect the efficiency with which different providers can serve models.
Software infrastructure. There are likely many optimizations that make it cheaper to serve a particular model, such as writing efficient CUDA kernels, using efficient parallelism, and using speculative decoding effectively. These could change much more quickly than hardware infrastructure. Some infrastructure likely benefits from economies of scale, e.g., it only makes sense to hire the inference optimization team if there is lots of inference for them to optimize.
Various tradeoffs on key metrics. As mentioned, providers vary in the speed at which they serve models. Some providers may simply choose to offer models at different points along these various tradeoff curves.
The sticker prices could also be unreliable. First, sticker prices do not indicate usage, and it is possible that expensive providers don’t have very much traffic. Second, providers could offer substantial discounts below sticker price; for instance, Google offers some models completely for free (excluded from the minimum-price analysis) and gives cloud credits to many customers.
Future analysis could attempt to create a comprehensive model for predicting inference prices. I expect the following variables will be key inputs to a successful model (but the data might still be too noisy): model size, number of competitors serving the model, and model performance relative to other models. The various other measures discussed above could also be useful. I expect model performance explains much of LLM pricing (e.g., that of Claude 3.5 Haiku); per ArtificialAnalysis:
Final thoughts
This investigation revealed numerous interesting facts about the current LLM inference sector. Unfortunately, it is difficult to draw strong conclusions about the underlying costs of LLM inference because prices vary substantially across providers.
The data used in this analysis is narrow, so I recommend against coming to strong conclusions solely on its basis. Here are the spreadsheets used, based on data from ArtificialAnalysis. Again, please reach out if you have firsthand information about inference costs that you would like to share.
Thank you to Tom Shlomi, Tao Lin, and Peter Barnett for discussion.
[1] The data in this post was collected from ArtificialAnalysis in December 2024 or January 2025. Prices are for a 3:1 input:output blend of one million tokens. This spreadsheet includes most of the analysis and graphs.
[2] One reason to expect the cheapest model providers to be losing money is that this is likely an acceptable business plan for many of them. It is very common for businesses to initially operate at a loss in order to gain market share. Later they may raise prices or reduce their costs via returns to scale. The magnitude of these losses is not large yet: the cost of LLM inference in 2024 (excluding model development) was likely in the single-digit billions of dollars. This is relatively small compared to the hundreds of billions of AI CapEx mainly going toward future data centers; it is plausible that some companies would just eat substantial losses on inference.
Observations About LLM Inference Pricing