> almost all tokens are input tokens (for example Sonnet 4 gets 98% of API tokens as input tokens).
[EDIT: I was being silly and not considering that the entire conversation is fed back to the LLM as context]
A bit tangential, but this is really wild. I’m confident that can’t be true for chat sessions, so it has to be API usage. It does seem to vary substantially by model, eg for R1 it looks to be more like 80–85%. It makes some sense that it would be especially high for Anthropic, since they’ve focused on the enterprise market.
What kinds of use cases are nearly all input? The main one that I immediately see is using LLMs essentially as classifiers; eg you show an LLM a post by a user and ask whether it’s toxic, or show it a bunch of log data and ask whether it contains errors. Maybe coding is another, where the LLM reads lots of code and then writes a much smaller amount, or points out bugs. Are there others that seem plausibly very common?
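To make the classifier pattern concrete, here's a minimal sketch using the OpenAI Python SDK (the model name, prompt, and file are placeholders of my own; any chat-completions-style API would look similar). The input is a long document and the output is capped at a couple of tokens, so usage skews almost entirely toward input:

```python
# Minimal sketch of the "LLM as classifier" pattern, where nearly all
# billed tokens are input. Assumes the OpenAI Python SDK; the model
# name, prompt, and input file are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

long_post = open("user_post.txt").read()  # e.g. a few thousand tokens

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": "Answer with exactly one word: TOXIC or OK."},
        {"role": "user", "content": long_post},
    ],
    max_tokens=2,  # cap the output: the one-word label is the entire response
)

usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens)
# For a ~3000-token post this prints something like: 3010 1
# i.e. >99% of the tokens for this call are input tokens.
```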
I suspect the models’ output tokens become input tokens when the conversation proceeds to the next turn; certainly my API statistics show several times as many input tokens as output despite the fact that my responses are invariably shorter than the models’.
Thanks, that’s a really great point, and I feel silly for not considering it.
I’m not entirely sure offhand how to model that. Naively, the entire conversation is fed in as context on each forward pass, but I think KV caching means that’s not entirely the right way to model it (and how cached tokens are billed probably varies by model and provider). I’m also not sure to what extent those caches persist across conversational turns (and even if they do, presumably they’re dropped once the user hasn’t responded for some length of time).
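To put rough numbers on the naive no-caching picture: here's a back-of-the-envelope toy model (my own illustrative token counts, not anyone's real billing data) where each turn resends the full history as input. Even with replies six times longer than the user's messages, input tokens dominate, and the input share grows with conversation length:

```python
# Toy model of multi-turn token accounting, assuming the full history
# is resent as input on every turn (no caching). The per-turn token
# counts are illustrative assumptions, not real usage data.

def token_totals(turns, user_tokens=50, assistant_tokens=300):
    total_input = total_output = 0
    history = 0  # tokens accumulated in the conversation so far
    for _ in range(turns):
        total_input += history + user_tokens   # full history + new user message
        total_output += assistant_tokens
        history += user_tokens + assistant_tokens
    return total_input, total_output

for turns in (1, 5, 10, 20):
    inp, out = token_totals(turns)
    print(f"{turns:2d} turns: {inp:6d} in / {out:5d} out "
          f"({inp / (inp + out):.0%} input)")

#  1 turns:     50 in /   300 out (14% input)
#  5 turns:   3750 in /  1500 out (71% input)
# 10 turns:  16250 in /  3000 out (84% input)
# 20 turns:  67500 in /  6000 out (92% input)
```

If cached prefix tokens are billed at a discount (Anthropic, for instance, advertises cache reads at a fraction of the base input price, though the details vary by provider), the billed cost skews less heavily toward input, but the raw token counts in usage statistics would still look like this.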