I suspect the models’ output tokens become input tokens when the conversation proceeds to the next turn; certainly my API statistics show several times as many input tokens as output despite the fact that my responses are invariably shorter than the models’.
Thanks, that’s a really great point, and I feel silly for not considering it.
I’m not entirely sure offhand how to model that. Naively, the entire conversation is fed in as context on each forward pass, but I think KV caching means that’s not entirely the right way to model it (and how cached tokens are billed probably varies by model and provider). I’m also not sure to what extent those caches persist across conversational turns (and even if they do, presumably they’re dropped once the user hasn’t responded for some length of time).
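To make the naive version concrete, here’s a rough back-of-the-envelope sketch in Python of the “resend the whole history every turn” accounting, with made-up message lengths and ignoring any caching discounts, showing why billed input tokens can end up several times the output tokens even when the user’s messages are much shorter than the model’s:

```python
# Sketch only, not tied to any particular provider's billing. Assumes the full
# conversation history is resent as input on every turn and ignores KV/prompt
# caching discounts. Message lengths are made-up placeholders.

USER_TOKENS_PER_TURN = 50    # hypothetical: short user replies
MODEL_TOKENS_PER_TURN = 400  # hypothetical: longer model responses


def naive_token_totals(turns: int) -> tuple[int, int]:
    """Return (total_input_tokens, total_output_tokens) over `turns` turns,
    assuming the entire prior history is fed back in as input each turn."""
    history = 0
    total_input = 0
    total_output = 0
    for _ in range(turns):
        prompt = history + USER_TOKENS_PER_TURN   # prior turns + new user message
        total_input += prompt
        total_output += MODEL_TOKENS_PER_TURN
        history = prompt + MODEL_TOKENS_PER_TURN  # model reply joins the history
    return total_input, total_output


if __name__ == "__main__":
    inp, out = naive_token_totals(turns=10)
    print(f"input: {inp}, output: {out}, ratio: {inp / out:.1f}x")
    # With these placeholder numbers, after ten turns input tokens exceed
    # output tokens by roughly 5x, even though each user message is an
    # eighth the length of each model response.
```

Under this naive model, input tokens grow roughly quadratically with the number of turns while output tokens grow linearly, which would explain the lopsided API statistics; caching would reduce the effective cost but not the raw input-token count.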