Isn’t inference memory-bound on the KV cache? If so, then I think “smaller batch size” is probably sufficient to explain the faster inference, and the incremental cost per token to Anthropic of serving 200 TPS instead of 80 TPS is not particularly large. But users are willing to pay much more for 200 TPS (Anthropic hypothesizes).
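FWIW, here’s a rough sketch of the bandwidth-bound argument. Every constant below is my own illustrative assumption (roughly a 70B-param fp16 model, 8-way tensor parallel over H100-class HBM), not anything Anthropic has published:

```python
# Back-of-envelope model of memory-bandwidth-bound decoding.
# Assumption: each decode step must stream all model weights once
# (shared across the batch) plus every request's KV cache.

HBM_BW = 8 * 3.35e12       # bytes/s: assumed 8 GPUs at ~3.35 TB/s each
WEIGHT_BYTES = 140e9       # assumed ~70B params at fp16
KV_BYTES_PER_REQ = 10e9    # assumed KV cache per request at long context

def tokens_per_sec(batch_size: int) -> float:
    """Per-request decode TPS under the bandwidth-bound assumption."""
    bytes_per_step = WEIGHT_BYTES + batch_size * KV_BYTES_PER_REQ
    step_time = bytes_per_step / HBM_BW
    return 1.0 / step_time  # one token per request per step

for b in (1, 4, 16, 64):
    print(f"batch={b:3d}  ~{tokens_per_sec(b):6.1f} TPS per request")
```

Under these made-up numbers, shrinking the batch moves you from the tens of TPS into the high hundreds per request, while total throughput per GPU (and hence revenue per GPU-hour at a fixed price) drops roughly in proportion. That asymmetry is why a fast tier can be priced at a premium.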
Hm, I wonder how it works under the hood. Speculative sampling? Faster hardware?