Isn’t inference memory-bound on the KV cache? If so, then I think “smaller batch size” is probably sufficient to explain the faster inference, and the incremental cost per token to Anthropic of serving 200 TPS instead of 80 TPS is not particularly large. But users are willing to pay much more for 200 TPS (Anthropic hypothesizes).
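FWIW, here’s a rough sketch of the bandwidth-bound argument. Every constant below is my own illustrative assumption (roughly a 70B-param fp16 model, 8-way tensor parallel over H100-class HBM), not anything Anthropic has published:

```python
# Back-of-envelope model of memory-bandwidth-bound decoding.
# Assumption: each decode step must stream all model weights once
# (shared across the batch) plus every request's KV cache.

HBM_BW = 8 * 3.35e12       # bytes/s: assumed 8 GPUs at ~3.35 TB/s each
WEIGHT_BYTES = 140e9       # assumed ~70B params at fp16
KV_BYTES_PER_REQ = 10e9    # assumed KV cache per request at long context

def tokens_per_sec(batch_size: int) -> float:
    """Per-request decode TPS under the bandwidth-bound assumption."""
    bytes_per_step = WEIGHT_BYTES + batch_size * KV_BYTES_PER_REQ
    step_time = bytes_per_step / HBM_BW
    return 1.0 / step_time  # one token per request per step

for b in (1, 4, 16, 64):
    print(f"batch={b:3d}  ~{tokens_per_sec(b):6.1f} TPS per request")
```

Under these made-up numbers, shrinking the batch moves you from the tens of TPS into the high hundreds per request, while total throughput per GPU (and hence revenue per GPU-hour at a fixed price) drops roughly in proportion. That asymmetry is why a fast tier can be priced at a premium.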
Hm, I wonder how it works under the hood. Speculative sampling? Faster hardware?