Assume a model takes 3e9 flops to infer the next token, and these chips run as fast as H100s, i.e. 3e15 flops/s. A single chip can infer 1e6 tokens/s. If you have 10M active users, then 100 chips can provide each user a token every 100ms, around 600 wpm.
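A minimal back-of-the-envelope sketch of that arithmetic, taking the quoted 3e9 flops/token and 3e15 flops/s figures as given (at ~1 word per token, 10 tokens/s per user is one token every 100ms, or ~600 wpm):

```python
# Rough check of the estimate above; all inputs are the quoted assumptions.
FLOPS_PER_TOKEN = 3e9       # assumed inference cost per token
CHIP_FLOPS_PER_S = 3e15     # assumed chip throughput (~H100-class)
NUM_CHIPS = 100
ACTIVE_USERS = 10e6

tokens_per_s_per_chip = CHIP_FLOPS_PER_S / FLOPS_PER_TOKEN   # 1e6 tokens/s
total_tokens_per_s = tokens_per_s_per_chip * NUM_CHIPS       # 1e8 tokens/s
tokens_per_s_per_user = total_tokens_per_s / ACTIVE_USERS    # 10 tokens/s
ms_per_token = 1000 / tokens_per_s_per_user                  # 100 ms
wpm = tokens_per_s_per_user * 60                             # ~600 wpm at ~1 word/token

print(tokens_per_s_per_chip, tokens_per_s_per_user, ms_per_token, wpm)
```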
These numbers seem wrong. I think inference flops per token for powerful models is closer to 1e12-1e13. (Roughly the parameter count for dense models, since a forward pass takes about 2 flops per parameter per token.)
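Re-running the same serving estimate with 1e12-1e13 flops/token (a rough sketch, carrying over the 3e15 flops/s chip and the 10 tokens/s per user from the numbers above):

```python
# Same estimate as before, but with flops/token in the 1e12-1e13 range.
CHIP_FLOPS_PER_S = 3e15
ACTIVE_USERS = 10e6
TOKENS_PER_S_PER_USER = 10  # one token per 100ms, as in the original estimate

for flops_per_token in (1e12, 1e13):
    tokens_per_s_per_chip = CHIP_FLOPS_PER_S / flops_per_token
    chips_needed = ACTIVE_USERS * TOKENS_PER_S_PER_USER / tokens_per_s_per_chip
    # ~33,000 chips at 1e12 flops/token, ~330,000 at 1e13 (vs. 100 in the original estimate)
    print(f"{flops_per_token:.0e} flops/token -> ~{chips_needed:,.0f} chips")
```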
More generally, I think expecting a similar amount of money to be spent on inference as on training is broadly reasonable. So, if a future powerful model is trained for $1 billion, then spending $1 million to design custom inference chips is fine (though I expect the design cost would be higher than this in practice).