Assume a model takes 3e9 flops to infer the next token, and these chips run as fast as H100s, i.e. 3e15 flops/s. A single chip can infer 1e6 tokens/s. If you have 10M active users, then 100 chips can provide each user a token every 100ms, around 600 wpm.
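A minimal back-of-the-envelope sketch of that arithmetic, taking the quoted 3e9 flops/token and 3e15 flops/s figures as given (at ~1 word per token, 10 tokens/s per user is one token every 100ms, or ~600 wpm):

```python
# Rough check of the estimate above; all inputs are the quoted assumptions.
FLOPS_PER_TOKEN = 3e9       # assumed inference cost per token
CHIP_FLOPS_PER_S = 3e15     # assumed chip throughput (~H100-class)
NUM_CHIPS = 100
ACTIVE_USERS = 10e6

tokens_per_s_per_chip = CHIP_FLOPS_PER_S / FLOPS_PER_TOKEN   # 1e6 tokens/s
total_tokens_per_s = tokens_per_s_per_chip * NUM_CHIPS       # 1e8 tokens/s
tokens_per_s_per_user = total_tokens_per_s / ACTIVE_USERS    # 10 tokens/s
ms_per_token = 1000 / tokens_per_s_per_user                  # 100 ms
wpm = tokens_per_s_per_user * 60                             # ~600 wpm at ~1 word/token

print(tokens_per_s_per_chip, tokens_per_s_per_user, ms_per_token, wpm)
```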
These numbers seem wrong. I think inference flops per token for powerful models is closer to 1e12-1e13. (Roughly the parameter count for dense models, since a forward pass takes about 2 flops per parameter per token.)
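Re-running the same serving estimate with 1e12-1e13 flops/token (a rough sketch, carrying over the 3e15 flops/s chip and the 10 tokens/s per user from the numbers above):

```python
# Same estimate as before, but with flops/token in the 1e12-1e13 range.
CHIP_FLOPS_PER_S = 3e15
ACTIVE_USERS = 10e6
TOKENS_PER_S_PER_USER = 10  # one token per 100ms, as in the original estimate

for flops_per_token in (1e12, 1e13):
    tokens_per_s_per_chip = CHIP_FLOPS_PER_S / flops_per_token
    chips_needed = ACTIVE_USERS * TOKENS_PER_S_PER_USER / tokens_per_s_per_chip
    # ~33,000 chips at 1e12 flops/token, ~330,000 at 1e13 (vs. 100 in the original estimate)
    print(f"{flops_per_token:.0e} flops/token -> ~{chips_needed:,.0f} chips")
```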
More generally, I think expecting a similar amount of money to be spent on inference as on training is broadly reasonable. So, if a future powerful model is trained for $1 billion, then spending $1 million to design custom inference chips is fine (though I expect the design cost would be higher than this in practice).