The lower bound of “memory bandwidth vs. model size” is effectively equivalent to assuming that the batch size is a single token. I think this isn’t at all close to realistic operating conditions and thus won’t be a very tight lower bound. (Or reflect the most important bottlenecks.)
I think that the KV cache for a single sequence won’t be larger than the model weights for realistic workloads, so the lower bound should still be a valid lower bound. (Though not a tight one.)
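For concreteness, here is a rough sketch of that arithmetic. All shapes and hardware numbers below are illustrative assumptions (a 70B-class dense model with grouped-query attention on an H100-class GPU), not figures from the post:

```python
# Sketch of the "memory bandwidth vs. model size" lower bound, which corresponds
# to batch size 1: each decoded token streams all weights (plus that sequence's
# KV cache) from HBM once. All numbers are illustrative assumptions.

params = 70e9            # assumed dense model size (parameters)
bytes_per_param = 2      # fp16/bf16 weights
hbm_bandwidth = 3.35e12  # assumed HBM bandwidth in bytes/s (roughly one H100)

weight_bytes = params * bytes_per_param

# Single-sequence KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes * seq_len,
# using Llama-2-70B-like shapes with grouped-query attention.
layers, kv_heads, head_dim, seq_len = 80, 8, 128, 4096
kv_cache_bytes = 2 * layers * kv_heads * head_dim * bytes_per_param * seq_len

# Lower bound on per-sequence decode speed: one full pass over weights + KV per token.
tokens_per_s = hbm_bandwidth / (weight_bytes + kv_cache_bytes)

print(f"weights: {weight_bytes / 1e9:.0f} GB, single-sequence KV cache: {kv_cache_bytes / 1e9:.1f} GB")
print(f"bandwidth-bound lower bound: ~{tokens_per_s:.0f} tokens/s per sequence")
```

With these assumptions the single-sequence KV cache (~1.3 GB) is far smaller than the weights (~140 GB), which is why the bound stays valid even though it is loose.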
I think the bottom line number you provide for “rough estimate of actual throughput” ends up being pretty reasonable for output tokens and considerably too low for input tokens. (I think input tokens probably get more like 50% or 75% FLOP utilization rather than 15%. See also the difference between input and output token prices for Anthropic’s models.)
That said, aggregating the lower and upper bounds you have doesn’t seem like a good mechanism for estimating throughput, since the lower bound doesn’t have much correspondence with actual bottlenecks. (For instance, this lower bound would miss that Mamba would get much higher throughput.)
I also think that insofar as you care about factors of 3-5 on inference efficiency, you need to do separate analyses for input tokens and output tokens.
(I also think that input tokens get pretty close to the pure FLOP estimate. So, another estimation approach, if you don’t care about factors of 5, is to take the pure FLOP estimate and then halve it to account for other slowdowns. I think this estimate gets input tokens basically right and is wrong by a factor of 3-5 for output tokens.)
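As a sketch of what that rule of thumb looks like next to the output-token estimate (the model size and GPU FLOP rate are assumed illustrative numbers; 2 * params FLOPs per token is the usual forward-pass approximation):

```python
# Sketch of the rule-of-thumb estimates above, with illustrative (assumed) numbers.

params = 70e9                 # assumed dense model size (parameters)
peak_flops = 1e15             # assumed peak matmul throughput in FLOP/s
flops_per_token = 2 * params  # standard forward-pass approximation

pure_flop_tokens_per_s = peak_flops / flops_per_token

# Input (prefill) tokens: halve the pure-FLOP number to account for other slowdowns.
input_tokens_per_s = 0.5 * pure_flop_tokens_per_s

# Output (decode) tokens: utilization closer to ~15%, i.e. another 3-5x below that.
output_tokens_per_s = 0.15 * pure_flop_tokens_per_s

print(f"pure FLOP bound:       ~{pure_flop_tokens_per_s:,.0f} tokens/s")
print(f"input-token estimate:  ~{input_tokens_per_s:,.0f} tokens/s")
print(f"output-token estimate: ~{output_tokens_per_s:,.0f} tokens/s")
```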
It seems like your actual mechanism for estimating the utilization on output tokens was to take the number from SemiAnalysis and extrapolate it to other GPUs. (At least, the number matches this?) This does seem like a reasonable approach, but it isn’t particularly tethered to your lower bound.
I agree the lower bound for output isn’t very tight. I’d be very interested to hear other simple rules of thumb you could use to provide a tighter one.
I’ll add a note to the section on input tokens that since processing them doesn’t involve repeatedly re-reading the KV cache, it’s possible to get much closer to the upper bound.