This is overall output throughput, not latency (which would be output tokens per second for a single context).
"a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second. This throughput […]"
This just claims that you can run a bunch of parallel instances of R1.
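A minimal sketch of the distinction, with an entirely hypothetical batch size (the NVIDIA claim doesn't specify one): aggregate throughput gets divided across however many contexts the server is serving in parallel.

```python
# Illustrative numbers only; concurrent_requests is a hypothetical assumption,
# not something stated in the quoted NVIDIA claim.

aggregate_throughput = 3872   # tokens/sec summed over every context on the server
concurrent_requests = 64      # hypothetical number of parallel R1 contexts being served

# Overall output throughput: what the quoted figure measures.
print(f"aggregate: {aggregate_throughput} tok/s")

# Per-context generation rate (what a single user experiences) is much lower
# once the hardware is shared, assuming roughly even scheduling across requests.
per_context = aggregate_throughput / concurrent_requests
print(f"per context ({concurrent_requests}-way batching): {per_context:.1f} tok/s")
```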
https://artificialanalysis.ai/leaderboards/providers claims that Cerebras achieves that order-of-magnitude performance, for a single prompt, for 70B-parameter models. So nothing as smart as R1 is currently that fast, but some smart things come close.
Yeah, I just found a Cerebras post which claims 2,100 serial tokens/sec.
Oops, bamboozled. Thanks, I’ll look into it more and edit accordingly.