This is overall output throughput, not latency (which would be output tokens per second for a single context).
"a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second. This throughput […]"
This just claims that you can run a bunch of parallel instances of R1.
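A minimal sketch of the distinction, with an entirely hypothetical batch size (the NVIDIA claim doesn't specify one): aggregate throughput gets divided across however many contexts the server is serving in parallel.

```python
# Illustrative numbers only; concurrent_requests is a hypothetical assumption,
# not something stated in the quoted NVIDIA claim.

aggregate_throughput = 3872   # tokens/sec summed over every context on the server
concurrent_requests = 64      # hypothetical number of parallel R1 contexts being served

# Overall output throughput: what the quoted figure measures.
print(f"aggregate: {aggregate_throughput} tok/s")

# Per-context generation rate (what a single user experiences) is much lower
# once the hardware is shared, assuming roughly even scheduling across requests.
per_context = aggregate_throughput / concurrent_requests
print(f"per context ({concurrent_requests}-way batching): {per_context:.1f} tok/s")
```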
https://artificialanalysis.ai/leaderboards/providers claims that Cerebras achieves that order-of-magnitude performance, for a single prompt, for 70B-parameter models. So nothing as smart as R1 is currently that fast, but some smart things come close.
Yeah, I just found a Cerebras post which claims 2,100 serial tokens/sec.
Oops, bamboozled. Thanks, I’ll look into it more and edit accordingly.