Nathan Helm-Burger comments on Nathan Helm-Burger’s Shortform

Nathan Helm-Burger 3 Feb 2025 19:00 UTC
0 points
0
[Edit 2: faaaaaaast. https://x.com/jrysana/status/1902194419190706667 ] [Edit: Please also see Nick’s reply below for ways in which this framing lacks nuance and may be misleading if taken at face value.]

https://blogs.nvidia.com/blog/deepseek-r1-nim-microservice/

The DeepSeek-R1 NIM microservice can deliver up to 3,872 tokens per second on a single NVIDIA HGX H200 system.

[Edit: that’s throughput including parallel batches, not serial speed! Sorry, my mistake.

Here’s a claim from Cerebras of 2100 tokens/sec serial speed on Llama 70B. https://cerebras.ai/blog/cerebras-inference-3x-faster]

Let’s say a fast human can type around 80 words per minute. A rough average token conversion is 0.75 words per token. Let’s call that 110 tokens/min, or around 2 tokens/sec. Speed typists can do more than this, but that isn’t generating novel text; it’s just copying and maxing out the physical motions.

That gives us a speed factor of about 1000x for AI serial speed over human [Cerebras 2100 tokens/sec vs. roughly 2 tokens/sec].
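A minimal sketch of that arithmetic in Python. The function names are hypothetical, and the inputs (80 words/min, 0.75 words/token, 2100 tokens/sec) are just the rough figures quoted above, not measurements.

```python
def human_tokens_per_sec(words_per_min: float = 80.0,
                         words_per_token: float = 0.75) -> float:
    """Convert a typing speed in words/min to tokens/sec."""
    tokens_per_min = words_per_min / words_per_token
    return tokens_per_min / 60.0


def speed_factor(ai_tokens_per_sec: float, human_tps: float) -> float:
    """Ratio of AI serial generation speed to human typing speed."""
    return ai_tokens_per_sec / human_tps


human_tps = human_tokens_per_sec()       # assumed fast typist, per the text
factor = speed_factor(2100.0, human_tps)  # Cerebras serial-speed claim

print(f"human: {human_tps:.2f} tokens/sec")
print(f"AI/human speed factor: {factor:.0f}x")
```

Without rounding the human rate up to 2 tokens/sec, the factor comes out nearer 1180x, which is still about 1000x as an order-of-magnitude figure.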

I wonder if there are slowmo videos I can find of humans moving at the speed an AI would perceive them given current tech. I’ve seen lots of slowmo videos, but haven’t tried specifically matching [1:1000].

After briefly looking into this: a typical way to describe it takes 24 frames per second as the baseline playback speed. Recording at a higher frame rate and playing back at 24 fps gives a slow-motion video. In these terms, 1000 fps is approximately a 42x slowdown (1000 / 24 ≈ 41.7). A 1000x slowdown requires approximately 24,000 fps.
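The frame-rate conversion can be sketched the same way, assuming the 24 fps playback baseline described above. The names are hypothetical; the slow-motion factor is just recording rate divided by playback rate.

```python
BASELINE_FPS = 24.0  # assumed playback rate, per the convention above


def slowdown_factor(recording_fps: float,
                    playback_fps: float = BASELINE_FPS) -> float:
    """How many times slower footage appears when played back."""
    return recording_fps / playback_fps


def recording_fps_for(slowdown: float,
                      playback_fps: float = BASELINE_FPS) -> float:
    """Recording frame rate needed to reach a target slowdown."""
    return slowdown * playback_fps


print(f"1000 fps -> {slowdown_factor(1000.0):.1f}x slower")
print(f"1000x slower -> {recording_fps_for(1000.0):.0f} fps")
```

So matching a 1:1000 ratio means looking for footage shot at roughly 24,000 fps and played back at 24 fps.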
- JBlack 4 Feb 2025 1:31 UTC
  4 points
  2
  Parent
  Let’s say a fast human can type around 80 words per minute. A rough average token conversion is 0.75 tokens per word. Lets call that 110 tokens/sec.
  Isn’t that 110 tokens/min, or about 2 tokens/sec? (I think the tokens/word might be words/token, too)
  - Nathan Helm-Burger 4 Feb 2025 5:16 UTC
    2 points
    0
    Parent
    Oops, yes.
- Nick_Tarleton 3 Feb 2025 19:19 UTC
  4 points
  0
  Parent
  I don’t see how it’s possible to make a useful comparison this way; human and LLM ability profiles, and just the nature of what they’re doing, are too different. An LLM can one-shot tasks that a human would need non-typing time to think about, so in that sense this underestimates the difference, but on a task that’s easy for a human but the LLM can only do with a long chain of thought, it overestimates the difference.
  
  Put differently: the things that LLMs can do with one shot and no CoT imply that they can do a whole lot of cognitive work in a single forward pass, maybe a lot more than a human can ever do in the time it takes to type one word. But that cognitive work doesn’t compound like a human’s; it has to pass through the bottleneck of a single token, and be substantially repeated on each future token (at least without modifications like Coconut).
  
  (Edit: The last sentence isn’t quite right — KV caching means the work doesn’t have to all be recomputed, though I would still say it doesn’t compound.)
  - Nathan Helm-Burger 3 Feb 2025 19:21 UTC
    2 points
    0
    Parent
    Yeah, of course. Just trying to get some kind of rough idea of what point future systems will be starting from.
    - Nick_Tarleton 3 Feb 2025 19:31 UTC
      4 points
      0
      Parent
      I don’t think it’s an outright meaningless comparison, but I think it’s bad enough that it feels misleading or net-negative-for-discourse to describe it the way your comment did. Not sure how to unpack that feeling further.
      - Nathan Helm-Burger 3 Feb 2025 19:37 UTC
        4 points
        0
        Parent
        Well, I upvoted your comment, which I think adds important nuance. I will also edit my shortform to explicitly say to check your comment. Hopefully, the combination of the two is not too misleading. Please add more thoughts as they occur to you about how better to frame this.
- ryan_greenblatt 3 Feb 2025 19:07 UTC
  4 points
  0
  Parent
  This is overall output throughput, not latency (which would be output tokens per second for a single context).
  
  a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second. This throughput
  
  This just claims that you can run a bunch of parallel instances of R1.
  - Nick_Tarleton 3 Feb 2025 19:25 UTC
    4 points
    2
    Parent
    https://artificialanalysis.ai/leaderboards/providers claims that Cerebras achieves that OOM performance, for a single prompt, for 70B-parameter models. So nothing as smart as R1 is currently that fast, but some smart things come close.
    - Nathan Helm-Burger 3 Feb 2025 19:30 UTC
      2 points
      0
      Parent
      Yeah, I just found a Cerebras post which claims 2100 serial tokens/sec.
  - Nathan Helm-Burger 3 Feb 2025 19:11 UTC
    2 points
    0
    Parent
    Oops, bamboozled. Thanks, I’ll look into it more and edit accordingly.