I would expect that with increased model size it will be possible to greatly increase the attention field without much need for additional AI insight.
It’s not model size/parameters, it’s the cost of the self-attention at runtime. The number of parameters needed to expand self-attention grows linearly, but the runtime memory consumption goes up quadratically with the context length. Even a GPT-2-117M can use something like 300GB of RAM if you increase the window to 30k. You need more efficient attention or alternative architectures.
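To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch (not from the thread): it counts only the raw attention-score matrices for GPT-2-117M’s published shape (12 layers, 12 heads), and assumes every layer’s matrices are materialized at once in fp32, e.g. when activations are kept for a backward pass.

```python
# Back-of-the-envelope: memory for the raw (n_ctx x n_ctx) attention-score
# matrices alone, assuming all layers/heads are materialized simultaneously
# (as when keeping activations for backprop). Weights and other activations
# are ignored.

def attention_score_gib(n_ctx, n_layers=12, n_heads=12, bytes_per_elem=4):
    """Defaults follow GPT-2-117M (12 layers, 12 heads); fp32 scores."""
    return n_layers * n_heads * n_ctx ** 2 * bytes_per_elem / 2 ** 30

for n_ctx in (1024, 2048, 30_000):
    print(f"n_ctx={n_ctx:>6}: ~{attention_score_gib(n_ctx):,.1f} GiB")
```

Under these assumptions, about 0.6 GiB at a 1k window becomes roughly 480 GiB at a 30k window in fp32 (around half that in fp16), the same ballpark as the ~300GB figure above.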
What exactly is 30k? When I try to calculate the value for GPT-3, it seems to me like 96 * 96 * 128 = 1,179,648 (≈1,179k) is the resulting value for the 350GB model.
As explained in the link, that is the size of the context window; past 30k, even TPU pod RAM is too small to run the 117M model, as the RAM usage continues to explode quadratically with wider context windows.
I’m not sure what your calculation is supposed to be.
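For what it’s worth, one guess at what that product matches (this reading is mine, not the thread’s): 96 × 96 × 128 equals GPT-3 175B’s layer count times head count times head dimension, i.e. a shape of the weights, whereas the 30k (and GPT-3’s own 2048) are context lengths measured in BPE tokens.

```python
# GPT-3 175B shape from the "Language Models are Few-Shot Learners" paper.
# Reading 96*96*128 as n_layers * n_heads * d_head is my assumption.
n_layers, n_heads, d_head = 96, 96, 128
d_model = n_heads * d_head              # 12288
print(n_layers * n_heads * d_head)      # 1179648 == n_layers * d_model, a weight-shape quantity
n_ctx = 2048                            # GPT-3's actual context window, in BPE tokens
```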
The unit for your 30k seems to be BPEs (byte pair encodings).
I found on https://www.gwern.net/GPT-3#dialogue: “The first limit is that it remains hobbled by the limited context window. GPT-3 has no form of memory or recurrence, so it cannot see anything outside its limited 2048 BPEs.”
If GPT-2 could have a context window of 30k BPEs with 300GB of RAM, could GPT-3 also have such a context window length? That is, could its window be made 15 times as big as it currently is?
If you tweaked GPT-3 (let’s assume the total parameter count remained the same so layers were made a little narrower or somesuch) to have a 30k BPE context, I think the RAM requirements would explode to the point where even the small layers couldn’t fit their forward pass onto a single GPU. You can forget about training it too.
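Repeating the same rough accounting for a single GPT-3-sized layer at a 30k window suggests why (dense attention and fp16 scores are my assumptions; a standard implementation materializes the full score tensor for a layer during the forward pass):

```python
# Attention-score memory for a single GPT-3-sized layer (96 heads) at a
# 30k-token context, with naive dense attention and fp16 scores.
n_heads, n_ctx, bytes_per_elem = 96, 30_000, 2
per_layer_gib = n_heads * n_ctx ** 2 * bytes_per_elem / 2 ** 30
print(f"~{per_layer_gib:.0f} GiB for one layer's scores")  # ~161 GiB
```

Roughly 161 GiB for one layer’s score matrices alone, before counting any weights or other activations, which is already well past the memory of any single GPU of that era.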