As explained in the link, that is the size of the context window; past 30k, even TPU pod RAM is too small to run the 117M model with wider context windows, as the RAM usage continues to explode quadratically.
I’m not sure what your calculation is supposed to be.
The first limit is that it remains hobbled by the limited context window. GPT-3 has no form of memory or recurrence, so it cannot see anything outside its limited 2048 BPEs.
The unit for your 30k seems to be BPEs (byte-pair encodings).
I found this on https://www.gwern.net/GPT-3#dialogue:
If GPT-2 could have a context window of 30k BPEs with 300 GB of RAM, could GPT-3 also have such a context window length? So it could be made 15 times as large as it currently is?
If you tweaked GPT-3 (let’s assume the total parameter count remained the same so layers were made a little narrower or somesuch) to have a 30k BPE context, I think the RAM requirements would explode to the point where even the small layers couldn’t fit their forward pass onto a single GPU. You can forget about training it too.
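To see why the RAM requirement blows up, here is a back-of-envelope sketch of the quadratic scaling. It counts only the L×L attention-score matrices of a vanilla transformer forward pass; the layer/head counts are assumptions loosely based on GPT-2 117M (12 layers, 12 heads), not figures from the thread, and real training needs several times more for the backward pass and other activations.

```python
# Illustrative estimate (my own, not from the thread): memory held by the
# L x L attention-score matrices alone, per example, in fp32.
def attn_score_bytes(seq_len, n_layers=12, n_heads=12, bytes_per_float=4):
    """Each layer stores n_heads matrices of shape (seq_len, seq_len)."""
    return n_layers * n_heads * seq_len ** 2 * bytes_per_float

GIB = 1024 ** 3
for L in (1024, 2048, 30_000):
    print(f"context {L:>6}: {attn_score_bytes(L) / GIB:8.2f} GiB")
```

Doubling the context quadruples this term, so going from 2048 to 30k BPEs multiplies it by roughly (30000/2048)² ≈ 215, taking it from a couple of GiB to hundreds of GiB per example, which is consistent with the ~300 GB figure in the question.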