As explained in the link, that is the size of the context window; past 30k, even TPU pod RAM is too small to run the 117M model with wider context windows, as the RAM usage continues to explode quadratically.
I’m not sure what your calculation is supposed to be.
The first limit is that it remains hobbled by the limited context window. GPT-3 has no form of memory or recurrence, so it cannot see anything outside its limited 2048 BPEs.
The unit for your 30k seems to be BPEs (byte-pair encodings).
I found this on https://www.gwern.net/GPT-3#dialogue:
If GPT-2 could have a context window of 30k BPEs with 300 GB of RAM, could GPT-3 also have such a context window length? So it could be made 15 times as large as it currently is?
If you tweaked GPT-3 (let’s assume the total parameter count remained the same so layers were made a little narrower or somesuch) to have a 30k BPE context, I think the RAM requirements would explode to the point where even the small layers couldn’t fit their forward pass onto a single GPU. You can forget about training it too.
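To see why the RAM requirement blows up, here is a back-of-envelope sketch of the quadratic scaling. It counts only the L×L attention-score matrices of a vanilla transformer forward pass; the layer/head counts are assumptions loosely based on GPT-2 117M (12 layers, 12 heads), not figures from the thread, and real training needs several times more for the backward pass and other activations.

```python
# Illustrative estimate (my own, not from the thread): memory held by the
# L x L attention-score matrices alone, per example, in fp32.
def attn_score_bytes(seq_len, n_layers=12, n_heads=12, bytes_per_float=4):
    """Each layer stores n_heads matrices of shape (seq_len, seq_len)."""
    return n_layers * n_heads * seq_len ** 2 * bytes_per_float

GIB = 1024 ** 3
for L in (1024, 2048, 30_000):
    print(f"context {L:>6}: {attn_score_bytes(L) / GIB:8.2f} GiB")
```

Doubling the context quadruples this term, so going from 2048 to 30k BPEs multiplies it by roughly (30000/2048)² ≈ 215, taking it from a couple of GiB to hundreds of GiB per example, which is consistent with the ~300 GB figure in the question.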