I would expect that with increased model size it will be possible to greatly increase the attention field without much need for additional AI insight.
It’s not model size/parameters, it’s the cost of the self-attention at runtime. The number of parameters needed to expand self-attention grows linearly, but the runtime memory consumption goes up quadratically with the context length. Even a GPT-2-117M can use something like 300GB of RAM if you increase the window to 30k. You need more efficient attention or alternative architectures.
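To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch (not from the thread): it counts only the raw attention-score matrices for GPT-2-117M’s published shape (12 layers, 12 heads), and assumes every layer’s matrices are materialized at once in fp32, e.g. when activations are kept for a backward pass.

```python
# Back-of-the-envelope: memory for the raw (n_ctx x n_ctx) attention-score
# matrices alone, assuming all layers/heads are materialized simultaneously
# (as when keeping activations for backprop). Weights and other activations
# are ignored.

def attention_score_gib(n_ctx, n_layers=12, n_heads=12, bytes_per_elem=4):
    """Defaults follow GPT-2-117M (12 layers, 12 heads); fp32 scores."""
    return n_layers * n_heads * n_ctx ** 2 * bytes_per_elem / 2 ** 30

for n_ctx in (1024, 2048, 30_000):
    print(f"n_ctx={n_ctx:>6}: ~{attention_score_gib(n_ctx):,.1f} GiB")
```

Under these assumptions, about 0.6 GiB at a 1k window becomes roughly 480 GiB at a 30k window in fp32 (around half that in fp16), the same ballpark as the ~300GB figure above.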
What exactly is 30k? When I try to calculate the value for GPT-3, it seems to me like 96 * 96 * 128 = 1,179,648 (≈1,179k) is the resulting value for the 350GB model.
As explained in the link, that is the size of the context window; past 30k, even TPU pod RAM is too small to run the 117M model, as the RAM usage continues to explode quadratically with wider context windows.
I’m not sure what your calculation is supposed to be.
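For what it’s worth, one guess at what that product matches (this reading is mine, not the thread’s): 96 × 96 × 128 equals GPT-3 175B’s layer count times head count times head dimension, i.e. a shape of the weights, whereas the 30k (and GPT-3’s own 2048) are context lengths measured in BPE tokens.

```python
# GPT-3 175B shape from the "Language Models are Few-Shot Learners" paper.
# Reading 96*96*128 as n_layers * n_heads * d_head is my assumption.
n_layers, n_heads, d_head = 96, 96, 128
d_model = n_heads * d_head              # 12288
print(n_layers * n_heads * d_head)      # 1179648 == n_layers * d_model, a weight-shape quantity
n_ctx = 2048                            # GPT-3's actual context window, in BPE tokens
```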
The unit for your 30k seems to be BPEs (byte pair encodings).
I found on https://www.gwern.net/GPT-3#dialogue: “The first limit is that it remains hobbled by the limited context window. GPT-3 has no form of memory or recurrence, so it cannot see anything outside its limited 2048 BPEs.”
If GPT-2 could have a context window of 30k BPEs with 300GB of RAM, could GPT-3 also have such a context window length? That is, could its window be made 15 times as big as it currently is?
If you tweaked GPT-3 (let’s assume the total parameter count remained the same so layers were made a little narrower or somesuch) to have a 30k BPE context, I think the RAM requirements would explode to the point where even the small layers couldn’t fit their forward pass onto a single GPU. You can forget about training it too.
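Repeating the same rough accounting for a single GPT-3-sized layer at a 30k window suggests why (dense attention and fp16 scores are my assumptions; a standard implementation materializes the full score tensor for a layer during the forward pass):

```python
# Attention-score memory for a single GPT-3-sized layer (96 heads) at a
# 30k-token context, with naive dense attention and fp16 scores.
n_heads, n_ctx, bytes_per_elem = 96, 30_000, 2
per_layer_gib = n_heads * n_ctx ** 2 * bytes_per_elem / 2 ** 30
print(f"~{per_layer_gib:.0f} GiB for one layer's scores")  # ~161 GiB
```

Roughly 161 GiB for one layer’s score matrices alone, before counting any weights or other activations, which is already well past the memory of any single GPU of that era.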