Gordon Seidoh Worley answers To what extent are the scaling properties of Transformer networks exceptional?

Gordon Seidoh Worley 29 Jul 2020 2:23 UTC
LW: 2 AF: 1
Most systems eventually face scaling bottlenecks. In fact, unless your system is completely free of coordination, it definitely has bottlenecks, even if you haven’t scaled large enough to hit them. Transformers definitely require some coordination: no matter how large the models are and how much parallelism their hardware supports, they still produce a single reduced output. So we should expect some scaling limits on Transformers that, at some size, will prevent them from effectively taking advantage of a larger network.
Further, and you point at this a bit, most systems also experience diminishing returns on performance for additional resources because of these constraints.
Transformers may just be special in that they have yet to start hitting diminishing returns, because we haven’t yet run up against their coordination bottlenecks. That doesn’t make them too special, though: we should expect those bottlenecks to be lying in wait somewhere, just as they are in every other system that is not coordination-free.
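
To make the diminishing-returns point concrete, here is a minimal sketch using Amdahl's law, a standard model of parallel speedup. It is only an analogy for the argument above, not a model of Transformer training or of how loss scales with parameters. The function name `amdahl_speedup` and the 5% serial (coordination) fraction are hypothetical values chosen for illustration.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Speedup over a single worker under Amdahl's law.

    serial_fraction is the share of the work that must be coordinated
    (done serially) no matter how many workers are available.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical assumption: 5% of the work requires coordination.
SERIAL_FRACTION = 0.05

if __name__ == "__main__":
    for workers in (1, 10, 100, 1_000, 10_000):
        speedup = amdahl_speedup(SERIAL_FRACTION, workers)
        print(f"{workers:>6} workers -> {speedup:.2f}x speedup")
```

Under this assumption the speedup can never exceed 1 / 0.05 = 20x, however many workers are added. A system with a much smaller coordination fraction would look like it scales almost linearly for a long time before the same ceiling shows up, which is the situation Transformers might currently be in.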