[Question] Probability that other architectures will scale as well as Transformers?

GPT-1, 2, and 3 have shown impressive scaling properties. How likely is it that, in the next five years, many other architectures will also be shown to get substantially better as they get bigger?

EDIT: I am open to discussion of better definitions of the scaling hypothesis. For example, maybe Gwern means something different by it, in which case I'm also interested in that.
