[Question] What happens to variance as neural network training is scaled? What does it imply about “lottery tickets”?

Daniel Kokotajlo asks whether the lottery ticket hypothesis implies the scaling hypothesis.

The way I see it, this depends on the distribution that "lottery tickets" are being drawn from.

  • If the quality of lottery tickets follows a normal distribution, then after your neural network is large enough to sample decent tickets, it will get better rather slowly as you scale it: you have to sample a whole lot of tickets to get a really good one.

  • If the quality of tickets has a long upward tail, then you'll see better scaling.

However, a long tail also suggests to me that variance in results would continue to be relatively high as a network is scaled: bigger networks are hitting bigger jackpots, but since even bigger jackpots are within reach, the payoff of scaling remains chaotic.
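Here is a minimal simulation of that intuition. The "ticket quality" distributions are purely illustrative assumptions (standard normal for the thin-tailed case, lognormal for the long-tailed one), not anything measured from real networks; the toy model is just "a network of size n keeps the best of n i.i.d. draws":

```python
import numpy as np

rng = np.random.default_rng(0)

def best_ticket_stats(sampler, n, trials=500):
    """Mean and standard deviation of the best of n ticket draws,
    estimated over many independent trials."""
    draws = sampler((trials, n))      # trials x n quality samples
    best = draws.max(axis=1)          # best ticket in each trial
    return best.mean(), best.std()

for n in (100, 10_000):
    # Thin-tailed tickets: normal distribution
    nm, ns = best_ticket_stats(lambda size: rng.standard_normal(size), n)
    # Long-tailed tickets: lognormal distribution
    hm, hs = best_ticket_stats(lambda size: rng.lognormal(size=size), n)
    print(f"n={n:>6}  normal best: {nm:.2f} (std {ns:.2f})  "
          f"lognormal best: {hm:.1f} (std {hs:.1f})")
```

In the normal case, scaling n by 100x moves the best ticket only slightly and the spread shrinks; in the lognormal case the best ticket keeps improving substantially with n and the spread stays large relative to the mean, matching the "payoff of scaling remains chaotic" picture.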

(This could all benefit from a more mathematical treatment.)
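One standard way to sharpen the contrast, under the (itself debatable) assumption that ticket qualities are i.i.d. draws, comes from extreme value theory:

```latex
% Best of n i.i.d. standard normal draws: only logarithmic growth.
\mathbb{E}\Big[\max_{i \le n} X_i\Big] \approx \sqrt{2 \ln n}

% Best of n i.i.d. draws with a Pareto tail, P(X > x) = x^{-\alpha}:
% polynomial growth in n.
\max_{i \le n} X_i \sim n^{1/\alpha}
```

The fluctuations behave differently too: in the normal case the maximum concentrates as n grows, while in the Pareto case the fluctuations of the maximum stay of the same order as the maximum itself, which is exactly the "variance stays high under scaling" behavior described above.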

So: what do we know about NN training? Does it suggest we are living in Extremistan or Mediocristan?

Note: a major conceptual difficulty in answering this question is representing NN quality in the right units. For example, an accuracy metric, which necessarily falls between 0% and 100%, must yield "diminishing returns" and cannot host a "long-tailed distribution". Take that same metric and send it through an inverse sigmoid, and now you might not have diminishing returns, and could have a long-tailed distribution. But we can transform data all day; the analysis shouldn't be too ad hoc. So it's not immediately clear how to measure this.
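To make the transform concrete, here is the logit (the inverse of the standard sigmoid) applied to a few accuracy values. Each additional "nine" of accuracy is a roughly constant step of about ln(10) ≈ 2.3 on the logit scale, so gains that look like diminishing returns in raw accuracy need not be diminishing at all after the transform:

```python
import math

def logit(p):
    """Inverse sigmoid: maps an accuracy in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

# Accuracy gains that look tiny near 100% are even steps in logit space.
for acc in (0.90, 0.99, 0.999):
    print(f"accuracy {acc:.3f} -> logit {logit(acc):.2f}")
```

This is why the choice of units matters for the Extremistan-vs-Mediocristan question: the same training results can look bounded with diminishing returns on one scale and unbounded on another.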