Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output be a linear combination of the two algorithms, and up-regulate one while down-regulating the other:
a⋅cascade_multiplier(x,y)+(1−a)⋅Karatsuba(x,y)
And a is changed from 1 to 0.
The network needs to be large enough that the algorithms don't share parameters, so that changing one doesn't affect the performance of the other. I do think this is just a toy example; it seems pretty unlikely that two multiplication algorithms would spontaneously develop simultaneously.
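For concreteness, here's a minimal sketch of the toy example in ordinary code (the implementations are my own stand-ins; in a real network both routines would be learned circuits rather than hand-written functions):

```python
# Hypothetical sketch: two stand-in multiplication routines and the blending weight a
# from the formula above.

def cascade_multiplier(x: int, y: int) -> int:
    """Schoolbook ('cascade') multiplication: sum x shifted by each digit of y."""
    result, shift = 0, 0
    while y > 0:
        result += (y % 10) * x * (10 ** shift)
        shift += 1
        y //= 10
    return result

def karatsuba(x: int, y: int) -> int:
    """Karatsuba multiplication: three recursive sub-multiplications instead of four."""
    if x < 10 or y < 10:
        return x * y
    n = max(len(str(x)), len(str(y))) // 2
    hi_x, lo_x = divmod(x, 10 ** n)
    hi_y, lo_y = divmod(y, 10 ** n)
    z0 = karatsuba(lo_x, lo_y)
    z2 = karatsuba(hi_x, hi_y)
    z1 = karatsuba(lo_x + hi_x, lo_y + hi_y) - z0 - z2
    return z2 * 10 ** (2 * n) + z1 * 10 ** n + z0

def blended(x: int, y: int, a: float) -> float:
    """a * cascade_multiplier(x, y) + (1 - a) * Karatsuba(x, y)."""
    return a * cascade_multiplier(x, y) + (1 - a) * karatsuba(x, y)

# Sweeping a from 1 to 0: because both routines compute the same function,
# the blended output (and hence the loss) is unchanged at every step.
for a in (1.0, 0.75, 0.5, 0.25, 0.0):
    assert abs(blended(1234, 5678, a) - 1234 * 5678) < 1e-6
```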
> As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output be a linear combination of the two algorithms, and up-regulate one while down-regulating the other:
Sure, but that's not moving from A to B. That's pruning from A+B to B. …which, now that I think about it, is effectively just a restatement of the Lottery Ticket Hypothesis[1].
Hm. I wonder if the Lottery Ticket Hypothesis holds for grok'd networks?
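To spell out the pruning framing with a purely hypothetical toy (not anything from the paper): in a two-branch model blended by a, once a has been driven to 0, every parameter that only feeds the A branch can be masked out without changing the output, so what survives is just the B sub-network.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)

# Two branches with disjoint parameters, blended by a (as in the toy example above).
W_A = rng.normal(size=(4, 8))   # stand-in for the branch being down-regulated
W_B = rng.normal(size=(4, 8))   # stand-in for the branch being up-regulated

def output(W_A, W_B, x, a):
    return a * (W_A @ x) + (1 - a) * (W_B @ x)

# With a = 0, pruning branch A (masking all of its weights to zero) is a no-op:
# the pruned network computes the same function with half the parameters.
assert np.allclose(output(W_A, W_B, x, a=0.0),
                   output(np.zeros_like(W_A), W_B, x, a=0.0))
```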
> Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
I can see how the initial parameters are independent. After a significant amount of training though...?
[1] Frankle & Carbin, "The Lottery Ticket Hypothesis": https://arxiv.org/abs/1803.03635v1 etc.