Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output be a linear combination of the two algorithms, and up-regulate one while down-regulating the other:
a⋅cascade_multiplier(x,y)+(1−a)⋅Karatsuba(x,y)
And a is changed from 1 to 0.
The network needs to be large enough that the algorithms don't share parameters, so that changing one doesn't affect the performance of the other. I do think this is just a toy example; it seems pretty unlikely that two multiplication algorithms would spontaneously develop simultaneously.
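For concreteness, here's a minimal sketch of the toy example in ordinary code (the implementations are my own stand-ins; in a real network both routines would be learned circuits rather than hand-written functions):

```python
# Hypothetical sketch: two stand-in multiplication routines and the blending weight a
# from the formula above.

def cascade_multiplier(x: int, y: int) -> int:
    """Schoolbook ('cascade') multiplication: sum x shifted by each digit of y."""
    result, shift = 0, 0
    while y > 0:
        result += (y % 10) * x * (10 ** shift)
        shift += 1
        y //= 10
    return result

def karatsuba(x: int, y: int) -> int:
    """Karatsuba multiplication: three recursive sub-multiplications instead of four."""
    if x < 10 or y < 10:
        return x * y
    n = max(len(str(x)), len(str(y))) // 2
    hi_x, lo_x = divmod(x, 10 ** n)
    hi_y, lo_y = divmod(y, 10 ** n)
    z0 = karatsuba(lo_x, lo_y)
    z2 = karatsuba(hi_x, hi_y)
    z1 = karatsuba(lo_x + hi_x, lo_y + hi_y) - z0 - z2
    return z2 * 10 ** (2 * n) + z1 * 10 ** n + z0

def blended(x: int, y: int, a: float) -> float:
    """a * cascade_multiplier(x, y) + (1 - a) * Karatsuba(x, y)."""
    return a * cascade_multiplier(x, y) + (1 - a) * karatsuba(x, y)

# Sweeping a from 1 to 0: because both routines compute the same function,
# the blended output (and hence the loss) is unchanged at every step.
for a in (1.0, 0.75, 0.5, 0.25, 0.0):
    assert abs(blended(1234, 5678, a) - 1234 * 5678) < 1e-6
```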
> As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output be a linear combination of the two algorithms, and up-regulate one while down-regulating the other:
Sure, but that's not moving from A to B. That's pruning from A+B to B. …which, now that I think about it, is effectively just a restatement of the Lottery Ticket Hypothesis[1].
Hm. I wonder if the Lottery Ticket Hypothesis holds for grok'd networks?
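To spell out the pruning framing with a purely hypothetical toy (not anything from the paper): in a two-branch model blended by a, once a has been driven to 0, every parameter that only feeds the A branch can be masked out without changing the output, so what survives is just the B sub-network.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)

# Two branches with disjoint parameters, blended by a (as in the toy example above).
W_A = rng.normal(size=(4, 8))   # stand-in for the branch being down-regulated
W_B = rng.normal(size=(4, 8))   # stand-in for the branch being up-regulated

def output(W_A, W_B, x, a):
    return a * (W_A @ x) + (1 - a) * (W_B @ x)

# With a = 0, pruning branch A (masking all of its weights to zero) is a no-op:
# the pruned network computes the same function with half the parameters.
assert np.allclose(output(W_A, W_B, x, a=0.0),
                   output(np.zeros_like(W_A), W_B, x, a=0.0))
```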
> Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
I can see how the initial parameters are independent. After a significant amount of training though...?
[1] Frankle & Carbin, "The Lottery Ticket Hypothesis": https://arxiv.org/abs/1803.03635v1 etc.