TLW comments on Hypothesis: gradient descent prefers general circuits

TLW 15 Feb 2022 3:08 UTC
Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
I can see how the initial parameters are independent. After a significant amount of training though...?