I think it’s possible that gradient descent works by applying a selection pressure to preexisting circuits already present in the random initialization, with some fine-tuning on top. This would explain why most weights are zero after training, as well as phenomena like the lottery ticket hypothesis.
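Roughly the lottery-ticket-style procedure I have in mind: train, prune the smallest-magnitude weights, rewind the survivors to their initial random values, and retrain. A minimal sketch, assuming PyTorch and a toy regression task (the model, task, and all numbers here are purely illustrative, not the original experiments):

```python
# Minimal lottery-ticket-style sketch (assumptions: PyTorch, a toy MLP,
# synthetic data): train, prune low-magnitude weights, rewind to init, retrain.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
init_state = copy.deepcopy(model.state_dict())  # save the random init

x = torch.randn(1024, 32)
y = x.sum(dim=1, keepdim=True)  # toy regression target

def train(net, steps=200, masks=None):
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        F.mse_loss(net(x), y).backward()
        opt.step()
        if masks is not None:  # keep pruned weights pinned at zero
            with torch.no_grad():
                for name, p in net.named_parameters():
                    if name in masks:
                        p *= masks[name]

train(model)

# Prune: keep the largest ~20% of each weight matrix by trained magnitude.
masks = {}
for name, p in model.named_parameters():
    if p.dim() > 1:  # prune weight matrices only, as is common
        k = int(0.8 * p.numel())
        thresh = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > thresh).float()

# Rewind surviving weights to their original random values, then retrain
# the sparse subnetwork (the "winning ticket").
model.load_state_dict(init_state)
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p *= masks[name]
train(model, masks=masks)
```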
As far as I know, the claim that most weights are zero after training is just false, though?
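For what it’s worth, this is easy to check directly: count the fraction of exactly-zero (or near-zero) weights after ordinary unregularized training. A minimal sketch, again assuming PyTorch and a toy setup (all names illustrative); under plain SGD with no sparsity regularizer, weights generally don’t land exactly at zero:

```python
# Quick empirical check (assumptions: PyTorch, a toy MLP, synthetic data):
# measure how many weights end up at or near zero after ordinary SGD.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(1024, 32)
y = x.sum(dim=1, keepdim=True)  # toy regression target

for _ in range(200):
    opt.zero_grad()
    F.mse_loss(model(x), y).backward()
    opt.step()

w = torch.cat([p.detach().flatten() for p in model.parameters()])
print(f"fraction exactly zero: {(w == 0).float().mean().item():.4f}")
print(f"fraction |w| < 1e-3:   {(w.abs() < 1e-3).float().mean().item():.4f}")
```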