I found this post interesting, especially the first part, but extremely difficult to understand. I believe some of the analogies might be valuable, but it’s simply too hard for me to confirm or disconfirm most of them. Here are some (but far from all!) examples:
1. About local optimizers. I didn’t understand this section at all! Are you claiming that gradient descent isn’t a local optimizer? Or are you claiming that neural networks can implement mesa-optimizers? Or something else?
2. The analogy to Bayesian reasoning feels forced and unrelated to your other points in the Bayes section. Moreover, Bayesian statistics typically doesn’t work (it’s inconsistent) when you ignore the normalizing constant. And in the case of neural networks, what is your prior? Unless you’re thinking about approximate priors using weight decay, most neural networks do not employ priors on their parameters.
3. In your linear model, you seem to interpret the maximum likelihood estimator of the parameters as a Bayesian estimator. Am I on the right track here?
4. Building on your linear toy model, it is natural to understand the weight decay parameters as priors, as that is what they are. (In an exact sense: with L2 weight decay you’re looking at ridge regression, which is the MAP estimate for a linear regression with normal priors on the parameters; L1 weight decay corresponds to Laplace priors, etc.) But you don’t do that. In what sense is it true that “the bayesian prior could be encoded purely in the initial weight distribution”? What’s more, it seems to me you’re thinking about the learning rate as your prior. I think this has something to do with your interpretation of the linear model maximum likelihood estimator as a Bayesian procedure...?
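To make point 4 concrete, here is a minimal sketch of what I mean by L2 weight decay being a normal prior. It is my own illustration, not something taken from the post: the data are synthetic, and the names `sigma` (noise standard deviation), `tau` (prior standard deviation) and `lam` (ridge penalty) are assumptions I introduce for the example. With a prior w ~ N(0, tau² I) and Gaussian noise, the ridge solution with `lam = sigma² / tau²` should coincide with the posterior mean (which is also the MAP estimate), while the plain maximum likelihood estimate corresponds to no penalty at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = X @ w_true + Gaussian noise
n, d = 50, 3
sigma = 0.5  # noise standard deviation, assumed known
tau = 2.0    # prior standard deviation: w ~ N(0, tau^2 I)
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + sigma * rng.normal(size=n)

# Ridge regression: minimize ||y - X w||^2 + lam * ||w||^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Posterior mean (= MAP) under the Gaussian prior and Gaussian likelihood
posterior_precision = X.T @ X / sigma**2 + np.eye(d) / tau**2
w_map = np.linalg.solve(posterior_precision, X.T @ y / sigma**2)

# Maximum likelihood (ordinary least squares): no penalty, i.e. a flat prior
w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

print("ridge == MAP:", np.allclose(w_ridge, w_map))
print("||w_ridge|| < ||w_mle||:", np.linalg.norm(w_ridge) < np.linalg.norm(w_mle))
```

Nothing in this sketch involves the initial weights or the learning rate; the prior enters only through the penalty `lam`, which is why I would expect weight decay, rather than initialization, to play the role of the prior in the toy model.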