Hmmm… I don’t know if that’s how I would describe what’s happening. I would say:
The above post provides empirical evidence that there isn’t much difference between the generalization performance of “doing SGD on DNNs until you get some level of performance” and “randomly sampling DNN weights until you get some level of performance.”
I find that result difficult to reconcile with both theoretical and empirical arguments for why SGD should behave differently from random sampling, such as the experiments I ran and linked above.
The answer to this question, one way or another, is important for understanding the DNN prior, where it’s coming from, and what sorts of things it’s likely to incentivize—e.g. is there a simplicity bias which might incentivize mesa-optimization or incentivize pseudo-alignment?
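To pin down what is being compared in that first point, here is a minimal toy sketch of the two procedures, assuming a tiny numpy MLP on a hypothetical halfplane dataset (none of the specifics below come from the OP's actual experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 20 points labeled by a halfplane (purely illustrative).
X = rng.normal(size=(20, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def forward(params, X):
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    return 1 / (1 + np.exp(-(h @ W2 + b2)))

def accuracy(params, X, y):
    return np.mean((forward(params, X) > 0.5) == y)

def random_params():
    return [rng.normal(size=(2, 8)), rng.normal(size=8),
            rng.normal(size=8), rng.normal()]

# Procedure A: rejection-sample random weights until train accuracy hits the target.
def sample_until(target=0.9, max_tries=200_000):
    for _ in range(max_tries):
        p = random_params()
        if accuracy(p, X, y) >= target:
            return p
    return None

# Procedure B: full-batch gradient descent on logistic loss until the same target.
def sgd_until(target=0.9, lr=0.3, steps=10_000):
    p = random_params()
    for _ in range(steps):
        if accuracy(p, X, y) >= target:
            break
        W1, b1, W2, b2 = p
        h = np.tanh(X @ W1 + b1)
        out = 1 / (1 + np.exp(-(h @ W2 + b2)))
        d_out = out - y                        # dBCE/dlogit, per sample
        dW2 = h.T @ d_out / len(y)
        db2 = d_out.mean()
        d_h = np.outer(d_out, W2) * (1 - h ** 2)
        dW1 = X.T @ d_h / len(y)
        db1 = d_h.mean(axis=0)
        p = [W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2]
    return p

# Held-out data: the question is whether the two procedures generalize differently.
X_test = rng.normal(size=(500, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(float)

p_rand = sample_until()
p_sgd = sgd_until()
print("random-sampling test acc:", accuracy(p_rand, X_test, y_test))
print("SGD test acc:", accuracy(p_sgd, X_test, y_test))
```

The sketch is only meant to make the two procedures precise; the post's empirical claim is that, at a much larger scale, their generalization performance is about the same.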
OK, thanks! The first bullet point is, I think, a summary of my first two bullet points, but with different emphasis. I’ll check out the experiments you linked.
I’m curious to know what you mean by “focus on the architecture instead.” My guess is that if the OP is right, pretty much any neural net architecture will have a simplicity bias.
It seems clearly true to me that neural net architecture is a huge contributor to inductive biases and comes with a strong simplicity bias. What’s surprising is that I would have expected the same to be true of SGD, and yet these results seem to indicate that SGD vs. random search has only a pretty minimal effect on inductive biases.
I’m having trouble parsing your first sentence—are you saying that yes, pretty much any neural net architecture will have a simplicity bias, but also that the biases will be importantly different depending on which architecture you pick?
I think I would have predicted that SGD vs. random search would have a pretty minimal effect on inductive biases. My pet theory for why neural nets have a bias towards simplicity is that there are more ways for a neural net to encode a simple function than a complex one, i.e. simple functions occupy larger regions of parameter space. If this is right, then (as the OP argues, I think) it makes sense that SGD and random search don’t affect the bias much, since larger regions of parameter space will also have larger basins of attraction for SGD to roll down. (As for the justification of my pet theory: it’s really sketchy, but see my top-level comment below.)
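The "larger regions of parameter space for simple functions" picture can be illustrated in a toy setting: sample random small nets, read off which Boolean function of 3 inputs each one implements, and count how often each function appears. This is a hypothetical sketch (the net size, weight distribution, and sample count are my choices, not taken from the linked experiments):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# All 8 inputs of a 3-bit Boolean function, encoded in {-1, +1}.
inputs = np.array([[2 * ((i >> b) & 1) - 1 for b in range(3)]
                   for i in range(8)], dtype=float)

# Sample random 3-5-1 tanh nets and record the Boolean function each induces.
counts = Counter()
for _ in range(50_000):
    W1 = rng.normal(size=(3, 5))
    b1 = rng.normal(size=5)
    W2 = rng.normal(size=5)
    out = np.tanh(inputs @ W1 + b1) @ W2 > 0
    counts["".join(str(int(v)) for v in out)] += 1

most = counts.most_common()
print("distinct functions found:", len(counts), "of", 2 ** 8)
print("most common:", most[:3])
print("least common:", most[-3:])
```

If the sampled distribution over functions comes out highly non-uniform, with a few functions (e.g. the constants) claiming far more than their uniform share of probability mass, that is the volume picture: functions realized by more weight settings get sampled more often, and on this theory should likewise have larger basins of attraction for SGD.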