I think there’s a lot going on with your conflating the speed prior over circuits with a speed prior over programs.
I think a lot of the ideas in this direction are either confused by the difference between circuit priors and program priors, or at least treat the two as equivalent. Unfortunately, a lot of this stays vague until you specify the domain of models, and I think specifying this more clearly will help communication about these ideas. To start doing this myself: when I talk about circuit induction, I’m talking about things that look like large, randomly initialized Bayes nets (or deep neural networks).
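To make the domain concrete, here is a minimal sketch (in numpy; the function names are mine, purely for illustration) of the kind of object I have in mind: a deep network with randomly drawn weights, where “induction” means adjusting those weights rather than searching over program strings.

```python
import numpy as np

def init_circuit(widths, rng, scale=1.0):
    """Randomly initialize a deep 'circuit': one weight matrix per layer,
    drawn i.i.d. Gaussian (this is the initialization distribution discussed below)."""
    return [rng.normal(0.0, scale / np.sqrt(n_in), size=(n_in, n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def run_circuit(weights, x):
    """Evaluate the circuit: a fixed sequence of matmuls and ReLUs.
    'Circuit induction' means adjusting `weights`, not writing a program."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(h @ W, 0.0)   # hidden layers
    return h @ weights[-1]           # linear readout

rng = np.random.default_rng(0)
circuit = init_circuit([16, 256, 256, 1], rng)
print(run_circuit(circuit, rng.normal(size=(4, 16))).shape)  # -> (4, 1)
```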
Program Induction Priors are bad: I would claim that program induction priors (the simplicity prior, the speed prior, and others) are almost always a bad fit for developing useful intuitions about the behavior of large random Bayes-net machines.
Confusion between circuit induction speed prior and simplicity prior: I think your point about double descent is wrong. In particular, speed is largely unchanged in double descent experiments, since the *width* is the parameter being varied, and all deep neural networks of the same depth have approximately the same speed (unless you mean something weird by speed).
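As a toy illustration of this point (my own sketch, not anything from the original post): sweeping width, as double descent experiments do, blows up the parameter count while leaving the number of sequential layers, which is the natural runtime measure for a parallel circuit, completely fixed.

```python
def mlp_stats(width, depth, d_in=32, d_out=1):
    """Parameter count vs. sequential depth for a plain MLP."""
    widths = [d_in] + [width] * depth + [d_out]
    n_params = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
    seq_depth = len(widths) - 1   # number of sequential matmul layers
    return n_params, seq_depth

# Sweep width as in double descent experiments: "size" explodes, "speed" (depth) does not.
for width in (8, 64, 512, 4096):
    n_params, seq_depth = mlp_stats(width, depth=4)
    print(f"width={width:5d}  params={n_params:9d}  sequential_depth={seq_depth}")
```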
Circuit Simplicity: You give circuit-size and circuit-depth as examples of a “speed prior”, which seems pretty nonstandard, especially when describing it as “not the simplicity prior”. Circuit-size is the usual circuit-complexity notion of simplicity, while circuit-depth is the measure that behaves like (parallel) runtime, so lumping them together as “speed” blurs exactly the distinction at issue.
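To spell out the distinction, here is a made-up toy circuit where the two measures come apart: size (total gate count) tracks simplicity, while depth (longest input-to-output path) tracks parallel runtime.

```python
# A toy DAG circuit: each gate lists its inputs; names starting with "x" are circuit inputs.
gates = {
    "a": ["x0", "x1"],
    "b": ["x2", "x3"],
    "c": ["a", "b"],
    "d": ["c", "x4"],
    "out": ["d", "a"],
}

def circuit_size(gates):
    """Simplicity-style measure: total number of gates."""
    return len(gates)

def circuit_depth(gates, node):
    """Speed-style measure: longest input-to-output path, i.e. how many
    sequential steps a fully parallel evaluation needs."""
    if node not in gates:  # a raw input
        return 0
    return 1 + max(circuit_depth(gates, parent) for parent in gates[node])

print(circuit_size(gates))           # 5 gates
print(circuit_depth(gates, "out"))   # 4 sequential levels
```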
More than Speed and Simplicity: I think there are other metrics that provide interesting priors over circuits, like likelihood under some initialization distribution. In particular, I think “likelihood under the initialization distribution” is the prior that matters most, until we develop techniques that let us “hack the prior”.
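As a sketch of what I mean by this prior (my own toy formalization, assuming an i.i.d. Gaussian initialization), the “likelihood under the initialization distribution” of a particular circuit is just the product of the init densities over its weights:

```python
import numpy as np

def init_log_prior(weights, scale=1.0):
    """Log-likelihood of a circuit's weights under a Gaussian init distribution:
    log p(W) = sum_w log N(w; 0, scale^2 / fan_in)."""
    total = 0.0
    for W in weights:
        var = scale ** 2 / W.shape[0]  # fan-in-scaled variance, as at initialization
        total += np.sum(-0.5 * W ** 2 / var - 0.5 * np.log(2 * np.pi * var))
    return total

rng = np.random.default_rng(0)
typical = [rng.normal(0, 1 / np.sqrt(16), size=(16, 16)) for _ in range(3)]
atypical = [W * 10 for W in typical]  # a circuit that looks hand-tuned rather than random
print(init_log_prior(typical))        # relatively high log-prior
print(init_log_prior(atypical))       # far lower: very unlikely under initialization
```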
Connection to Infinite-Size Neural Networks: I think research on neural networks approaching/at the infinite-width limit looks a lot like physics research on black holes, and it can similarly tell us interesting things about the dynamics we should expect. In particular, for systems optimized by gradient descent, we end up with infinitesimal/nonexistent feature learning in the limit, which is interesting because the sub-modules/sub-circuits we start with are all we’ll ever have! This means that even if there are “simple” or “fast” circuits, if they’re not likely under the initialization distribution, then we expect them to have a vanishingly small effect on the output. (One way of thinking about this is in terms of the NTK: even if we have extremely predictive modules, their predictive power will be overwhelmed by the much more common, simpler features.)
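One way to make the NTK intuition concrete (a sketch of the lazy/linearized regime, not a claim about the exact infinite-width dynamics): approximate training as fitting a linear readout on top of the features the network has at initialization. Anything not already expressed in those init-time features never becomes available to gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, width, n = 8, 2048, 200

# Features at initialization: in the lazy/NTK-style regime these are (approximately)
# frozen, and training only re-weights them.
W0 = rng.normal(0, 1 / np.sqrt(d_in), size=(d_in, width))
def init_features(x):
    return np.maximum(x @ W0, 0.0)

X = rng.normal(size=(n, d_in))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)  # toy target

# "Training" = least-squares fit of a linear readout on the frozen init-time features.
Phi = init_features(X)
readout, *_ = np.linalg.lstsq(Phi, y, rcond=None)

preds = init_features(X) @ readout
print("train mse:", np.mean((preds - y) ** 2))
# Whatever structure is absent from the init-time features cannot appear in `preds`:
# gradient descent in this regime never builds new sub-circuits.
```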
Hacking the Prior: Right now we don’t have a good understanding of the behavior of partially hand-coded neural networks, but I think they could serve as a new/distinct class of models (with regard to which functions are likely under the initialization distribution). Concretely, this could look like us “hand-programming” circuits or parts of neural networks, then randomly initializing the rest, and seeing whether the model learns to use those programmed functions during training.
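A minimal version of this experiment might look like the following (all names and the target function are hypothetical, just to show the shape of the test): hand-code one sub-circuit that computes something useful, surround it with ordinary random features, train a readout, and check how much weight lands on the programmed part.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, width, n = 4, 64, 500

X = rng.normal(size=(n, d_in))
y = X[:, 0] * X[:, 1]  # a target that plain random ReLU features fit poorly

# Hand-programmed sub-circuit: a feature computing exactly x0 * x1.
def programmed_feature(x):
    return (x[:, 0] * x[:, 1])[:, None]

# The rest of the "circuit" is ordinary randomly initialized features.
W0 = rng.normal(0, 1 / np.sqrt(d_in), size=(d_in, width))
def features(x):
    return np.concatenate([programmed_feature(x), np.maximum(x @ W0, 0.0)], axis=1)

# Train a linear readout and ask whether it learns to *use* the programmed part.
Phi = features(X)
readout, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("weight on programmed feature:", readout[0])
print("mean |weight| on random features:", np.mean(np.abs(readout[1:])))
```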