Zach Furman
Learning coefficient estimation: the details
In the cybersecurity analogy, it seems like there are two distinct scenarios being conflated here:
1) Person A says to Person B, “I think your software has X vulnerability in it.” Person B says, “This is a highly specific scenario, and I suspect you don’t have enough evidence to come to that conclusion. In a world where X vulnerability exists, you should be able to come up with a proof-of-concept, so do that and come back to me.”
2) Person B says to Person A, “Given XYZ reasoning, my software almost certainly has no critical vulnerabilities of any kind. I’m so confident, I give it a 99.99999%+ chance.” Person A says, “I can’t specify the exact vulnerability your software might have without it in front of me, but I’m fairly sure this confidence is unwarranted. In general it’s easy to underestimate how your security story can fail under adversarial pressure. If you want, I could name X hypothetical vulnerability, but this isn’t because I think X will actually be the vulnerability, I’m just trying to be illustrative.”
Story 1 seems to be the case where “POC or GTFO” is justified. Story 2 seems to be the case where “security mindset” is justified.
It’s very different to suppose a particular vulnerability exists (not just as an example, but as the scenario that will happen), than it is to suppose that some vulnerability exists. Of course in practice someone simply saying “your code probably has vulnerabilities,” while true, isn’t very helpful, so you may still want to say “POC or GTFO”—but this isn’t because you think they’re wrong, it’s because they haven’t given you any new information.
Curious what others have to say, but it seems to me like this post is more analogous to story 2 than story 1.
Since nobody here has made the connection yet, I feel obliged to write something, late as I am.
To make the problem more tractable, suppose we restrict our set of coordinate changes to ones where the resulting functions can still (approximately) be written as a neural network. (These are usually called “reparameterizations.”) This occurs when multiple neural networks implement (approximately) the same function; they’re redundant. One trivial example of this is the invariance of ReLU networks to scaling one layer by a constant, and the next layer by the inverse of that constant.
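To make the rescaling example concrete, here is a minimal sketch in plain Python. The network shape, the random weights, and the constant `c` are all made up for illustration; the only fact it relies on is that ReLU satisfies relu(c·z) = c·relu(z) for c > 0.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def two_layer_net(W1, W2, x):
    """Bias-free two-layer ReLU network: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

rng = random.Random(0)
# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
W1 = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
W2 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]
x = [rng.gauss(0, 1) for _ in range(3)]

c = 2.5  # must be positive: relu(c * z) == c * relu(z) only holds for c > 0
W1_scaled = [[c * w for w in row] for row in W1]
W2_scaled = [[w / c for w in row] for row in W2]

original = two_layer_net(W1, W2, x)
rescaled = two_layer_net(W1_scaled, W2_scaled, x)
max_diff = max(abs(a - b) for a, b in zip(original, rescaled))
print(max_diff)  # zero up to floating-point rounding
```

The two parameter settings are different points in weight space but implement the same function, which is exactly the redundancy being described.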
Then, in the language of parametric statistics, this phenomenon has a name: non-identifiability! Lucky for us, there’s a decent chunk of literature on identifiability in neural networks out there. At first glance, we have what seems like a somewhat disappointing result: ReLU networks are identifiable up to permutation and rescaling symmetries.
But there’s a catch—this is only true except for a set of measure zero. (The other catch is that the results don’t cover approximate symmetries.) This is important because there are reasons to suggest real neural networks are pushed close to this set during training. This set of measure zero corresponds to “reducible” or “degenerate” neural networks—those that can be expressed with fewer parameters. And hey, funny enough, aren’t neural networks quite easily pruned?
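As a hedged illustration of what "can be expressed with fewer parameters" might look like, here is one simple kind of reducibility (certainly not the only one): two hidden units with identical incoming weights and bias can be merged into a single unit whose outgoing weight is the sum. The specific numbers below are invented for the example.

```python
def relu(z):
    return max(0.0, z)

def net(hidden, x):
    """hidden: list of (w_in, b, w_out) triples for a 1D-input, 1D-output ReLU net."""
    return sum(w_out * relu(w_in * x + b) for w_in, b, w_out in hidden)

# Units 0 and 1 share the same incoming weight and bias, so they are redundant.
three_units = [(1.5, -0.2, 0.7), (1.5, -0.2, -0.3), (-2.0, 1.0, 1.1)]
# Merge them into one unit whose outgoing weight is the sum 0.7 + (-0.3).
two_units = [(1.5, -0.2, 0.7 + -0.3), (-2.0, 1.0, 1.1)]

xs = [k / 10 - 3 for k in range(61)]  # grid on [-3, 3]
max_gap = max(abs(net(three_units, x) - net(two_units, x)) for x in xs)
print(max_gap)  # zero up to floating-point rounding
```

Here the outgoing weights 0.7 and -0.3 can be traded off against each other along a whole line without changing the function, which goes beyond the permutation and rescaling symmetries.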
In other parts of the literature, this problem has been phrased differently, under the framework of “structure-function symmetries” or “canonicalization.” It’s also often covered when discussing the concepts of “inverse stability” and “stable recovery.” For more on this, including a review of the literature, I highly recommend Matthew Farrugia-Roberts’ excellent master’s thesis on the topic.
(Separately, I’m currently working on the issue of coordinate-free sparsity. I believe I have a solution to this—stay tuned, or reach out if interested.)
I can’t speak for Richard, but I think I have a similar issue with NTK and adjacent theory as it currently stands (beyond the usual issues). I’m significantly more confident in a theory of deep learning if it cleanly and consistently explains (or better yet, predicts) unexpected empirical phenomena. The one that sticks out most prominently in my mind, that we see constantly in interpretability, is this strange correspondence between the algorithmic “structure” we find in trained models (both ML and biological!) and “structure” in the data generating process.
That training on Othello move sequences gets you an algorithmic model of the game itself is surprising from most current theoretical perspectives! So in that sense I might be suspicious of a theory of deep learning that fails to “connect our understanding of neural networks to our understanding of the real world”. This is the single most striking thing to come out of interpretability, in my opinion, and I’m worried about a “deep learning theory of everything” if it doesn’t address this head on.
That said, NTK doesn’t promise to be a theory of everything, so I don’t mean to hold it to an unreasonable standard. It does what it says on the tin! I just don’t think it’s explained a lot of the remaining questions I have. I don’t think we’re in a situation where “we can explain 80% of a given model’s behavior with the NTK” or similar. And this is relevant for e.g. studying inductive biases, as you mentioned.
But I strong upvoted your comment, because I do think deep learning theory can fill this gap—I’m personally trying to work in this area. There are some tractable-looking directions here, and people shouldn’t neglect them!
Someone with better SLT knowledge might want to correct this, but more specifically:
Studying the “volume scaling” of near-min-loss parameters, as beren does here, is really core to SLT. The rate of change of this volume as you change your epsilon loss tolerance is called the “density of states” (DOS) function, and much of SLT basically boils down to an asymptotic analysis of this function. It also relates the terms in the asymptotic expansion to things you care about, like generalization performance.
You might wonder why SLT needs so much heavy machinery, since this sounds so simple—and it’s basically because SLT can handle the case where the eigenvalues of the Hessian are zero, and the usual formula breaks down. This is actually important in practice, since IIRC real models often have around 90% zero eigenvalues in their Hessian. It also leads to substantially different theory—for instance the “effective number of parameters” (RLCT) can vary depending on the dataset.
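A toy sketch of the volume-scaling picture, under assumptions of my own choosing: two hand-picked 2D losses, a uniform prior on [-1, 1]^2, and a crude Monte Carlo fit of the exponent in V(ε) ∝ ε^λ. It is not how learning coefficients are estimated for real models. The quadratic loss has a nondegenerate Hessian at its minimum, so the usual d/2 = 1 answer applies; the quartic loss has a vanishing Hessian there and a smaller exponent.

```python
import math
import random

def volume_exponent(loss, epsilons, n_samples=200_000, seed=0):
    """Estimate lambda in V(eps) ~ eps**lambda by Monte Carlo on [-1, 1]^2."""
    rng = random.Random(seed)
    samples = [loss(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_samples)]
    log_eps, log_vol = [], []
    for eps in epsilons:
        # Fraction of parameter space with loss below the tolerance eps.
        frac = sum(1 for value in samples if value < eps) / n_samples
        log_eps.append(math.log(eps))
        log_vol.append(math.log(frac))
    # Least-squares slope of log V against log eps.
    mean_x = sum(log_eps) / len(log_eps)
    mean_y = sum(log_vol) / len(log_vol)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_eps, log_vol))
    den = sum((x - mean_x) ** 2 for x in log_eps)
    return num / den

epsilons = [10 ** e for e in (-2.0, -1.75, -1.5, -1.25, -1.0)]
regular = volume_exponent(lambda a, b: a**2 + b**2, epsilons)   # nonzero Hessian at 0
singular = volume_exponent(lambda a, b: a**4 + b**4, epsilons)  # Hessian vanishes at 0
print(round(regular, 2), round(singular, 2))
```

The fitted exponents should land near 1 for the quadratic loss and near 1/2 for the quartic one: same parameter count, different "effective" count. Losses like (ab)^2 also pick up a logarithmic factor in V(ε), which this simple slope fit ignores.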
Exponential growth is a fairly natural thing to expect here, roughly for the same reason that vanishing/exploding gradients happen (input/output sensitivity is directly related to param/output sensitivity). Based on this hypothesis, I’m preregistering the prediction that (all other things equal) the residual stream in post-LN transformers will exhibit exponentially shrinking norms, since it’s known that post-LN transformers are more sensitive to vanishing gradient problems compared to pre-LN ones.
Edit: On further thought, I still think this intuition is correct, but I expect the prediction is wrong—the notion of relative residual stream size in a post-LN transformer is a bit dubious, since the size of the residual stream is entirely determined by the layer norm constants, which are a bit arbitrary because they can be rolled into other weights. I think the proper prediction is more around something like Lyapunov exponents.
No substantive reply, but I do want to thank you for commenting here—original authors publicly responding to analysis of their work is something I find really high value in general. Especially academics that are outside the usual LW/AF sphere, which I would guess you are given your account age.
Neural network polytopes (Colab notebook)
A possible counterpoint, that you are mostly advocating for awareness as opposed to specific points, is null, since pretty much everyone is aware of the problem now: society as a whole, policymakers in particular, and people in AI research and alignment.
I think this specific point is false, especially outside of tech circles. My experience has been that while people are concerned about AI in general, and very open to X-risk when they hear about it, there is zero awareness of X-risk beyond popular fiction. It’s possible that my sample isn’t representative here, but I would expect that to swing in the other direction, given that the folks I interact with are often well-educated New-York-Times-reading types, who are going to be more informed than average.
Even among those aware, there’s also a difference between far-mode “awareness” in the sense of X-risk as some far away academic problem, and near-mode “awareness” in the sense of “oh shit, maybe this could actually impact me.” Hearing a bunch of academic arguments, but never seeing anybody actually getting fired up or protesting, will implicitly cause people to put X-risk in the first bucket. Because if they personally believed it to be a big near-term risk, they’d certainly be angry and protesting, and if other people aren’t, that’s a signal other people don’t really take it seriously. People sense a missing mood here and update on it.
A bit of a side note, but I don’t even think you need to appeal to new architectures—it looks like the NTK approximation performs substantially worse even with just regular MLPs (see this paper, among others).
Yeah, I can expand on that. This is obviously going to be fairly opinionated, but there are a few things I’m excited about in this direction.
The first thing that comes to mind here is singular learning theory. I think all of my thoughts on DL theory are fairly strongly influenced by it at this point. It definitely doesn’t have all the answers at the moment, but it’s the single largest theory I’ve found that makes deep learning phenomena substantially “less surprising” (bonus points for these ideas preceding deep learning). For instance, one of the first things that SLT tells you is that the effective parameter count (RLCT) of your model can vary depending on the training distribution, allowing it to basically do internal model selection—the absence of bias-variance tradeoff, and the success of overparameterized models, aren’t surprising when you internalize this. The “connection to real world structure” aspect hasn’t been fully developed here, but it seems heavily suggested by the framework, in multiple ways—for instance, hierarchical statistical models are naturally singular statistical models, and the hierarchical structure is reflected in the singularities. (See also Tom Waring’s thesis).
Outside of SLT, there’s a few other areas I’m excited about—I’ll highlight just one. You mentioned Lin, Tegmark, and Rolnick—the broader literature on depth separations and the curse of dimensionality seems quite important. The approximation abilities of NNs are usually glossed over with universal approximation arguments, but this can’t be enough—for generic Lipschitz functions, universal approximation takes exponentially many parameters in the input dimension (this is a provable lower bound). So there has to be something special about the functions we care about in the real world. See this section of my post for more information. I’d highlight Poggio et al. here, which is the paper in the literature closest to my current view on this.
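For a rough feel of the exponential dependence on input dimension, here is a crude counting heuristic, not the lower-bound proof: if you approximate a function on the unit cube by storing one value per grid cell, the number of cells grows as resolution^d. The function name and the resolution of 10 cells per axis are illustrative assumptions.

```python
def cells_needed(dim, resolution=10):
    """Number of grid cells tiling [0, 1]^dim with `resolution` cells per axis."""
    return resolution ** dim

for dim in (1, 2, 10, 100):
    # One stored value per cell is a stand-in for "one parameter per cell".
    print(dim, cells_needed(dim))
```

Going from 2 to 100 input dimensions takes the count from 10^2 to 10^100 at the same per-axis resolution, which is why something beyond generic Lipschitz structure has to be doing the work.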
This isn’t a complete list, even of theoretical areas that I think could specifically help address the “real world structure” connection, but these are the two I’d feel bad not mentioning. This doesn’t include some of the more empirical findings in science of DL that I think are relevant, like simplicity bias, mode connectivity, grokking, etc. Or work outside DL that could be helpful to draw on, like Boolean circuit complexity, algorithmic information theory, natural abstractions, etc.
For anyone who wants to play around with this themselves, you might be interested in a small Colab notebook I made, with some interactive 2D and 3D plots.
This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is really needed to make this possible in a safe(ish) way. (To verify the shard does develop, isn’t misaligned, etc.) We’d just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.
My summary (endorsed by Jesse):
1. ERM can be derived from Bayes by assuming your “true” distribution is close to a deterministic function plus a probabilistic error, but this fact is usually obscured
2. Risk is not a good inner product (naively) - functions with similar risk on a given loss function can be very different
3. The choice of functional norm is important, but uniform convergence just picks the sup norm without thinking carefully about it
4. There are other important properties of models/functions than just risk
5. Learning theory has failed to find tight (generalization) bounds, and bounds might not even be the right thing to study in the first place
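As a hedged illustration of point 1, here is the standard textbook special case (Gaussian noise and squared loss are my choice of example, not necessarily the one Jesse has in mind). The symbols f_w, σ, and D = {(x_i, y_i)} are notation introduced here.

```latex
\begin{aligned}
y_i &= f_w(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
-\log p(D \mid w) &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f_w(x_i)\bigr)^2 + \frac{n}{2} \log(2\pi\sigma^2) \\
\hat{w}_{\mathrm{MAP}} &= \arg\max_w \, p(w \mid D) = \arg\min_w \Bigl[ -\log p(D \mid w) - \log p(w) \Bigr]
\end{aligned}
```

With a flat prior the last line reduces to minimizing the sum of squared errors, i.e. empirical risk minimization under squared loss. The "deterministic function plus probabilistic error" assumption is the first line.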
I don’t think the game is an alarming capability gain at all; I agree with LawrenceC’s comment below. It’s more of a “gain-of-function research” scenario to me. Like, maybe we shouldn’t deliberately try to train a model to be good at this? If you’ve ever played Diplomacy, you know the whole point of the game is manipulating and backstabbing your way to world domination. I think it’s great that the research didn’t actually seem to come up with any scary generalizable techniques or dangerous memetics, but I think ideally we shouldn’t even be trying in the first place.
Dropping some late answers here—though this isn’t my subfield, so forgive me if I mess things up here.
Correct me if I’m wrong, but it struck me while reading this that you can think of a neural network as learning two things at once:
a classification of the input into 2^N different classes (where N is the total number of neurons), each of which gets a different function applied to it
those functions themselves
This is exactly what a spline is! This is where the spline view of neural networks comes from (mentioned in Appendix C of the post). What you call “classes” the literature typically calls the “partition.” Also, while deep networks can theoretically have exponentially many elements in the partition (w.r.t. the number of neurons), in practice, they instead are closer to linear.
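A small sketch of the "partition" idea, under heavy simplifying assumptions: a hypothetical one-hidden-layer ReLU network on a 1D input, with weights picked by hand. Each distinct on/off pattern of the hidden units is one element of the partition, and the network is affine within it. This only shows the easy shallow 1D case (where at most N + 1 regions are possible); it says nothing by itself about deep networks.

```python
def activation_pattern(x, weights, biases):
    """Which hidden ReLU units are active at input x (1 = on, 0 = off)."""
    return tuple(int(w * x + b > 0) for w, b in zip(weights, biases))

# Hypothetical one-hidden-layer network on a 1D input with N = 3 units.
weights = [1.0, -1.0, 2.0]
biases = [0.0, 0.5, -3.0]  # kinks at x = 0, 0.5 and 1.5

xs = [k / 100 - 2 for k in range(401)]  # grid on [-2, 2]
patterns = {activation_pattern(x, weights, biases) for x in xs}

n_units = len(weights)
print(len(patterns), "patterns realized, out of", 2 ** n_units, "conceivable")
```

Only N + 1 = 4 of the 2^3 = 8 conceivable patterns actually occur here, a toy version of the gap between the combinatorial upper bound and the partition a network actually realizes.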
Can the functions and classes be decoupled?
To my understanding this is exactly what previous (non-ML) research on splines did, with things like free-knot splines. Unfortunately this is computationally intractable. So instead much research focused on fixing the partition (say, to a uniform grid), and changing only the functions. A well-known example here is the wavelet transform. But then you lose the flexibility to change the partition—incredibly important if some regions need higher resolution than others!
From this perspective the coupling of functions to the partition is exactly what makes neural networks good approximators in the first place! It allows you to freely move the partition, like with free-knot splines, but in a way that’s still computationally tractable. Intuitively, neural networks have the ability to use high resolution where it’s needed most, like how 3D meshes of video game characters have the most polygons in their face.
How much of the power of neural networks comes from their ability to learn to classify something into exponentially many different classes vs from the linear transformations that each class implements?
There are varying answers here, depending on what you mean by “power”: I’d say either the first or neither. If you mean “the ability to approximate efficiently,” then I would probably say that the partition matters more—assuming the partition is sufficiently fine, each linear transformation only performs a “first order correction” to the mean value of the partition.
But I don’t really think this is where the “magic” of deep learning comes from. In fact this approximation property holds for all neural networks, including shallow ones. It can’t capture what I see as the most important properties, like what makes deep networks generalize well OOD. For that you need to look elsewhere. It appears like deep neural networks have an inductive bias towards simple algorithms, i.e. those with a low (pseudo) Kolmogorov complexity. (IMO, from the spline perspective, a promising direction to explain this could be via compositionality and degeneracy of spline operators.)
Hope this helps!
It’s worth noting that Jesse is mostly following the traditional “approximation, generalization, optimization” error decomposition from learning theory here—where “generalization” specifically refers to finite-sample generalization (gap between train/test loss), rather than something like OOD generalization. So e.g. a failure of transformers to solve recursive problems would be a failure of approximation, rather than a failure of generalization. Unless I misunderstood you?
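For reference, one common way of writing that decomposition, in notation assumed here rather than taken from the post: R is the population risk, f^* the best possible predictor, f_F the best predictor in the model class, f_n the empirical risk minimizer on n samples, and \hat{f} what training actually returns.

```latex
R(\hat{f}) - R(f^*)
  = \underbrace{R(f_F) - R(f^*)}_{\text{approximation}}
  + \underbrace{R(f_n) - R(f_F)}_{\text{generalization (finite-sample)}}
  + \underbrace{R(\hat{f}) - R(f_n)}_{\text{optimization}}
```

The identity is just a telescoping sum. In this framing, a model class that cannot represent the target at all shows up in the first term, regardless of how much data is available.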
Repeating a question I asked Jesse earlier, since others might be interested in the answer: how come we tend to hear more about PAC bounds than MAC bounds?
Broadly agree with this post, though I’ll nitpick the inclusion of robotics here. I don’t think it’s progressing nearly as fast as ML, and it seems fairly uncontroversial that we’re not nearly as close to human-level motor control as we are to (say) human-level writing. I only bring this up because a decent chunk of bad reasoning (usually underestimation) I see around AGI risk comes from skepticism about robotics progress, which is mostly irrelevant in my model.