Makes perfect sense, thanks!

# Vivek Hebbar

“Well, what if I take the variables that I’m given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I’m given, and consider the space of all partitions on that product of variables that I’m given; and each one of those partitions will be its own variable.

How can a partition be a variable? Should it be “part” instead?

**ETA: Koen recommends reading “Counterfactual Planning in AGI Systems” first (or instead of) “Corrigibility with Utility Preservation”.**

Update: I started reading your paper “Corrigibility with Utility Preservation”.^{[1]} My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining “superintelligent” as “optimal utility maximizer”.

Quick thoughts after reading less than half:

AFAICT,^{[2]} this is a mathematical solution to corrigibility in a toy problem, and *not* a solution to corrigibility in real systems. Nonetheless, it’s a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.^{[3]}

Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists). In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).

So looking at your claim that “the technical problem [is] mostly solved”, this may or may not be true for the narrow sense (like “corrigibility as a theoretical outer-objective problem in formally-specified environments”), but seems false and misleading for the broader practical sense (“knowing how to make an AGI corrigible in real life”).^{[4]}

Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt^{[5]}:

“In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent’s decision procedure] to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this conclusion is due to the use of a Platonic agent model.”

- ^
Btw, your writing is admirably concrete and clear.

Errata: Subscripts seem to be broken on page 9, which significantly hurts the readability of the equations. Also, there is a double typo “I this paper, we the running example of a toy universe” on page 4.

- ^
Assuming the idea is correct

- ^
Do you have an account of why MIRI’s supposed impossibility results (I think these exist?) are false?

- ^
I’m not necessarily accusing you of any error (if the contest is fixated on the utility function version), but it was misleading to me as someone who read your comment but not the contest details.

- ^
Portions in [brackets] are insertions/replacements by me


To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here? No need to write anything, just links.

Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?

My attempt at a framework where “improving one’s own evaluator” and “believing in adversarial examples to one’s own evaluator” make sense:

The agent’s allegiance is to some idealized utility function V (like CEV). The agent’s internal evaluator Eval is “trying” to approximate V by reasoning heuristically. So now we ask Eval to evaluate the plan “do argmax w.r.t. Eval over a bunch of plans”. Eval reasons that, due to the way that Eval works, there should exist “adversarial examples” that score very highly on Eval but low on V. Hence, Eval concludes that V(plan) is low, where plan = “do argmax w.r.t. Eval”. So the agent doesn’t execute the plan “search widely and argmax”.

“Improving Eval” makes sense because Eval will gladly replace itself with a new evaluator Eval′ if it believes that Eval′ is a better approximation of the idealized utility function (and hence replacing itself will cause the outcome to score better on that idealized function)
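To make the “adversarial examples to Eval” step concrete, here is a toy numerical sketch. All specifics (the plan space, the form of V, the error model, and the widths) are my own made-up illustration: Eval is modeled as V plus a persistent heuristic error, and widening the argmax search drives Eval’s score of its own winner far above anything V can actually deliver.

```python
import random

random.seed(0)

# Hypothetical toy model: plans are indices; V is the idealized utility
# (bounded in [0, 1]); Eval = V + a fixed-per-plan heuristic error.
NUM_PLANS = 200_000
true_V = [random.random() for _ in range(NUM_PLANS)]
eval_err = [random.gauss(0, 1) for _ in range(NUM_PLANS)]

def Eval(plan):
    return true_V[plan] + eval_err[plan]

def search_and_argmax(width):
    """Argmax w.r.t. Eval over `width` randomly sampled plans."""
    return max(random.sample(range(NUM_PLANS), width), key=Eval)

results = {}
for width in (10, 1_000, 100_000):
    best = search_and_argmax(width)
    results[width] = (Eval(best), true_V[best])
    print(width, round(Eval(best), 2), round(true_V[best], 2))

# Eval's estimate of its own argmax rises past the best possible V (1.0):
# wide search increasingly selects adversarial examples to Eval, which is
# exactly why Eval scores the plan "search widely and argmax" as low.
```

The design choice here is that Eval’s errors are fixed per plan (not resampled), so a wide search can actively seek them out; that is what makes “Eval distrusts its own argmax” coherent.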

Are there other distinct frameworks which make sense here? I look forward to seeing what design Alex proposes for “value child”.

Yeah, the right column should obviously be all 20s. There must be a bug in my code

^{[1]} I like to think of the argmax function as something that takes in a distribution on probability distributions (on the same underlying set, but with different sigma algebras), and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

Take the following hypothesis (hypothesis 3):

If I add this into the mixture with a small weight, then the middle column is still nearly zero. But I can now ask for the probability of the event corresponding to the center square, and I get back an answer very close to zero. Where did this confidence come from?

I guess I’m basically wondering what this procedure is aspiring to be. Some candidates I have in mind:

Extension to the coarse case of regular hypothesis mixing (where we go from P(w) and Q(w) to αP(w) + (1−α)Q(w))

Extension of some kind of Bayesian update-flavored thing where we go to P(w)·Q(w) then renormalize

ETA: seems more plausible than

Some kind of “aggregation of experts who we trust a lot unless they contradict each other”, which isn’t cleanly analogous to either of the above
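For concreteness, candidates 1 and 2 above can be contrasted on a toy example. The numbers, and the simplifying assumption that both hypotheses live on the same three outcomes, are mine; the actual merge operation from the post may be neither of these.

```python
from fractions import Fraction as F

# Two hypotheses over the same three outcomes (made-up numbers):
P = {"L": F(6, 10), "M": F(3, 10), "R": F(1, 10)}
Q = {"L": F(1, 10), "M": F(3, 10), "R": F(6, 10)}

def mixture(P, Q, a):
    """Candidate 1: regular hypothesis mixing, a*P + (1-a)*Q."""
    return {w: a * P[w] + (1 - a) * Q[w] for w in P}

def product_pool(P, Q):
    """Candidate 2: Bayesian-update-flavored P*Q, then renormalize."""
    unnorm = {w: P[w] * Q[w] for w in P}
    Z = sum(unnorm.values())
    return {w: p / Z for w, p in unnorm.items()}

print(mixture(P, Q, F(1, 2)))  # "M" keeps its weight: 3/10
print(product_pool(P, Q))      # "M" is amplified where the hypotheses agree: 3/7
```

Even on this tiny example the two operations disagree sharply about the middle cell, which is why pinning down which (if either) the coarse merge extends seems important.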

Even in case 3, the near-zeros are really weird. The only cases I can think of where it makes sense are things like “The events are outcomes of a quantum process. Physics technique 1 creates hypothesis 1, and technique 2 creates hypothesis 2. Both techniques are very accurate, and the uncertainty they express is due to fundamental unknowability. Since we know both tables are correct, we can confidently rule out the middle column, and thus rule out certain events in hypothesis 3.”

But more typically, the uncertainty is in the maps of the respective hypotheses, not in the territory, in which case the middle zeros seem unfounded. And to be clear, the reason it seems like a real issue^{[2]} is that when you add in hypothesis 3 you have events in the middle which you can query, but the values can stay arbitrarily close to zero if you add in hypothesis 3 with low weight.

Now, let’s consider the following modification: Each hypothesis is no longer a distribution on the underlying set of worlds, but instead a distribution on some coarser partition of it. The merge is still well defined

Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):

The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.

But philosophically, what does the merge operation represent, which causes this to make sense? (Maybe your reply is just “wait for the next post”)

most egregores/epistemic networks, which I’m completely reliant upon, are much smarter than me, so that can’t be right

**Egregore smiles**

Another way of looking at this question: Arithmetic rationality is shift invariant, so you don’t have to know your total balance to calculate expected values of bets. Whereas for geometric rationality, you need to know where the zero point is, since it’s not shift invariant.
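A minimal numerical illustration of the shift-(non)invariance point; the particular bet and wealth levels are my own example.

```python
import math

# A 50/50 bet: win +$10 or lose -$10.
outcomes = [+10, -10]

def ev_delta(wealth):
    """Arithmetic rationality: change in expected wealth (shift invariant)."""
    return sum(0.5 * o for o in outcomes)

def log_delta(wealth):
    """Geometric rationality: change in expected log-wealth (needs the zero point)."""
    return sum(0.5 * (math.log(wealth + o) - math.log(wealth)) for o in outcomes)

print(ev_delta(20), ev_delta(1000))    # identical regardless of total balance
print(log_delta(20), log_delta(1000))  # very different: roughly -0.144 vs -5e-05
```

The arithmetic value of the bet never looks at `wealth` at all, while the geometric value changes by orders of magnitude depending on how far the agent is from zero.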

Which is equivalent to

**Some results related to logarithmic utility and stock market leverage** (I derived these after reading your previous post, but I think it fits better here):

**Tl;dr:** We can derive the optimal stock market leverage for an agent with utility logarithmic in money. We can also back-derive a utility function from any constant leverage^{[1]}, giving us a nice class of utility functions with different levels of risk-aversion. Logarithmic utility is recovered as a special case, and has additional nice properties which the others may or may not have.

For an agent investing in a stock whose “instantaneous” price movements are i.i.d. with finite moments:

Suppose, for simplicity, that the agent’s utility function is over the *amount of money they have in the next timestep* (as opposed to more realistic cases like “amount they have 20 years from now”). If U(money) = log(money), then:

The optimal leverage for the agent to take is given by the formula k = μ/s², where μ is the mean of the “instantaneous” price movements and s is the standard deviation of the same. Derivation here. By my calculations, this implies a leverage of about 1.8 on the S&P 500.
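As a quick sanity check on k = μ/s², here is a sketch under the standard second-order (Itô-style) approximation to log-growth; the μ and s values below are made up, not actual S&P 500 statistics.

```python
# Expected log-growth rate at leverage k, to second order:
#   g(k) = k*mu - (k*s)**2 / 2,   which is maximized at k* = mu / s**2.
mu, s = 0.05, 0.17  # hypothetical mean return and volatility

def log_growth(k):
    return k * mu - (k * s) ** 2 / 2

k_grid = [i / 100 for i in range(501)]   # leverage 0.00 .. 5.00
k_best = max(k_grid, key=log_growth)
print(k_best, mu / s ** 2)               # grid argmax agrees with mu/s^2
```

Note the quadratic penalty term: doubling leverage doubles the drift but quadruples the volatility drag, which is why the optimum is finite.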

What if we instead suppose the agent prefers some constant leverage k, and try to infer its utility function?

The relevant differential equation is −m·U″(m)/U′(m) = μ/(k·s²), i.e. constant relative risk aversion.

This is solved by U(m) = m^(1−γ)/(1−γ) (with γ = μ/(k·s²)) for k ≠ μ/s², and by U(m) = log(m) for k = μ/s². You can play with the solutions here.
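A numerical spot check of the constant-leverage property (my own sketch; the expression k(m) = −U′(m)·μ/(m·U″(m)·s²) is the standard first-order condition for the locally optimal leverage, and the parameter values are arbitrary):

```python
mu, s, g = 0.05, 0.17, 2.0   # hypothetical returns; relative risk aversion g

def U(m):
    return m ** (1 - g) / (1 - g)   # CRRA utility (g != 1)

def d1(f, m, h=1e-5):
    """Central-difference first derivative."""
    return (f(m + h) - f(m - h)) / (2 * h)

def d2(f, m, h=1e-4):
    """Central-difference second derivative."""
    return (f(m + h) - 2 * f(m) + f(m - h)) / h ** 2

def optimal_k(m):
    # First-order condition for leverage: k = -U'(m)*mu / (m*U''(m)*s**2)
    return -d1(U, m) * mu / (m * d2(U, m) * s ** 2)

# The preferred leverage is the same at every wealth level: mu/(g*s**2)
print([round(optimal_k(m), 3) for m in (1.0, 5.0, 50.0)])
```

Because the preferred leverage is wealth-independent for this family, “back out the utility from a constant leverage” is self-consistent; g = 1 recovers the log case with k = μ/s².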

Now suppose instead that the agent’s utility function is “logarithmic withdrawals, time-discounted exponentially” -- U = ∫₀^∞ e^(−ρt) log(w(t)) dt, where w(t) is the absolute^{[2]} rate of withdrawal at time t. It turns out that optimal leverage is still constant, and is still given by the same formula k = μ/s². Furthermore, the optimal rate of withdrawal is a constant fraction ρ of current wealth, regardless of what happens.

Things probably don’t work out as cleanly for the non-logarithmic case.

[Disclaimer: This is not investment advice.]

- ^
Caveats:

1. This assumption of constant leverage is pretty arbitrary, so there’s no normative or descriptive force to the class of utility functions we derive from it

2. We have to make an unrealistic assumption that the utility function is over $$ at the next timestep, rather than further in the future. In the log case, these kinds of assumptions tend not to change anything, but I’m not sure whether the general case is as clean.

- ^
i.e. in dollars, not percents

Results on logarithmic utility and stock market leverage: https://www.lesswrong.com/posts/DMxe4XKXnjyMEAAGw/the-geometric-expectation?commentId=yuRie8APN8ibFmRJD

A framing I wrote up for a debate about “alignment tax”:

**“Alignment isn’t solved” regimes:**

1a. Nobody knows how to make an AI which is {safe, general, and broadly superhuman}, with any non-astronomical amount of compute

1b. We know how to make an aligned AGI with 2 to 25 OOMs more compute than making an unaligned one

**“Alignment tax” regimes:**

2a. We can make an aligned AGI, but it requires a compute overhead in the range 1% − 100x. Furthermore, the situation remains multipolar and competitive for a while.

2b. The alignment tax is <0.001%, so it’s not a concern.

2c. The leading coalition is further ahead than the alignment tax amount, and can and will execute a pivotal act, thus ending the risk period and rendering the alignment tax irrelevant.

A person whose mainline is {1a --> 1b --> 2b or 2c} might say “alignment is unsolved, solving it is mostly a discrete thing, and alignment taxes and multipolar incentives aren’t central”

Whereas someone who thinks we’re already in 2a might say “alignment isn’t hard, the problem is incentives and competitiveness”

Someone whose mainline is {1a --> 2a} might say “We need to both ‘solve alignment at all’ AND either get the tax to be really low or do coordination. Both are hard, and both are necessary.”

# Vivek Hebbar’s Shortform

If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.

What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?

Good point

`Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]` is the probability that, for a random policy `π∈ξ`, that policy has worse utility than the policy `G*` its program dictates; in essence, how good `G`’s policies are compared to random policy selection

What prior over policies?

given `g(G|U)`, we can **infer the probability that an agent `G` has a given utility function `U`**, as

`Pr[U] ∝ 2^-K(U) / Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]`

where `∝` means “is proportional to” and `K(U)` is the Kolmogorov complexity of utility function `U`.

Suppose the prior over policies is max-entropy (uniform over all action sequences). If the number of “actions” is greater than the number of bits it takes to specify my brain^{[1]}, it seems like it would conclude that my utility function is something like “1 if {acts exactly like [insert exact copy of my brain] would}, else 0”.

- ^
Idk if this is plausible
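Here is a small toy computation of this worry. Everything is made up for illustration: a 12-action “brain”, hand-assigned complexities K, and three candidate utility functions. The exact-copy indicator wins the score `2^-K(U) / Pr[U(π) ≥ U(G*)]` as soon as its complexity is no more than the number of actions.

```python
from itertools import product

n = 12                                            # number of binary "actions"
policies = list(product([0, 1], repeat=n))
g_star = (0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0)     # the agent's actual actions

# Candidate utilities with hand-assigned complexities K (in bits, made up):
candidates = {
    "first_action_is_0": (lambda pi: 1 if pi[0] == 0 else 0, 5),
    "count_zeros":       (lambda pi: sum(1 for a in pi if a == 0), 6),
    "exact_copy":        (lambda pi: 1 if pi == g_star else 0, n),
}

def score(U, K):
    # Pr[U] ∝ 2^-K(U) / Pr_{π∈ξ}[U(π) ≥ U(G*)], with ξ uniform (max-entropy)
    beats = sum(1 for pi in policies if U(pi) >= U(g_star))
    return 2 ** -K / (beats / len(policies))

scores = {name: score(U, K) for name, (U, K) in candidates.items()}
for name, sc in scores.items():
    print(name, sc)

# Only g_star itself satisfies the indicator, so its Pr term is 2^-n,
# which cancels K ≈ n and dominates the smoother candidate utilities.
```

The effect depends on `g_star` not being exactly optimal for any of the simpler utilities; if it were, those would share the tiny Pr term without the complexity penalty.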

Consider this quote by Alan Watts:

*There are basically two kinds of philosophy. One’s called prickles, the other’s called goo. And prickly people are precise, rigorous, logical. They like everything chopped up and clear. Goo people like it vague. For example, in physics, prickly people believe that the ultimate constituents of matter are particles. Goo people believe it’s waves.*

*facepalm*

If one has a technical understanding of QFT^{[1]} (or even half of a technical understanding, like me), this sounds totally silly. There’s no *real* question as to whether things are fundamentally “particles” or “waves”. There’s nothing “goo” about waves either; waves refer to mathematically precise phenomena. Physics is precise and mathematical (“prickly” if you like). Anyone who understands QFT must have a prickly side, and can deal with both waves and particles without confusion. Those who think physics can be “vague” *don’t understand the physics*.

(Critiquing Watts of course, not primarily the author of this post)

- ^
Quantum field theory


In theory, there can be multiple disconnected manifolds like this.

This read to me like you were saying “In Vivek’s framework, value shards explain away ..” and I was confused. I now think you mean “My take on Vivek’s is that value shards explain away ..”. Maybe reword for clarity?

(Might have a substantive reply later)