# Vivek Hebbar

Karma: 288
• 15 Jul 2022 23:08 UTC
LW: 2 AF: 2

falls somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix

What is your list of problems by urgency, btw? Would be curious to know.

• If we didn’t even have the verb “to fly”, and nobody had seen something fly, “going up and travelling sideways while hovering some distance above the ground” would have been a weird niche idea, and people like the Wright Brothers would have probably never even heard of it. It could have easily taken decades longer.

• 11 Jul 2022 8:55 UTC
LW: 5 AF: 4
in reply to: Thomas Kwa’s comment

From this paper, “Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)”

So for overparameterized nets, the answer is probably:

• There is only one solution manifold, so there are no separate basins. Every solution is connected.

• We can salvage the idea of “basin volume” as follows:

• In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.

• In the dimensions parallel to the manifold, ask “how far can I move before it stops being the ‘same function’?”. If we define “sameness” as “same behavior on the validation set”, then this means looking at the Jacobian of that behavior in the plane of the manifold.

• Multiply the two hypervolumes to get the hypervolume of our “basin segment” (very roughly, the region of the basin which drains to our specific model)
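The two factors could be computed roughly like this (a minimal numpy sketch; `basin_segment_volume`, the tolerance, and the cutoff conventions are all my own illustrative choices, not from the comment):

```python
import numpy as np

def basin_segment_volume(hessian, val_jacobian, tol=1e-8, level=1.0):
    """Sketch of the 'basin segment' hypervolume described above.

    Perpendicular to the solution manifold: use the nonzero Hessian
    eigenvalues -- for a locally quadratic loss, the cross-section of
    {loss <= level} is an ellipsoid with semi-axes sqrt(2*level/lam).
    Parallel to the manifold: use singular values of the Jacobian of
    validation-set behavior restricted to the flat directions, i.e.
    how far you can move before behavior changes by ~1 unit.
    """
    lam, vecs = np.linalg.eigh(hessian)
    flat = lam < tol                     # directions along the manifold
    curved = ~flat
    # Perpendicular cross-section (product of semi-axes, up to a constant)
    perp = np.prod(np.sqrt(2 * level / lam[curved]))
    # Parallel extent from the validation-behavior Jacobian in the
    # plane of the manifold
    J_flat = val_jacobian @ vecs[:, flat]
    s = np.linalg.svd(J_flat, compute_uv=False)
    par = np.prod(1.0 / s[s > tol])
    return perp * par
```

The product of the two hypervolumes is then the (very rough) volume of the region of the basin draining to this specific model.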

1. ^

There are other “sameness” measures which look at the internals of the model; I will be proposing one in an upcoming post.

• 11 Jul 2022 1:15 UTC
4 points
in reply to: Daniel Kokotajlo’s comment

seems like AI tech could totally lead to much faster growth than today, and in particular faster than exponential growth

Strongly agree.

Are you modelling history as a series of different regimes, each one of which is exponential but taken together comprise power-law growth?

I am not. The model is fully continuous, and involves the variables {science, technology, population, capital}. When you run the model, it naturally gives rise to a series of “phase changes”. The phase changes are smooth but still quite distinct. Some of them are caused by changes in which inputs are bottlenecking a certain variable.

The phases predicted are:

1. Super-exponential growth (Sci&Tech bottlenecked by labor surplus; BC to ~1700 AD (??))

3. Fast exponential growth for a short period (population growth slows, causing less consumption)

4. Slow exponential growth for some time (less population growth --> less science --> less economic growth after a delay)

5. Super-exponential growth as AI replaces human researchers (population stops bottlenecking Sci&Tech as capital can be converted into intelligence)
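A minimal coupled system of this flavor can be sketched as follows (entirely my own illustrative toy, not the actual model; every coefficient and functional form here is made up):

```python
def simulate(steps=2000, dt=0.01):
    """Toy endogenous-growth sketch with the four coupled variables:
    science S, technology T, population P, capital K."""
    S, T, P, K = 1.0, 1.0, 1.0, 1.0
    gdp = []
    for _ in range(steps):
        Y = (T * P) ** 0.6 * K ** 0.4           # output
        dS = 0.01 * P * S ** 0.5                # science bottlenecked by labor
        dT = 0.02 * S                           # tech follows science
        dP = 0.005 * P * max(0.0, 1 - P / 50)   # logistic demographic slowdown
        dK = 0.1 * Y - 0.02 * K                 # investment minus depreciation
        S += dS * dt
        T += dT * dt
        P += dP * dt
        K += dK * dt
        gdp.append(Y)
    return gdp
```

Running a system like this and watching which term dominates each derivative is how the distinct-but-smooth "phase changes" show up.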

My claim is that:

• We are in phase 4, and we don’t have enough automation of research to see the beginnings of phase 5 in GDP data.

• Extrapolating GDP data tells us basically zero about when phase 5 will start. The timing can only be predicted with object-level reasoning about AI.

• Phase 4 doesn’t fit the model of “growth always increases from one phase to the next”. Indeed, if you look at real economic data, the first world has had lower growth in recent decades than it did previously. Hence, power law extrapolation across phases is inappropriate.

1. ^

I don’t mean this in a mathematically rigorous way

2. ^

As I think about this more and compare to what actually happened in history, I’m starting to doubt my model a lot more, since I’m not sure if the timing and details of the postulated phases line up properly with real world data.

• 11 Jul 2022 0:10 UTC
LW: 3 AF: 3
in reply to: Thomas Larsen’s comment

The loss is defined over all dimensions of parameter space, so it is still a function of all 3 x’s. You should think of it as $L(x_1, x_2, x_3)$, where $L$ happens not to depend on $x_3$. Its thickness in the $x_3$ direction is infinite, not zero.

Here’s what a zero-determinant Hessian corresponds to:

The basin here is not lower dimensional; it is just infinite in some dimension. The simplest way to fix this is to replace the infinity with some large value. Luckily, there is a fairly principled way to do this:

1. Regularization / weight decay provides actual curvature, which should be added in to the loss, and doing this is the same as adding $\lambda I$ to the Hessian (where $\lambda$ is the weight decay coefficient).

2. The scale of the initialization distribution provides a natural scale for how much volume an infinite sweep should count as (very roughly, the volume only matters if it overlaps with the initialization distribution, and the distance of sweep for which this is true is on the order of $\sigma$, the standard deviation of the initialization).

So this is a fairly principled correction, and much better than just “throwing out” the other dimensions. “Throwing out” dimensions is unprincipled, dimensionally incorrect, numerically problematic, and should give worse results.
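A sketch of this correction in code (hypothetical helper; the log-volume box approximation and the parameter names are mine, not from the comment):

```python
import numpy as np

def corrected_log_volume(hessian, weight_decay=1e-4, init_std=1.0, level=1.0):
    """Basin log-volume with the zero-eigenvalue fix described above.

    Weight decay adds curvature lam*I to the Hessian, so no eigenvalue
    is exactly zero; directions that would sweep to infinity are cut
    off at the scale of the initialization standard deviation.
    """
    lam = np.linalg.eigvalsh(hessian) + weight_decay  # regularized curvature
    # Semi-axis of the {loss <= level} ellipsoid per direction, capped
    # at the init scale (volume far outside init doesn't matter).
    axes = np.minimum(np.sqrt(2 * level / lam), init_std)
    return np.sum(np.log(2 * axes))   # log-volume of the bounding box
```

With this, a flat direction contributes a large-but-finite factor of order the initialization width instead of an infinity.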

• 10 Jul 2022 9:01 UTC
12 points
in reply to: Daniel Kokotajlo’s comment

Some problems with the power law extrapolation for GDP:

1. The graph is for the whole world, not just the technological leading edge, which obscures the thing which is conceivably relevant (the endogenous trend in tech advancement at the leading edge)

2. The power law model is a bad fit for the GDP per capita of the first world in the last 50-100 years

3. Having built a toy endogenous model of economic growth, I see no gears-level reason to expect power law growth in our current regime. (Disclaimer: I’m not an economist, and haven’t tested my model on anything.) The toy model presented in the OpenPhil report is much simpler and IMO less realistic.

# Hes­sian and Basin volume

10 Jul 2022 6:59 UTC
32 points
• I will split this into a math reply, and a reply about the big picture /​ info loss interpretation.

Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.

I agree that assuming MSE, and looking at a local minimum, you have

This is still false! Edit: I am now confused, I don’t know if it is false or not.

You are conflating $\nabla_\theta L$ and $\nabla_f L$. Adding disambiguation, we have:

$$\nabla^2_\theta L = G\,(\nabla^2_f L)\,G^\top + \sum_i \frac{\partial L}{\partial f_i}\,\nabla^2_\theta f_i$$

So we see that the second term disappears if $\nabla_f L = 0$. But the critical point condition is $\nabla_\theta L = 0$. From the chain rule, we have:

$$\nabla_\theta L = G\,\nabla_f L$$

So it is possible to have a local minimum where $\nabla_f L \neq 0$, if $\nabla_f L$ is in the left null-space of the Jacobian $G^\top$. There is a nice qualitative interpretation as well, but I don’t have energy/​time to explain it.

However, if we are at a perfect-behavior global minimum of a regression task, then $\nabla_f L$ is definitely zero.
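A concrete instance of the non-global case (my own toy construction, not from the thread): a 1-parameter "network" with behavior $f(\theta) = (\theta^2, \theta^3)$ and loss $L(f) = f_1 + f_2^2$, so $L(\theta) = \theta^2 + \theta^6$. At the minimum $\theta = 0$ we have $\nabla_f L = (1, 0) \neq 0$, but the Jacobian vanishes there, so $\nabla_\theta L = 0$ anyway:

```python
# Local minimum with nonzero dL/df (toy construction):
# behavior f(t) = (t^2, t^3), loss L(f) = f1 + f2^2, so L(t) = t^2 + t^6.
def L(t):
    f1, f2 = t ** 2, t ** 3
    return f1 + f2 ** 2

eps = 1e-5
# Numerical gradient of L w.r.t. the parameter at t = 0:
dL_dt = (L(eps) - L(-eps)) / (2 * eps)
# dL/df at f(0) = (0, 0) is (1, 2*f2) = (1, 0) -- nonzero.
# But the Jacobian df/dt at t = 0 is (2t, 3t^2) = (0, 0), so the
# chain rule gives dL/dt = 0 even though dL/df != 0.
print(dL_dt)   # 0.0
```

Note this loss is not an MSE over behavior components, consistent with the claim that the phenomenon cannot happen at a perfect-fit global minimum of a regression task.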

A few points about rank equality at a perfect-behavior global min:

1. Rank(Hessian(Loss)) = Rank(G) holds as long as $\nabla^2_f L$ is a diagonal matrix. It need not be a multiple of the identity.

2. Hence, rank equality holds anytime the loss is a sum of functions s.t. each function only looks at a single component of the behavior.

3. If the network output is 1d (as assumed in the post), this just means that the loss is a sum over losses on individual inputs.

4. We can extend to larger outputs by having the behavior be the flattened concatenation of outputs. The rank equality condition is still satisfied for MSE, Binary Cross Entropy, and Cross Entropy over a probability vector. It is not satisfied if we consider the behavior to be raw logits (before the softmax) and softmax+CrossEntropy as the loss function. But we can easily fix that by considering probability (after softmax) as behavior instead of raw logits.
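The rank equality is easy to check numerically at a zero-loss global minimum of a toy regression (my own example; the model, targets, and tolerances are made up for illustration):

```python
import numpy as np

# Tiny overparameterized regression: f_i(theta) = theta1*theta2*x_i,
# targets y_i = 6*x_i, MSE loss.  At a zero-loss global minimum we
# check Rank(Hessian(Loss)) = Rank(G) numerically.
x = np.array([1.0, 2.0, 3.0])
theta = np.array([2.0, 3.0])          # theta1*theta2 = 6 -> global min

def behavior(t):
    return t[0] * t[1] * x

def loss(t):
    return np.sum((behavior(t) - 6.0 * x) ** 2)

def num_grad(f, t, eps=1e-5):
    """Central-difference gradient of scalar f at t."""
    g = np.zeros_like(t)
    for i in range(len(t)):
        e = np.zeros_like(t)
        e[i] = eps
        g[i] = (f(t + e) - f(t - e)) / (2 * eps)
    return g

# G: columns are gradients of each behavior component w.r.t. params
G = np.stack([num_grad(lambda t: behavior(t)[i], theta) for i in range(3)], axis=1)
# Hessian of the loss by differentiating the numerical gradient
H = np.stack([num_grad(lambda t: num_grad(loss, t)[i], theta) for i in range(2)])

rank_H = np.linalg.matrix_rank(H, tol=1e-2)
rank_G = np.linalg.matrix_rank(G, tol=1e-2)
print(rank_H, rank_G)   # 1 1
```

Both ranks come out to 1 here: the behavior gradients are all proportional, and at zero residual the Hessian is exactly the Gauss-Newton term built from them.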

I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

Ah nice, didn’t know what it was called /​ what field it’s from. I should clarify that “key result” here just meant “key result of the math so far—pay attention”, not “key result of the whole post” or “profound/​original”.

The Jacobian matrix is what you call $G^\top$, I think

Yeah, you’re right. Previously I thought $G$ was the Jacobian, because I had the Jacobian transposed in my head. I only realized that $G^\top$ has a standard name fairly late (as I was writing the post, I think), and decided to keep the non-standard notation since I was used to it, and just add a footnote.

Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank.

Yes; this is the whole point of the post. The math is just a preliminary to get there.

But in my opinion I think you would need to be careful when you get to claim 3. I think the connection between loss and behavior is not spelled out in enough detail: Behaviour can change while loss could remain constant, right?

Good catch—it is technically possible at a local minimum, although probably extremely rare. At a global minimum of a regression task it is not possible, since there is only one behavior vector corresponding to zero loss. Note that behavior in this post was defined specifically on the training set. At global minima, “Rank(Hessian(Loss))=Rank(G)” should be true without exception.

And more generally, in exactly which directions do the implications go?

In “Flat basin $\leftrightarrow$ Low-rank Hessian $=$ Low-rank $G$ $\Leftarrow$ High manifold dimension”:

The first “$\leftrightarrow$” is a correlation. The second arrow, “$\Leftarrow$”, is the implication “High manifold dimension ⇒ Low-rank $G$”. (Based on what you pointed out, this only works at global minima).

when you say things like “Low rank indicates information loss”

“Indicates” here should be taken as slightly softened from “implies”, like “strongly suggests but can’t be proven to imply”. Can you think of plausible mechanisms for causing low rank which don’t involve information loss?

• 23 May 2022 5:11 UTC
LW: 1 AF: 1

I’m pretty sure my framework doesn’t apply to grokking. I usually think about training as ending once we hit zero training loss, whereas grokking happens much later.

• About the contours: While the graphic shows a finite number of contours with some spacing, in reality there are infinite contour planes and they completely fill space (as densely as the reals, if we ignore float precision). So at literally every point in space there is a blue contour, and a red one which exactly coincides with it.

• Yeah, this seems roughly correct, and similar to what I was thinking. There is probably even a direct connection to the “asymptotic equipartitioning” math, via manifold counts containing terms like “A choose B” from permutations of neurons.

• Yep, I am assuming MSE loss generally, but as you point out, any smooth and convex loss function will be locally approximately quadratic. “Saddle points all the way down” isn’t possible if a global min exists, since a saddle point implies the existence of an adjacent lower point. As for asymptotes, this is indeed possible, especially in classification tasks. I have basically ignored this and stuck to regression here.

I might return to the issue of classification /​ solutions at infinity in a later post, but for now I will say this: It doesn’t seem that much different, especially when it comes to manifold dimension; an m-dimensional manifold in parameter space generally extends to infinity, and it corresponds to an m-1 dimensional manifold in angle space (you can think of it as a hypersphere of asymptote directions).

I would say the main things neglected in this post are:

1. Manifold count (Most important neglected thing)

2. Basin width in non-infinite directions

3. Distance from the origin

These apply to both regression and classification.

• This is only one step toward a correct theory of inductive bias. I would say that “clear and important implications” will only come weeks from now, when we are much less confused and have run more experiments.
The main audience for this post is researchers whose work is directly adjacent to inductive bias and training dynamics. If you don’t need gears-level insights on this topic, I would say the tl;dr is: “Circuit simplicity seems kind of wrong; there’s a cool connection between information loss and basin flatness which is probably better but maybe still not very predictive; experiments are surprising so far; stay posted for more in ~2 weeks.”

# [Short ver­sion] In­for­ma­tion Loss --> Basin flatness

21 May 2022 12:59 UTC
11 points