• Do you generally think that people in the AI safety community should write publicly about what they think is “the missing AGI ingredient”?

It’s remarkable that this post was well received on the AI Alignment Forum (18 karma points before my strong downvote).

• I will split this into a math reply, and a reply about the big picture / info loss interpretation.

Math reply:

Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.

I agree that assuming MSE, and looking at a local minimum, you have Hessian(Loss) = 2GᵀG.

This is still false! Edit: I am now confused, I don’t know if it is false or not.

You are conflating ∇_f c and ∇_θ L. Adding disambiguation, we have:

Hessian(Loss) = Gᵀ (∇²_f c) G + Σ_i (∂c/∂f_i) ∇²_θ f_i

So we see that the second term disappears if ∇_f c = 0. But the critical point condition is ∇_θ L = 0. From chain rule, we have:

∇_θ L = Gᵀ ∇_f c

So it is possible to have a local minimum where ∇_f c ≠ 0, if ∇_f c is in the left null-space of G. There is a nice qualitative interpretation as well, but I don’t have energy/time to explain it.
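A tiny numerical sketch of this left-null-space possibility (my own toy construction, not from the post): one parameter t, two behaviors f1(t) = t², f2(t) = t, targets (1, 0). At t = 0 the chain-rule gradient Gᵀ∇_f c vanishes even though ∇_f c ≠ 0.

```python
# Toy example (my construction): 1 parameter, 2 data points.
# Behaviors f1(t) = t^2, f2(t) = t; targets y = (1, 0); MSE-style loss.
def F(t):
    return (t * t, t)

def loss(t):
    f1, f2 = F(t)
    return (f1 - 1.0) ** 2 + (f2 - 0.0) ** 2

t0 = 0.0
f1, f2 = F(t0)
grad_c = (2 * (f1 - 1.0), 2 * (f2 - 0.0))   # gradient of c in behavior space: (-2, 0)
G = [2 * t0, 1.0]                            # behavior gradients df_i/dt at t0: (0, 1)

# Chain rule: grad_t L = G^T grad_c
grad_L = G[0] * grad_c[0] + G[1] * grad_c[1]
print(grad_L)   # 0.0: a critical point, even though grad_c != 0
                # (grad_c lies in the left null-space of G)

# Finite-difference check that t0 really is a critical point of the loss
eps = 1e-6
fd = (loss(t0 + eps) - loss(t0 - eps)) / (2 * eps)
print(fd)       # ~0.0
```

Note that in this particular example the second Hessian term contributes 2·(f1 − y1)·f1″ = −4, so the Hessian of the loss is 2·(0² + 1²) − 4 = −2 and the critical point is not a minimum; whether a genuine local minimum can arise this way is exactly the point of uncertainty here.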

However, if we are at a perfect-behavior global minimum of a regression task, then ∇_f c is definitely zero.

A few points about rank equality at a perfect-behavior global min:

1. Rank(Hessian(Loss)) = Rank(G) holds as long as ∇²_f c is a diagonal matrix. It need not be a multiple of the identity.

2. Hence, rank equality holds anytime the loss is a sum of functions s.t. each function only looks at a single component of the behavior.

3. If the network output is 1d (as assumed in the post), this just means that the loss is a sum over losses on individual inputs.

4. We can extend to larger outputs by having the behavior be the flattened concatenation of outputs. The rank equality condition is still satisfied for MSE, Binary Cross Entropy, and Cross Entropy over a probability vector. It is not satisfied if we consider the behavior to be raw logits (before the softmax) and softmax+CrossEntropy as the loss function. But we can easily fix that by considering probability (after softmax) as behavior instead of raw logits.
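These claims can be sanity-checked numerically. A minimal sketch (my own toy model, not from the post): f_i(θ) = θ1·θ2·x_i with targets y_i = 0.5·x_i, so θ = (1, 0.5) is a perfect-behavior global minimum of the MSE loss. There G has rank 1, and a finite-difference Hessian of the loss matches 2GᵀG (hence shares its rank):

```python
# Toy regression (my construction): f_i(theta) = theta1 * theta2 * x_i.
xs = [1.0, 2.0, 3.0]
ys = [0.5 * x for x in xs]           # targets hit exactly at theta = (1.0, 0.5)
theta = [1.0, 0.5]                    # perfect-behavior global minimum

def loss(th):
    return sum((th[0] * th[1] * x - y) ** 2 for x, y in zip(xs, ys))

# Behavior gradients: row i of G is (d f_i / d theta1, d f_i / d theta2)
G = [[theta[1] * x, theta[0] * x] for x in xs]   # rows proportional -> rank 1

# 2 * G^T G, the claimed Hessian at a perfect-fit minimum of MSE
GtG2 = [[2 * sum(row[a] * row[b] for row in G) for b in range(2)] for a in range(2)]

# Finite-difference Hessian of the loss
def hessian(th, eps=1e-4):
    H = [[0.0] * 2 for _ in range(2)]
    for a in range(2):
        for b in range(2):
            t = [list(th) for _ in range(4)]
            t[0][a] += eps; t[0][b] += eps
            t[1][a] += eps; t[1][b] -= eps
            t[2][a] -= eps; t[2][b] += eps
            t[3][a] -= eps; t[3][b] -= eps
            H[a][b] = (loss(t[0]) - loss(t[1]) - loss(t[2]) + loss(t[3])) / (4 * eps**2)
    return H

H = hessian(theta)
err = max(abs(H[a][b] - GtG2[a][b]) for a in range(2) for b in range(2))
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]   # ~0: Hessian is rank 1, like G
```

Here `err` comes out tiny and `det` is numerically zero, so Rank(Hessian(Loss)) = Rank(G) = 1 even though there are 2 parameters.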

• Great post! This seems like a useful perspective to keep in mind.

Somewhat orthogonally to the theoretical picture, I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice. For example, if PPO is tuned appropriately, the KL penalty term can be removed from the reward entirely—instead, PPO’s implicit “local” KL penalty controls the rate of policy change.
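For concreteness, here is a minimal sketch of the two mechanisms (generic RLHF-style constructions written for illustration, not code from any particular implementation): an explicit KL penalty folded into the reward, versus the clipped surrogate term through which PPO implicitly limits policy change per update.

```python
# Explicit control: fold a KL penalty against a reference policy into the reward.
def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    # Per-token estimate: beta * (log pi(a|s) - log pi_ref(a|s)) penalizes drift.
    return reward - beta * (logp_policy - logp_ref)

# Implicit control: PPO's clipped surrogate gives zero incentive to push the
# probability ratio pi/pi_old beyond the clip range, bounding per-update change.
def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, growing the ratio past 1 + eps gains nothing:
print(ppo_clipped_term(1.1, advantage=1.0))  # 1.1
print(ppo_clipped_term(5.0, advantage=1.0))  # 1.2 (capped at the clip boundary)
```

Removing the `beta` term and relying only on the clipping is the "implicit local KL penalty" option described above; how well it controls drift then depends on tuning the clip range and the number of optimization epochs per batch.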

If we were in the regime of optimizing the policy significantly more, experience from traditional RL suggests that there would be an exploration-exploitation trade-off, which is something that the RL perspective may again offer insight into.

• Thanks for the substantive reply.

First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out. The loss of a model with parameters w can be described by introducing the actual function c that maps the behavior to the real numbers, right? i.e. given some actual function c: ℝⁿ → ℝ we have:

L(w) = c(F(w))

i.e. it’s c that might be something like MSE, but the function F is of course more mysterious because it includes the way that parameters are actually mapped to a working model. Anyway, to perform some computations with this, we are looking at an expression like

c(F(w))

We want to differentiate this twice with respect to w, essentially. Firstly, we have

∇_w [c(F(w))] = DF(w)ᵀ ∇c(F(w))

where—just to keep track of this—we’ve got:

∇_w [c(F(w))] ∈ ℝᵈ,   DF(w)ᵀ : ℝⁿ → ℝᵈ,   ∇c(F(w)) ∈ ℝⁿ

Or, using ‘coordinates’ to make it explicit:

∂/∂w_j [c(F(w))] = Σ_i (∂c/∂y_i)(F(w)) · (∂f_i/∂w_j)(w)

for j = 1, …, d. Then for the Hessian we differentiate again:

∇²_w [c(F(w))] = DF(w)ᵀ ∇²c(F(w)) DF(w) + Σ_i (∂c/∂y_i)(F(w)) ∇²_w f_i(w)

Or,

∂²/∂w_j ∂w_k [c(F(w))] = Σ_{i,l} (∂²c/∂y_i ∂y_l)(∂f_i/∂w_j)(∂f_l/∂w_k) + Σ_i (∂c/∂y_i)(∂²f_i/∂w_j ∂w_k)

This is now at the level of matrices. Avoiding getting into any depth about tensors and indices, the term ∇²_w F is basically a tensor-type object and it’s paired with ∇c(F(w)), which is a vector, to give something that is a d × d matrix.

So what I think you are saying now is that if we are at a local minimum for c, then the second term on the right-hand side vanishes (because the term includes the first derivatives of c, which are zero at a minimum). You can see however that if the Hessian of c is not a multiple of the identity (like it would be for MSE), then the claimed relationship does not hold, i.e. it is not the case that in general, at a minimum of c, the Hessian of the loss is equal to a constant times GᵀG. So maybe you really do want to explicitly assume something like MSE.

I agree that assuming MSE, and looking at a local minimum, you have Hessian(Loss) = 2GᵀG.

(In case it’s of interest to anyone, googling turned up this recent paper https://openreview.net/forum?id=otDgw7LM7Nn which studies pretty much exactly the problem of bounding the rank of the Hessian of the loss. They say: “Flatness: A growing number of works [59–61] correlate the choice of regularizers, optimizers, or hyperparameters, with the additional flatness brought about by them at the minimum. However, the significant rank degeneracy of the Hessian, which we have provably established, also points to another source of flatness — that exists as a virtue of the compositional model structure —from the initialization itself. Thus, a prospective avenue of future work would be to compare different architectures based on this inherent kind of flatness.”)

Some broader remarks: I think these are nice observations but unfortunately I think generally I’m a bit confused/unclear about what else you might get out of going along these lines. I don’t want to sound harsh but just trying to be plain: This is mostly because, as we can see, the mathematical part of what you have said is all very simple, well-established facts about smooth functions and so it would be surprising (to me at least) if some non-trivial observation about deep learning came out from it.

In a similar vein, regarding the “cause” of low-rank G, I do think that one could try to bring in a notion of “information loss” in neural networks, but for it to be substantive one needs to be careful that it’s not simply a rephrasing of what it means for the Jacobian to have less-than-full rank.

Being a bit loose/informal now: To illustrate, just imagine for a moment a real-valued function on an interval. I could say it ‘loses information’ where its values cannot distinguish between a subset of points. But this is almost the same as just saying: It is constant on some subset...which is of course very close to just saying the derivative vanishes on some subset. Here, if you describe the phenomena of information loss as concretely as being the situation where some inputs can’t be distinguished, then (particularly given that you have to assume these spaces are actually some kind of smooth/differentiable spaces to do the theoretical analysis), you’ve more or less just built into your description of information loss something that looks a lot like the function being constant along some directions, which means there is a vector in the kernel of the Jacobian. I don’t think it’s somehow incorrect to point to this but it becomes more like just saying ‘perhaps one useful definition of information loss is low rank G’ as opposed to linking one phenomenon to the other.

Sorry for the very long remarks. Of course this is actually because I found it well worth engaging with. And I have a longer-standing personal interest in zero sets of smooth functions!

• Maybe. My model was a bit janky; I basically assumed DSA-ability comes from clock-time lead, but then also assumed that as technology and progress speed up, the necessary clock-time lead shrinks. And I guesstimated that it would shrink to 0.3–3 years. I bet there’s a better way, that pegs DSA-ability to ideas lead… it would be a super cool confirmation of this better model if we could somehow find data confirming that years-needed-for-DSA has fallen in lockstep as ideas-produced-per-year has risen.

• You mention converging to a deterministic policy is bad because of repetition, but did I miss you addressing that it’s also bad because we want diversity? (Edit: now that I reread that sentence, it makes no sense. Sorry!) In some sense we don’t want RL in the limit, we want something a little more aware that we want to sample from a distribution and get lots of different continuations that are all pretty good.

• Thanks for this reply, it’s quite helpful.

I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

Ah nice, didn’t know what it was called / what field it’s from. I should clarify that “key result” here just meant “key result of the math so far—pay attention”, not “key result of the whole post” or “profound/original”.

The Jacobian matrix is what you call G I think

Yeah, you’re right. Previously I thought Gᵀ was the Jacobian, because I had the Jacobian transposed in my head. I only realized that G has a standard name fairly late (as I was writing the post I think), and decided to keep the non-standard notation since I was used to it, and just add a footnote.

Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank.

Yes; this is the whole point of the post. The math is just a preliminary to get there.

But in my opinion I think you would need to be careful when you get to claim 3. I think the connection between loss and behavior is not spelled out in enough detail: Behaviour can change while loss could remain constant, right?

Good catch—it is technically possible at a local minimum, although probably extremely rare. At a global minimum of a regression task it is not possible, since there is only one behavior vector corresponding to zero loss. Note that behavior in this post was defined specifically on the training set. At global minima, “Rank(Hessian(Loss))=Rank(G)” should be true without exception.

And more generally, in exactly which directions do the implications go?

In “Flat basin ⟷ Low-rank Hessian = Low-rank G ⟷ High manifold dimension”:

The first “⟷” is a correlation. The second “⟷” is the implication “High manifold dimension ⇒ Low-rank G”. (Based on what you pointed out, this only works at global minima).

when you say things like “Low rank G indicates information loss”

“Indicates” here should be taken as slightly softened from “implies”, like “strongly suggests but can’t be proven to imply”. Can you think of plausible mechanisms for causing low rank G which don’t involve information loss?

• But realistically not all projects will hoard all their ideas. Suppose instead that for the leading project, 10% of their new ideas are discovered in-house, and 90% come from publicly available discoveries accessible to all. Then, to continue the car analogy, it’s as if 90% of the lead car’s acceleration comes from a strong wind that blows on both cars equally. The lead of the first car/project will lengthen slightly when measured by distance/ideas, but shrink dramatically when measured by clock time.

The upshot is that we should return to that table of factors and add a big one to the left-hand column: Leads shorten automatically as general progress speeds up, so if the lead project produces only a small fraction of the general progress, maintaining a 3-year lead throughout a soft takeoff is (all else equal) almost as hard as growing a 3-year lead into a 30-year lead during the 20th century. In order to overcome this, the factors on the right would need to be very strong indeed.
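The claim that leads grow in ideas while shrinking in clock time can be illustrated with a toy simulation (my own construction, with made-up numbers): both projects ride an accelerating “wind” of public progress, the leader adds a 10% in-house share on top, and we track the gap in ideas versus the gap in clock time.

```python
# Toy model (illustrative numbers only): public progress accelerates 5% per year;
# the leader's total output is wind / 0.9, so 10% of its new ideas are in-house.
def simulate(years=50, wind0=1.0, growth=1.05, inhouse_frac=0.10):
    leader, laggard, wind = 0.0, 0.0, wind0
    leader += 3 * wind0   # 3-year head start, measured at the initial rate
    history = []
    for _ in range(years):
        leader += wind / (1.0 - inhouse_frac)   # wind plus in-house share
        laggard += wind
        idea_lead = leader - laggard
        clock_lead = idea_lead / wind           # years for the laggard to catch up
        history.append((idea_lead, clock_lead))
        wind *= growth
    return history

hist = simulate()
print(hist[0])    # the lead measured in ideas starts near 3 years' worth...
print(hist[-1])   # ...and grows, while the clock-time lead shrinks
```

The direction of the effect is the point: the idea gap keeps widening, but because the denominator (current rate of progress) grows faster, the clock-time lead falls.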

But won’t “ability to get a DSA” be linked to the lead as measured in ideas rather than clock time?

• Retain or forget information and skills over long time scales, in a way that serves its goals. E.g. if it does forget some things, these should be things that are unusually unlikely to come in handy later.

If memory is cheap, designing it to just remember everything may be a good idea. And there may be some architectural reason why choosing to forget things is hard.

• Imagine taking someone’s utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I’d want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.

But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to suffer. Now if my utility function is flipped, that naively means that I want people similar to me to suffer, and people different from me to thrive. But this has a very different outcome if we interpret “similar to me” as de dicto vs de re—i.e. whether it refers to the old me or the new me.
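A type-level sketch of the ambiguity (entirely my own toy formalization): represent “similar to me” either as a reference frozen at flip time (de re) or as one re-evaluated against the agent’s current values (de dicto), and note that sign-flipping gives different answers.

```python
# Toy formalization (mine, purely illustrative). Agents are labeled by values;
# a population is a list of (values, welfare) pairs.
def similarity(a, b):
    return 1.0 if a == b else -1.0

def make_utility(self_values):
    # "I want agents with values similar to mine to thrive."
    def u(population):
        return sum(similarity(self_values, v) * welfare for v, welfare in population)
    return u

old_values, new_values = "nice", "nasty"   # the agent's values change under the flip
pop = [("nice", 1.0)]

u = make_utility(old_values)
flipped_de_re = -u(pop)                            # "similar to me" = the old me
flipped_de_dicto = -make_utility(new_values)(pop)  # re-read against the flipped me

print(flipped_de_re, flipped_de_dicto)   # -1.0 1.0: the two readings diverge
```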

This is a more general problem when one person’s utility function can depend on another person’s, where you can construct circular dependencies (which I assume you can also do in the utility-flipping case). There’s probably been a bunch of work on this, would be interested in pointers to it (e.g. I assume there have been attempts to construct type systems for utility functions, or something like that).

(This note inspired by Mad Investor Chaos, where (SPOILERS) one god declines to take revenge, because they’re the utility-flipped version of another god who would have taken revenge. At first this made sense, but now I feel like it’s not type-safe.)

Actually, this raises a more general point (can’t remember if I’ve made this before): we’ve evolved some values (like caring about revenge) because they’re game-theoretically useful. But if game theory says to take revenge, and also our values say to take revenge, then this is double-counting. So I’d guess that, for much more coherent agents, their level of vengefulness would mainly be determined by their decision theories (which can’t be flipped) rather than their utilities.

• This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

I think that your setup is essentially that there is a d-dimensional parameter space, let’s call it W say, and then for each element of the training set, we can consider the function f_i : W → ℝ which takes in a set of parameters (i.e. a model) and outputs whatever the model does on training data point i. We are thinking of both W and ℝ as smooth (or at least sufficiently differentiable) spaces (I take it).

A contour plane is a level set of one of the f_i, i.e. a set of the form

{w ∈ W : f_i(w) = c}

for some i and some c ∈ ℝ. A behavior manifold is a set of the form

{w ∈ W : f_i(w) = c_i for i = 1, …, n}

for some c = (c_1, …, c_n) ∈ ℝⁿ.

A more concise way of viewing this is to define a single function F = (f_1, …, f_n) : W → ℝⁿ and then a behavior manifold is simply a level set of this function. The map F is a submersion at w if the Jacobian matrix at w is a surjective linear map. The Jacobian matrix is what you call G I think (because the Jacobian is formed with each row equal to a gradient vector with respect to one of the output coordinates). It doesn’t matter much because what matters to check the surjectivity is the rank. Then the standard result implies that given c ∈ ℝⁿ, if F is a submersion in a neighbourhood of a point w with F(w) = c, then F⁻¹(c) is a smooth (d − n)-dimensional submanifold in a neighbourhood of w.

Essentially, in a neighbourhood of a point at which the Jacobian of F has full rank, the level set through that point is a (d − n)-dimensional smooth submanifold.
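A concrete instance of this (my own toy example, with d = 3 parameters and n = 2 outputs): F(w) = (w1² + w2² + w3², w3). Wherever the 2 × 3 Jacobian has rank 2, the level set through that point is a smooth (3 − 2) = 1-dimensional manifold; e.g. the level set through (1, 0, 0) is the unit circle in the w3 = 0 plane.

```python
# Toy submersion check (my example): F: R^3 -> R^2, F(w) = (|w|^2, w3).
def jacobian(w):
    w1, w2, w3 = w
    return [[2 * w1, 2 * w2, 2 * w3],   # gradient of f1 = w1^2 + w2^2 + w3^2
            [0.0, 0.0, 1.0]]            # gradient of f2 = w3

def rank2x3(J):
    # A 2x3 matrix has rank 2 iff some 2x2 minor is nonzero.
    minors = [J[0][a] * J[1][b] - J[0][b] * J[1][a]
              for a in range(3) for b in range(3) if a < b]
    if any(abs(m) > 1e-12 for m in minors):
        return 2
    return 1 if any(abs(x) > 1e-12 for row in J for x in row) else 0

print(rank2x3(jacobian((1.0, 0.0, 0.0))))   # 2 -> level set is 1-dim nearby
print(rank2x3(jacobian((0.0, 0.0, 0.0))))   # 1 -> f1's gradient vanishes; theorem silent
```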

Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank. But in my opinion I think you would need to be careful when you get to claim 3. I think the connection between loss and behavior is not spelled out in enough detail: Behaviour can change while loss could remain constant, right? And more generally, in exactly which directions do the implications go? Depending on exactly what you are trying to establish, this could actually be a bit of a ‘tip of the iceberg’ situation though. (The study of this sort of thing goes rather deep; Vladimir Arnold et al. wrote in their 1998 book: “The theory of singularities of smooth maps is an apparatus for the study of abrupt, jump-like phenomena—bifurcations, perestroikas (restructurings), catastrophes, metamorphoses—which occur in systems depending on parameters when the parameters vary in a smooth manner”.)

Similarly when you say things like “Low rank G indicates information loss”, I think some care is needed because the paragraphs that follow seem to be getting at something more like: If there is a certain kind of information loss in the early layers of the network, then this leads to low rank G. It doesn’t seem clear that low rank G is necessarily indicative of information loss?

• Correct, though note that this doesn’t let you pick a specific optimization target for that AGI. You’d need more bits in order to specify an optimization target. In other words, there’s still a meaningful sense in which you can’t “steer” the world into a specific target other than one it was likely to hit anyway, at least not without more bits.

• Suppose I send a few lines of code to a remote server. Those lines are enough to bootstrap a superintelligence which goes on to strongly optimize every aspect of the world. This counts as a fairly small amount of optimization power, as the chance of me landing on those lines of code by pure chance isn’t that small.

• Nice post! The Game Theory / Bureaucracy idea is interesting. It reminds me of Drexler’s CAIS proposal, where services are combined into an intelligent whole. But I (and Drexler, I believe) agree that much more work could be spent on figuring out how to actually design/combine these systems.

• To check my noob understanding:

Suppose you are playing a MMO video game which accepts about 10,000 binary inputs per second, in the form of various keyboard buttons being pressed or not. You want to cause something crazy and meme-worthy to happen in the game, like everyone in the game falling over dead simultaneously. You have an hour or so of playtime total. What’s the craziest thing you can make happen?

Answer: Well, whatever it is, it has to be at least 1/2^(10,000×3600) likely to happen already. If it was less likely than that, you wouldn’t be able to make it likely even if you played perfectly.
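The arithmetic behind that bound, using the numbers from the example (a small illustrative snippet I added):

```python
# Bound from the example: 10,000 binary inputs/second for one hour.
bits = 10_000 * 3600           # total binary choices available
print(bits)                    # 36000000

# There are only 2**bits distinct input sequences, so any outcome the player
# can force must already have prior probability at least 2**-bits: choosing
# inputs can concentrate probability by at most a factor of 2**bits.
# Toy version with a tiny channel: 3 binary inputs reach at most 2**3 = 8
# distinct input sequences, so a forcible outcome had prior >= 1/8.
n = 3
sequences = {tuple(map(int, format(i, "03b"))) for i in range(2 ** n)}
print(len(sequences))          # 8
```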

• 23 May 2022 5:11 UTC
in reply to: Qria’s comment

I’m pretty sure my framework doesn’t apply to grokking. I usually think about training as ending once we hit zero training loss, whereas grokking happens much later.

• Does this framework also explain the grokking phenomenon?

I haven’t yet fully understood your hypothesis, except that the behaviour gradient is useful for measuring something related to inductive bias, but the paper above seems to touch on a similar topic (generalization) with similar methods (experiments on fully known toy examples such as SO5).

• That’s basically correct; the main immediate gain is that it makes it much easier to compute abstractions and compute using abstractions.

One additional piece is that it hints towards a probably-more-fundamental derivation of the theorems in which maximum entropy plays a more central role. The maximum entropy Telephone Theorem already does that, but the resampling + gKPD approach routes awkwardly through gKPD instead; there’s probably a nice way to do it directly via constrained maximization of entropy. That, in turn, would probably yield stronger and simpler theorems.