I’m a PhD student at the University of Amsterdam. I have research experience in multivariate information theory and equivariant deep learning and recently got very interested in AI alignment. https://langleon.github.io/
Leon Lang
(Fwiw, I don’t remember problems with stipend payout at SERI MATS in the winter program. I was a winter scholar in 2022/23.)
We Should Prepare for a Larger Representation of Academia in AI Safety
This is very helpful, thanks! Actually, the post includes several sections, including in the appendix, that might be more interesting to many readers than the grant recommendations themselves. Maybe it would be good to change the title a bit so that people also expect other updates.
Thanks for the reply!
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. ). Precisely, suppose is the number of parameters, then you are in the regular case if can be expressed as a full-rank quadratic form near each singularity,
Anything less than this is a strictly singular case.
So if , then is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it’s justified from the algebraic-geometry perspective.
Zeta Functions in Singular Learning Theory
In this shortform, I very briefly explain my understanding of how zeta functions play a role in the derivation of the free energy in singular learning theory. This is entirely based on slide 14 of the SLT low 4 talk of the recent summit on SLT and Alignment, so feel free to ignore this shortform and simply watch the video.
The story is this: we have a prior , a model , and there is an unknown true distribution . For model selection, we are interested in the evidence of our model for a data set , which is given by
where is the empirical KL divergence. In fact, we are interested in selecting the model that maximizes the average of this quantity over all data sets. The average is then given by
where is the Kullback-Leibler divergence.
But now we have a problem: how do we compute this integral? Computing this integral is what the free energy formula is about.
The answer: by computing a different integral. So now, I’ll explain the connection to different integrals we can draw.
Let
which is called the state density function. Here, is the Dirac delta function. For different , it measures the density of states (= parameter vectors) that have . It is thus a measure for the “size” of different level sets. This state density function is connected to two different things.
Laplace Transform to the Evidence
First of all, it is connected to the evidence above. Namely, let be the Laplace transform of . It is a function given by
In the first step, we changed the order of integration, and in the second step we used the defining property of the Dirac delta. Great, so this tells us that ! So this means we essentially just need to understand .
Mellin Transform to the Zeta Function
But how do we compute ? By using another transform. Let be the Mellin transform of . It is a function (or maybe only defined on part of ?) given by
Again, we used a change in the order of integration and then the defining property of the Dirac delta. This is called a Zeta function.
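Written out, the state density and its two transforms take the following form. This is a sketch in notation I am assuming from common SLT conventions ($\varphi$ for the prior, $K$ for the KL divergence, $v$ for the state density, $Z(n)$ for the averaged evidence, $\zeta$ for the zeta function); the symbols used in the talk may differ.

```latex
\begin{align*}
v(t) &= \int \varphi(w)\,\delta\bigl(t - K(w)\bigr)\,dw, \\
Z(n) &= \int_0^\infty e^{-nt}\,v(t)\,dt
      = \int \varphi(w) \int_0^\infty e^{-nt}\,\delta\bigl(t - K(w)\bigr)\,dt\,dw
      = \int \varphi(w)\,e^{-nK(w)}\,dw, \\
\zeta(z) &= \int_0^\infty t^{z}\,v(t)\,dt
      = \int \varphi(w) \int_0^\infty t^{z}\,\delta\bigl(t - K(w)\bigr)\,dt\,dw
      = \int \varphi(w)\,K(w)^{z}\,dw.
\end{align*}
```

In both lines the first step swaps the order of integration and the second evaluates the inner integral at $t = K(w)$, which is allowed because $K(w) \geq 0$.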
What’s this useful for?
The Mellin transform has an inverse. Thus, if we can compute the zeta function, we can also compute the original evidence as
Thus, we essentially changed our problem to the problem of studying the zeta function. To compute the integral of the zeta function, it is then useful to perform blowups to resolve the singularities in the set of minima of , which is where algebraic geometry enters the picture. For more on all of this, I refer, again, to the excellent SLT low 4 talk of the recent summit on singular learning theory.
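As a small numerical illustration of what the free energy formula predicts in the simplest case, the sketch below evaluates the averaged evidence $\int \varphi(w) e^{-nK(w)} dw$ on a grid for two one-parameter toy divergences and fits the slope of $-\log Z(n)$ against $\log n$. The uniform prior on $[-1, 1]$, the choices $K(w) = w^2$ and $K(w) = w^4$, and all function names are my assumptions for illustration, not taken from the talk.

```python
import numpy as np

# Parameter grid on the support of the assumed uniform prior on [-1, 1].
GRID = np.linspace(-1.0, 1.0, 400_001)

def averaged_evidence(K, n):
    """Trapezoid-rule estimate of Z(n) = integral of prior(w) * exp(-n K(w)) dw."""
    integrand = 0.5 * np.exp(-n * K(GRID))  # uniform prior density is 1/2
    dw = GRID[1] - GRID[0]
    return dw * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

def free_energy_slope(K, ns=(1e3, 1e4, 1e5, 1e6)):
    """Least-squares slope of -log Z(n) against log n."""
    log_n = np.log(np.array(ns))
    free_energy = np.array([-np.log(averaged_evidence(K, n)) for n in ns])
    return np.polyfit(log_n, free_energy, 1)[0]

slope_quadratic = free_energy_slope(lambda w: w**2)  # regular minimum
slope_quartic = free_energy_slope(lambda w: w**4)    # degenerate minimum

print(f"K = w^2: slope ~ {slope_quadratic:.3f}")
print(f"K = w^4: slope ~ {slope_quartic:.3f}")
```

The fitted slopes come out close to 1/2 and 1/4, matching the leading $\lambda \log n$ term of the free energy with $\lambda = d/2$ in the regular case and a smaller coefficient at the degenerate minimum.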
Thanks for the answer! I think my first question was confused because I didn’t realize you were talking about local free energies instead of the global one :)
As discussed in the comment in your DSLT1 question, they are both singularities of since they are both critical points (local minima).
Oh, I actually may have missed that aspect of your answer back then. I’m confused by that: in algebraic geometry, the zeros of a set of polynomials are not necessarily already singularities. E.g., in $\mathbb{R}^2$, the zero set of $xy$ consists of the two axes, which form an algebraic variety, but only at the origin $(0, 0)$ is there a singularity because the derivative vanishes there.
Now, for the KL-divergence, the situation seems more extreme: The zeros are also, at the same time, the minima of , and thus, the derivative vanishes at every point in the set . This suggests every point in is singular. Is this correct?
So far, I thought “being singular” means the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it’s about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
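The contrast drawn above can be checked on a toy pair of functions. In this sketch I assume the polynomial in the two-axes example is $f(x, y) = xy$, and I use $K(x, y) = (xy)^2$ as a hypothetical stand-in for a non-negative KL-like function with the same zero set; neither choice is confirmed by the posts.

```python
def f(x, y):
    return x * y

def grad_f(x, y):
    # Gradient of f(x, y) = x * y.
    return (y, x)

def K(x, y):
    return (x * y) ** 2

def grad_K(x, y):
    # Gradient of K(x, y) = (x * y)^2.
    return (2 * x * y * y, 2 * x * x * y)

# A few points on the shared zero set (the two coordinate axes).
zero_set_points = [(0, 0), (1, 0), (-2, 0), (0, 3), (0, -0.5)]
assert all(f(*p) == 0 and K(*p) == 0 for p in zero_set_points)

# Points of the zero set where the gradient vanishes.
critical_for_f = [p for p in zero_set_points if grad_f(*p) == (0, 0)]
critical_for_K = [p for p in zero_set_points if grad_K(*p) == (0, 0)]

print("gradient of f vanishes at:", critical_for_f)
print("gradient of K vanishes at:", critical_for_K)
```

For $f$ the gradient vanishes only at the origin, while for the squared function it vanishes at every sampled point of the zero set, which is the "more extreme" situation described above.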
The uninteresting answer is that SLT doesn’t care about the prior (other than its regularity conditions) since it is irrelevant in the limit.
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as the Jeffreys prior somewhat changes asymptotic behavior for , but I’m not certain of that.
Thanks also for this post! I enjoy reading the sequence and look forward to post 5 on the connections to alignment :)
At some critical value , we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error .
“Discontinuity” might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these “sudden changes” happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren’t related to the phase transitions predicted by SLT?
There is, however, one fundamentally different kind of “phase transition” that we cannot explain easily with SLT: a phase transition of SGD in time, i.e. the number of gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time; the closest quantity is the number of datapoints , but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice.
As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?
In general, it seems to me that we’re probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?
Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?
Which altered the posterior geometry, but not that of since (up to a normalisation factor).
I didn’t understand this footnote.
but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.
Hhm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the “true” vector or not.
Are you maybe trying to say the following? The truth determines which parameter vectors are preferred by the free energy, e.g. those close to the truth. For some truths, we will have more symmetries around the truth, and thus lower RLCT for regions preferred by the posterior.
We will use the label weight annihilation phase to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another.
It seems to me that in the other phase, the weights also annihilate each other, so the “non-weight annihilation phase” is a somewhat weird terminology. Or did I miss something?
The weight annihilation phase is never preferred by the posterior
I think there is a typo and you meant .
Thanks Liam also for this nice post! The explanations were quite clear.
The property of being singular is specific to a model class , regardless of the underlying truth.
This holds for singularities that come from symmetries where the model doesn’t change. However, is it correct that we need the “underlying truth” to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.
Both configurations, non-weight-annihilation (left) and weight-annihilation (right)
What do you mean by non-weight-annihilation here? Don’t the weights annihilate in both pictures?
In particular, it is the singularities of these minimum-loss sets — points at which the tangent is ill-defined — that determine generalization performance.
To clarify: there is not necessarily a problem with the tangent, right? E.g., the function has a singularity at because the second derivative vanishes there, but the tangent is defined. I think for the same reason, some of the pictures may be misleading to some readers.
A model, , parametrized by weights , where is compact;
Why do we want compactness? Neural networks are parameterized in a non-compact set. (Though I guess usually, if things go well, the weights don’t blow up. So in that sense it can maybe be modeled to be compact)
The empirical Kullback-Leibler divergence is just a rescaled and shifted version of the negative log likelihood.
I think it is only shifted, and not also rescaled, if I’m not missing something.
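Whether a rescaling appears depends on the convention for the negative log likelihood. The sketch below assumes both quantities are per-sample averages, a unit-variance Gaussian model $p(x|w) = \mathcal{N}(w, 1)$, and a truth $q = \mathcal{N}(0, 1)$; all of these are illustrative choices of mine. Under those assumptions the two differ only by the empirical entropy, which does not depend on $w$. If the negative log likelihood is instead defined as a sum over samples, a factor of $1/n$ appears as well.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)  # samples from the assumed truth N(0, 1)

def log_normal_pdf(x, mean):
    """Log density of a unit-variance Gaussian."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mean) ** 2

def average_nll(w):
    """Per-sample negative log likelihood of the model N(w, 1)."""
    return -np.mean(log_normal_pdf(x, w))

def empirical_kl(w):
    """Empirical KL divergence between the truth and the model N(w, 1)."""
    return np.mean(log_normal_pdf(x, 0.0) - log_normal_pdf(x, w))

# Empirical entropy of the truth on this sample; independent of w.
empirical_entropy = -np.mean(log_normal_pdf(x, 0.0))

ws = np.linspace(-2.0, 2.0, 9)
gaps = np.array([average_nll(w) - empirical_kl(w) for w in ws])

print("NLL minus empirical KL at each w:", np.round(gaps, 6))
print("empirical entropy:", round(float(empirical_entropy), 6))
```

The gap is the same constant at every $w$, so under the averaged convention the empirical KL divergence is a pure shift of the negative log likelihood.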
But these predictions of “generalization error” are actually a contrived kind of theoretical device that isn’t what we mean by “generalization error” in the typical ML setting.
Why is that? I.e., in what way is the generalization error different from what ML people care about? Because real ML models don’t predict using an updated posterior over the parameter space? (I was just wondering if there is a different reason I’m missing)
Thanks for the answer mfar!
Yeah I remember also struggling to parse this statement when I first saw it. Liam answered but in case it’s still not clear and/or someone doesn’t want to follow up in Liam’s thesis, is a free variable, and the condition is talking about linear dependence of functions of .
Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let so that and . Then let and be functions such that and . Then the set of functions is a linearly dependent set of functions because .
Thanks! Apparently the proof of the thing I was wondering about can be found in Lemma 3.4 in Liam’s thesis. Also thanks for your other comments!
Thanks for the answer Liam! I especially liked the further context on the connection between Bayesian posteriors and SGD. Below a few more comments on some of your answers:
The partition function is equal to the model evidence , yep. It isn’t equal to (I assume is fixed here?) but is instead expressed in terms of the model likelihood and prior (and can simply be thought of as the “normalising constant” of the posterior),
and then under this supervised learning setup where we know , we have . Also note that this does “factor over ” (if I’m interpreting you correctly) since the data is independent and identically distributed.
I think I still disagree. I think everything in these formulas needs to be conditioned on the -part of the dataset. In particular, I think the notation is slightly misleading, but maybe I’m missing something here.
I’ll walk you through my reasoning: When I write or , I mean the whole vectors, e.g., . Then I think the posterior computation works as follows:
That is just Bayes rule, conditioned on in every term. Then, because from alone you don’t get any new information about the conditional (A more formal way to see this is to write down the Bayesian network of the model and to see that and are d-separated). Also, conditioned on , is independent over data points, and so we obtain
So, comparing with your equations, we must have Do you think this is correct?
Btw., I still don’t think this “factors over ”. I think that
The reason is that old data points should inform the parameter , which should have an influence on future updates. I think the independence assumption only holds for the true distribution and the model conditioned on .
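For readers following along, the argument above can be written out as follows. This is a sketch in notation I am assuming ($X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ for the full input and output vectors, $\varphi$ for the prior, $Z_n$ for the partition function), so it may not match the symbols of the original comment exactly.

```latex
\begin{align*}
p(w \mid X, Y)
  &= \frac{p(Y \mid X, w)\, p(w \mid X)}{p(Y \mid X)}
   = \frac{\varphi(w) \prod_{i=1}^{n} p(y_i \mid x_i, w)}{p(Y \mid X)}, \\
Z_n = p(Y \mid X)
  &= \int \varphi(w) \prod_{i=1}^{n} p(y_i \mid x_i, w)\, dw, \\
p(Y \mid X)
  &\neq \prod_{i=1}^{n} p(y_i \mid x_i) \quad \text{in general.}
\end{align*}
```

The first line uses $p(w \mid X) = \varphi(w)$ and conditional independence of the outputs given $w$ and the inputs. The last line fails to be an equality because integrating over $w$ couples the data points.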
If you expand that term out you find that
because the second integral is the first central moment of a Gaussian. The derivative of the prior is irrelevant.
Right, that makes sense, thank you! (I think you missed a factor of , but that doesn’t change the conclusion)
Thanks also for the corrected volume formula, it makes sense now :)
Thanks for this nice post! I find it slightly more vague than the first post, but I guess that is hard to avoid when trying to distill highly technical topics. I got a lot out of it.
Fundamentally, we care about the free energy because it is a measure of posterior concentration, and as we showed with the BIC calculation in DSLT1, it tells us something about the information geometry of the posterior.
Can you say more about why it is a measure of posterior concentration? (It gets a bit clearer further below, but I state my question nonetheless to express that this statement isn’t locally clear to me here.) I may lack some background in Bayesian statistics here. In the first post, you wrote the posterior as
and it seems like you want to say that if free energy is low, then the posterior is more concentrated. If I look at this formula, then low free energy corresponds to high , meaning the prior and likelihood have to “work quite a bit” to ensure that this expression overall integrates to . Are you claiming that most of that work happens very localized in a small parameter region?
Additionally, I am not quite sure what you mean by “it tells us something about the information geometry of the posterior”, or even what you mean by “information geometry” here. I guess one answer is that you showed in post 1 that the Fisher information matrix appears in the formula for the free energy, which contains geometric information about the loss landscape. But then in the proof, you regarded that as a constant that you ignored in the final BIC formula, so I’m not sure if that’s what you are referring to here. More explicit references would be useful to me.
Since there is a correspondence
we say the posterior prefers a region when it has low free energy relative to other regions of .
Note to other readers (as this wasn’t clear to me immediately): That correspondence holds because one can show that
Here, is the global partition function.
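A sketch of the identity behind this correspondence, in notation I am assuming rather than quoting: $\mathcal{W}$ is a region of parameter space, $Z_n(\mathcal{W})$ its local partition function, $F_n(\mathcal{W})$ its local free energy, $L_n$ the per-sample negative log likelihood, and $Z_n$ the global partition function. The definitions should be checked against the post itself.

```latex
\begin{align*}
Z_n(\mathcal{W}) &= \int_{\mathcal{W}} \varphi(w)\, e^{-n L_n(w)}\, dw,
\qquad
F_n(\mathcal{W}) = -\log Z_n(\mathcal{W}), \\
P(w \in \mathcal{W} \mid D_n)
  &= \int_{\mathcal{W}} p(w \mid D_n)\, dw
   = \frac{Z_n(\mathcal{W})}{Z_n}
   = \frac{e^{-F_n(\mathcal{W})}}{Z_n}.
\end{align*}
```

Since $Z_n$ does not depend on the region, a lower local free energy $F_n(\mathcal{W})$ is the same thing as a higher posterior mass on $\mathcal{W}$.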
The Bayes generalisation loss is then given by
I believe the first expression should be an expectation over .
It follows immediately that the generalisation loss of a region is
I didn’t find a definition of the left expression.
So, the region in that minimises the free energy has the best accuracy-complexity tradeoff. This is the sense in which singular models obey Occam’s Razor: if two regions are equally accurate, then they are preferred according to which is the simpler model.
Purposefully naive question: can I just choose a region that contains all singularities? Then it surely wins, but this doesn’t help us because this region can be very large.
So I guess you also want to choose small regions. You hinted at that already by saying that should be compact. But now I of course wonder if sometimes just all of lies within a compact set.
There are two singularities in the set of true parameters,
which we will label as and respectively.
Possible correction: one of those points isn’t a singularity, but a regular loss-minimizing point (as you also clarify further below).
Let’s consider a one parameter model with KL divergence defined by
on the region with uniform prior
The prior seems to do some work here: if it doesn’t properly support the region with low RLCT, then the posterior cannot converge there. I guess a similar story might a priori hold for SGD, where how you initialize your neural network might matter for convergence.
How do you think about this? What are sensible choices of priors (or network initializations) from the SLT perspective?
Also, I find it curious that in the second example, the posterior will converge to the lowest loss, but SGD would not since it wouldn’t “manage to get out of the right valley”, I assume. This seems to suggest that the Bayesian view of SGD can at most be true in high dimensions, but not for very low-dimensional neural networks. Would you agree with that, or what is your perspective?
Thank you for this wonderful article! I read it fairly carefully and have a number of questions and comments.
where the partition function (or in Bayesian terms the evidence) is given by
Should I think of this as being equal to , and would you call this quantity ? I was a bit confused since it seems like we’re not interested in the data likelihood, but only the conditional data likelihood under model .
And to be clear: This does not factorize over because every data point informs and thereby the next data point, correct?
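The non-factorization can be seen exactly in a toy model. The sketch below drops the inputs and assumes a Bernoulli likelihood with a uniform prior on its success probability, which is my own illustrative choice and not the regression model of the post. In that setting the evidence of a binary sequence with $k$ ones out of $n$ is $k!\,(n-k)!/(n+1)!$.

```python
from fractions import Fraction
from math import factorial

def evidence(ys):
    """Exact marginal likelihood of a binary sequence under a uniform prior on theta.

    Integrates theta^k * (1 - theta)^(n - k) over [0, 1], giving k! (n - k)! / (n + 1)!.
    """
    n, k = len(ys), sum(ys)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

joint = evidence([1, 1])                              # p(y1 = 1, y2 = 1)
product_of_marginals = evidence([1]) * evidence([1])  # p(y1 = 1) * p(y2 = 1)
predictive = joint / evidence([1])                    # p(y2 = 1 | y1 = 1)

print("joint evidence:      ", joint)
print("product of marginals:", product_of_marginals)
print("posterior predictive:", predictive)
```

The joint evidence is 1/3 while the product of the marginals is 1/4, and the predictive probability of a second one rises from 1/2 to 2/3 after seeing the first. The likelihood still factorizes once the parameter is fixed; the coupling comes only from integrating over it.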
The learning goal is to find small regions of parameter space with high posterior density, and therefore low free energy.
But the free energy does not depend on the parameter, so how should I interpret this claim? Are you already one step ahead and thinking about the singular case where the loss landscape decomposes into different “phases” with their own free energy?
there is almost sure convergence as to a constant that doesn’t depend on , [5]
I think the first expression should either be an expectation over , or have the conditional entropy within the parentheses.
In the realisable case where , the KL divergence is just the euclidean distance between the model and the truth adjusted for the prior measure on inputs,
I briefly tried showing this and somehow failed. I didn’t quite manage to get rid of the integral over . Is this simple? (You don’t need to show me how it’s done, but maybe mentioning the key idea could be useful)
A regular statistical model class is one which is identifiable (so implies that ), and has positive definite Fisher information matrix for all .
The rest of the article seems to mainly focus on the case of the Fisher information matrix. In particular, you didn’t show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.
Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren’t that interesting, and so you maybe don’t even want to call them singular? I found this slightly ambiguous, also because under your definitions further down, it seems like “singular” (degenerate Fisher information matrix) is a stronger condition than “strictly singular” (degenerate Fisher information matrix OR non-injective map from parameters to distributions).
It can be easily shown that, under the regression model, is degenerate if and only if the set of derivatives
is linearly dependent.
What is in this formula? Is it fixed? Or do we average the derivatives over the input distribution?
Since every true parameter is a degenerate singularity[9] of , it cannot be approximated by a quadratic form.
Hhm, I thought having a singular model just means that some singularities are degenerate.
One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse’s post, they often “look singular”: i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn’t seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change. Is this correct?
We can Taylor expand the NLL as
I think you forgot a in the term of degree 1.
In that case, the second term involving vanishes since it is the first central moment of a normal distribution
Could you explain why that is? I may have missed some assumption on or not paid attention to something.
In this case, since for all , we could simply throw out the free parameter and define a regular model with parameters that has identical geometry , and therefore defines the same input-output function, .
Hhm. Is the claim that if the loss of the function does not change along some curve in the parameter space, then the function itself remains invariant? Why is that?
Then the dimension arises as the scaling exponent of , which can be extracted via the following ratio of volumes formula for some :
This scaling exponent, it turns out, is the correct way to think about dimensionality of singularities.
Are you sure this is the correct formula? When I tried computing this by hand it resulted in , but maybe I made a mistake.
General unrelated question: is the following a good intuition for the correspondence of the volume with the effective number of parameters around a singularity? The larger the number of effective parameters around , the more blows up around in all directions because we get variation in all directions, and so the smaller the region where is below . So contributes to this volume. This is in fact what it does in the formulas, by being an exponent for small .
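This intuition can be tested numerically on toy functions. The sketch below assumes the volume of the sublevel set scales as $V(\epsilon) \propto \epsilon^{\lambda}$ without logarithmic corrections, and estimates $\lambda$ as $\log\bigl(V(a\epsilon)/V(\epsilon)\bigr) / \log a$ by counting grid points. This ratio form is my own assumption and may differ from the formula in the post; the function `volume_exponent` and the example divergences are hypothetical.

```python
import numpy as np

def volume_exponent(K_values, eps, a):
    """Estimate lambda from V(a * eps) / V(eps) ~ a ** lambda.

    V is approximated by counting grid points with K below the threshold;
    the grid cell volume cancels in the ratio.
    """
    count_eps = np.count_nonzero(K_values < eps)
    count_scaled = np.count_nonzero(K_values < a * eps)
    return np.log(count_scaled / count_eps) / np.log(a)

# One parameter on [-1, 1].
w = np.linspace(-1.0, 1.0, 2_000_001)
lam_quadratic = volume_exponent(w**2, eps=1e-2, a=0.25)  # regular, expect d/2 = 1/2
lam_quartic = volume_exponent(w**4, eps=1e-4, a=1 / 16)  # degenerate, expect 1/4

# Two parameters on [-1, 1]^2, regular case K = w1^2 + w2^2.
g = np.linspace(-1.0, 1.0, 2001)
w1, w2 = np.meshgrid(g, g)
lam_2d = volume_exponent(w1**2 + w2**2, eps=4e-2, a=0.25)  # regular, expect d/2 = 1

print(f"K = w^2:         lambda ~ {lam_quadratic:.3f}")
print(f"K = w^4:         lambda ~ {lam_quartic:.3f}")
print(f"K = w1^2 + w2^2: lambda ~ {lam_2d:.3f}")
```

More effective directions of variation give a larger exponent, so the sublevel volume shrinks faster as $\epsilon \to 0$, while the flatter quartic minimum keeps more volume and has a smaller exponent.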
So, in this case the global RLCT is , which we will see in DSLT2 means that the posterior is most concentrated around the singularity .
Do you currently expect that gradient descent will do something similar, where the parameters will move toward singularities with low RLCT? What’s the state of the theory regarding this? (If this is answered in later posts, feel free to just refer to them)
Also, I wonder whether this could be studied experimentally even if the theory is not yet ready: one could probably measure the RLCT around minimal loss points by measuring volumes, and then just check whether gradient descent actually ends up in low-RLCT regions. Maybe this is what you do in later posts. If this is the case, I wonder whether I should be surprised or not: it seems like the lower the RLCT, the larger the number of (fractional) directions where the loss is minimal, and so the larger the basin. So for purely statistical reasons, one may end up in such a region instead of isolated loss-minimizing points of high RLCT.
Andrew Ng wants to have a conversation about extinction risk from AI
https://twitter.com/ai_risks/status/1664323278796898306?s=46&t=umU0Z29c0UEkNxkJx-0kaQ
Apparently Bill Gates signed.
Stating the obvious: Do we expect that Bill Gates will donate money to prevent the extinction from AI?
It’s great to see Yoshua Bengio and other eminent AI scientists like Geoffrey Hinton actively engage in the discussion around AI alignment. He evidently put a lot of thought into this. There is a lot I agree with here.
Below, I’ll discuss two points of disagreement or where I’m surprised by his takes, to highlight potential topics of discussion, e.g. if someone wants to engage directly with Bengio.
Most of the post is focused on the outer alignment problem—how do we specify a goal aligned with our intent—and seems to ignore the inner alignment problem—how do we ensure that the specified goal is optimized for.
E.g., he makes an example of us telling the AI to fix climate change, after which the AI wipes out humanity since that fixes climate change more effectively than respecting our implicit constraints of which the AI has no knowledge. In fact, I think language models show that there may be quite some hope that AI models will understand our implicit intent. Under that view, the problem lies at least as much in ensuring that the AI cares.
He also extensively discusses the wireheading problem of entities (e.g., humans, corporations, or AI systems) that try to maximize their reward signal. I think we have reasons to believe that wireheading isn’t as much of a concern: inner misalignment will cause the agent to have some other goal than the precise maximization of the reward function, and once the agent is situationally aware, it has incentives to keep its goals from changing by gradient descent.
He does discuss the fact that our brains reward us for pleasure and avoiding pain, which is misaligned with the evolutionary goal of genetic fitness. In the alignment community, this is most often discussed as an inner alignment issue between the “reward function” of evolution and the “trained agent” being our genomes. However, his discussion highlights that he seems to view it as an outer alignment issue between evolution and our reward signals in the brain, which shape our adult brains through in-lifetime learning. This is also the viewpoint in Brain-Like-AGI Safety, as far as I remember, and also seems related to viewpoints discussed in shard theory.
“In fact, over two decades of work in AI safety suggests that it is difficult to obtain AI alignment [wikipedia], so not obtaining it is clearly possible.”
I agree with the conclusion, but I am surprised by the argument. It is true that we have seen over two decades of alignment research, but the alignment community has been fairly small all this time. I’m wondering what a much larger community could have done.
Yoshua Bengio was on David Krueger’s PhD thesis committee, according to David’s CV.
Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios
After filling out the form, I could click on “see previous responses”, which allowed me to see the responses of all other people who have filled out the form so far.
That is probably not intended?
MATS mentorships are often weekly, but only for limited time, unlike PhD programs that offer mentorship for several years. These years are probably often necessary to develop good research taste.