I’m a PhD student at the University of Amsterdam. I have research experience in multivariate information theory and equivariant deep learning, and I recently got very interested in AI alignment. https://langleon.github.io/
Leon Lang
One question: Do you think the Chinchilla scaling laws are still correct today, or not? I would assume these scaling laws depend on the data set used in training, so if OpenAI found or created a better data set, the scaling laws might change.
Do you agree with this, or do you think it’s false?
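(For reference, the parametric fit from the Chinchilla paper has, as far as I remember, roughly the form
$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$
with $N$ the number of parameters and $D$ the number of training tokens; the constants $E, A, B, \alpha, \beta$ are fit to a particular training distribution, so a better data set would presumably change them.)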
https://x.com/sama/status/1813984927622549881
According to Sam Altman, GPT-4o mini is much better than text-davinci-003 was in 2022, but 100 times cheaper. In general, we see increasing competition to produce smaller-sized models with great performance (e.g., Claude Haiku and Sonnet, Gemini 1.5 Flash and Pro, maybe even the full-sized GPT-4o itself). I think this trend is worth discussing. Some comments (mostly just quick takes) and questions I’d like to have answers to:
Should we expect this trend to continue? How much further efficiency gain is possible? Can we expect another 100x efficiency gain in the coming years? Andrej Karpathy expects that we might see a GPT-2 sized “smart” model.
What’s the technical driver behind these advancements? Andrej Karpathy thinks it is based on synthetic data: Larger models curate new, better training data for the next generation of small models. Might there also be architectural changes? Inference tricks? Which of these advancements can continue?
Why are companies pushing into small models? I think in hindsight, this seems easy to answer, but I’m curious what others think: If you have a GPT-4 level model that is much, much cheaper, then you can sell the service to many more people and deeply integrate your model into lots of software on phones, computers, etc. I think this has many desirable effects for AI developers:
Increase revenue, motivating investments into the next generation of LLMs
Increase market-share. Some integrations are probably “sticky” such that if you’re first, you secure revenue for a long time.
Make many people “aware” of potential use cases of even smarter AI so that they’re motivated to sign up for the next generation of more expensive AI.
The company’s inference compute is probably limited (especially for OpenAI, as the market leader), and not many people are convinced to pay a large amount for very intelligent models, meaning that these reasons outweigh the reasons to publish larger models instead (or in addition).
What does all this mean for the next generation of large models?
Should we expect that efficiency gains in small models translate into efficiency gains in large models, such that a future model with the cost of text-davinci-003 is massively more capable than today’s SOTA? If Andrej Karpathy is right that the small model’s capabilities come from synthetic data generated by larger, smart models, then it’s unclear to me whether one can train SOTA models with these techniques, as this might require an even larger model to already exist.
At what point does it become worthwhile for e.g. OpenAI to publish a next-gen model? I’d guess you can still do a lot of “penetration of small-model use cases” in the next 1-2 years, leading to massive revenue increases without necessarily releasing a next-gen model.
Do the strategies differ for different companies? OpenAI is the clear market leader, so possibly they can penetrate the market further without first making a “bigger name for themselves”. In contrast, I could imagine that for a company like Anthropic, it’s much more important to get out a clear SOTA model that impresses people and makes them aware of Claude. I thus currently (weakly) expect Anthropic to more strongly push in the direction of SOTA than OpenAI.
I went to this event in 2022 and it was lovely. Will come again this year. I recommend coming!
Thanks for the answer!
But basically, by “simple goals” I mean “goals which are simple to represent”, i.e. goals which have highly compressed representations
It seems to me you are using “compressed” in two very different meanings in part 1 and 2. Or, to be fairer, I interpret the meanings very differently.
Let me try to make my view of things more concrete:
Compressed representations: A representation is a function $f$ from observations of the world state (or sequences of such observations) into a representation space of “features”. That this is “compressed” means (a) that in $f(x)$, only a small number of features are active at any given time, and (b) that this small number of features is still sufficient to predict/act in the world.
Goals building on compressed representations: A goal is a (maybe linear) function $g$ from the representation space into the real numbers. The goal “likes” some features and “dislikes” others. (Or, if it is not entirely linear, it may like/dislike some simple combinations/compositions of features.)
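A minimal toy sketch of what I mean by these two definitions (all names, shapes, and numbers here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_features = 20, 1000
W_repr = rng.normal(size=(n_features, n_obs))  # toy parameters of the representation f

def representation(observation: np.ndarray) -> np.ndarray:
    """Toy representation f: observation -> sparse feature vector.

    'Compressed' in my sense: for any input, only a handful of the
    n_features entries are non-zero (here: the 5 strongest activations).
    """
    pre = W_repr @ observation
    features = np.zeros(n_features)
    top = np.argsort(-np.abs(pre))[:5]  # keep only a few active features
    features[top] = pre[top]
    return features

# A goal g: a linear function on the representation space
# ("likes" some features, "dislikes" others).
goal_weights = rng.normal(size=n_features)

def goal(features: np.ndarray) -> float:
    return float(goal_weights @ features)

# The composition g ∘ f that I read part 2 as being about:
observation = rng.normal(size=n_obs)
print(goal(representation(observation)))
```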
It seems to me that in part 2 of your post, you view goals as compositions . Part 1 says that is highly compressed. But it’s totally unclear to me why the composition should then have the simplicity properties you claim in part 2, which in my mind don’t connect with the compression properties of as I just defined them.
A few more thoughts:
The notion of “simplicity” in part 2 seems to be about how easy it is to represent a function—i.e., the space of parameters with which the function is represented is simple in your part 2.
The notion of “compression” in part 1 seems to be about how easy it is to represent an input—i.e., is there a small number of features such that their activation tells you the important things about the input?
These notions of simplicity and compression are very different. Indeed, if you have a highly compressed representation $f$ as in part 1, I’d guess that $f$ necessarily lives in a highly complex space of possible functions with many parameters, thus the opposite of what seems to be going on in part 2.
This is largely my fault since I haven’t really defined “representation” very clearly, but I would say that the representation of the concept of a dog should be considered to include e.g. the neurons representing “fur”, “mouth”, “nose”, “barks”, etc. Otherwise if we just count “dog” as being encoded in a single neuron, then every concept encoded in any neuron is equally simple, which doesn’t seem like a useful definition.
(To put it another way: the representation is the information you need to actually do stuff with the concept.)
I’m confused. Most of the time, when seeing a dog, most of what I need is actually just to know that it is a “dog”, so this is totally sufficient to do something with the concept. E.g., if I walk on the street and wonder “will this thing bark?”, then knowing “my dog neuron activates” is almost enough.
I’m confused for a second reason: It seems like here you want to claim that the “dog” representation is NOT simple (since it contains “fur”, “mouth”, etc.). However, the “dog” representation needs lots of intelligence and should thus come along with compression, and if you equate compression and simplicity, then it seems to me like you’re not consistent. (I feel a bit awkward saying “you’re not consistent”, but I think it’s probably good if I state my honest state of mind at this moment).
To clarify my own position, in line with my definition of compression further above: I think that whether a representation is simple/compressed is NOT a property of a single input-output relation (like “pixels of a dog get mapped to the dog-neuron being activated”), but instead a property of the whole FUNCTION that maps inputs to representations. This function is compressed if, for any given input, only a small number of neurons in the last layer activate, and if these can be used (ideally in a linear way) for further predictions and for evaluating goal-achievement.
I agree that most people who say they are hedonic utilitarians are not 100% committed to hedonic utilitarianism. But I still think it’s very striking that they at least somewhat care about making hedonium. I claim this provides an intuition pump for how AIs might care about squiggles too.
Okay, I agree with this, fwiw. :) (Though I may not necessarily agree with claims about how this connects to the rest of the post)
Thanks for the post!
a. How exactly do 1 and 2 interact to produce 3?
I think the claim is along the lines of “highly compressed representations imply simple goals”, but the connection between compressed representations and simple goals has not been argued, unless I missed it. There’s also a chance that I simply misunderstand your post entirely.

b. I don’t agree with the following argument:
Decomposability over space. A goal is decomposable over space if it can be evaluated separately in each given volume of space. All else equal, a goal is more decomposable if it’s defined over smaller-scale subcomponents, so the most decomposable goals will be defined over very small slices of space—hence why we’re talking about molecular squiggles. (By contrast, you can’t evaluate the amount of higher-level goals like “freedom” or “justice” in a nanoscale volume, even in principle.)
The classical ML-algorithm that evaluates features separately in space is a CNN. That doesn’t mean that features in CNNs look for tiny structures, though: The deeper into the CNN you are, the more complex the features get. Actually, deep CNNs are an example of what you describe in argument 1: The features in later layers of CNNs are highly compressed, and may tell you binary information such as “is there a dog”, but they apply to large parts of the input image.
Therefore, I’d also expect that what an AGI would care about are ultimately larger-scale structures since the AGI’s features will nontrivially depend on the interaction of larger parts in space (and possibly time, e.g. if the AGI evaluates music, movies, etc.).
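To back up the point about later-layer CNN features depending on large input regions, here is a quick receptive-field calculation (the layer stack below is a made-up, VGG-like example, not from any particular model):

```python
def receptive_field(layers):
    """Receptive field of the last layer for a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1  # field size in input pixels, and cumulative stride
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Five 3x3 convs with 2x2/stride-2 pooling in between (a small VGG-like stack):
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(stack))  # 78 -> a late feature already "sees" a 78x78 patch
```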
c. I think this leaves open the question of why philosophers end up favoring the analog of squiggles when they become hedonic utilitarians. I’d argue that the premise may be false, since it’s unclear to me how what philosophers say they care about (“hedonium”) connects with what they actually care about (e.g., maybe they still listen to complex music, build a family, build status through philosophical argumentation, etc.)
You should all be using the “Google Scholar PDF reader extension” for Chrome.
Features I like:
References are linked and clickable
You get a table of contents
You can move back after clicking a link with Alt+left
Screenshot:
I guess (but don’t know) that most people who downvoted Garrett’s comment overupdated on intuitive explanations of singular learning theory, not realizing that entire books with novel and nontrivial mathematical theory have been written on it.
I do all of these except 3, and implementing a system like 3 is among the deprioritized items on my to-do list. Maybe I should prioritize it.
Yes the first! Thanks for the link!
I really enjoyed reading this post! It’s quite well-written. Thanks for writing it.
The only critique is that I would have appreciated more details on how the linear regression parameters are trained and what exactly the projection is doing. John’s thread is a bit clarifying on this.

One question: If you optimize the representation in the residual stream such that it corresponds to a particular chosen belief state, does the transformer then predict the next token as if in that belief state? I.e., does the transformer use the belief state for making predictions?
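For concreteness, this is the kind of linear regression I was imagining when reading the post (my own sketch with made-up shapes and data, not the post’s actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))         # residual-stream activations (made up)
beliefs = rng.dirichlet(np.ones(3), 1000)  # ground-truth belief states (made up)

# Affine least-squares fit: beliefs ≈ [acts, 1] @ coef
X = np.hstack([acts, np.ones((1000, 1))])
coef, *_ = np.linalg.lstsq(X, beliefs, rcond=None)
projected = X @ coef  # the learned projection of activations into belief-state space
```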
MATS mentorships are often weekly, but only for a limited time, unlike PhD programs, which offer mentorship for several years. These years are probably often necessary to develop good research taste.
(Fwiw, I don’t remember problems with stipend payout at SERI MATS in the winter program. I was a winter scholar 2022/23.)
We Should Prepare for a Larger Representation of Academia in AI Safety
This is very helpful, thanks! Actually, the post includes several sections, including in the appendix, that might be more interesting to many readers than the grant recommendations themselves. Maybe it would be good to change the title a bit so that people also expect other updates.
Thanks for the reply!
As I show in the examples in DSLT1, having degenerate Fisher information (i.e. degenerate Hessian at zeroes) comes in two essential flavours: having rank-deficiency, and having vanishing second-derivative (i.e. $K''(w) = 0$). Precisely, suppose $d$ is the number of parameters, then you are in the regular case if $K(w)$ can be expressed as a full-rank quadratic form near each singularity,
$$K(w) = \sum_{j=1}^{d} w_j^2.$$
Anything less than this is a strictly singular case.
So if $K(w) = \sum_{j=1}^{d} w_j^2$, then $w = 0$ is a singularity but not a strict singularity, do you agree? It still feels like somewhat bad terminology to me, but maybe it’s justified from the algebraic-geometry perspective.
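To check my understanding of the two flavours, these are the toy examples I have in mind:
$$K(w_1, w_2) = w_1^2 \quad (\text{Hessian of rank } 1 < d = 2), \qquad \tilde{K}(w) = w^4 \quad (\tilde{K}''(0) = 0),$$
neither of which can be written as a full-rank quadratic form near its zero set.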
Zeta Functions in Singular Learning Theory
In this shortform, I very briefly explain my understanding of how zeta functions play a role in the derivation of the free energy in singular learning theory. This is entirely based on slide 14 of the SLT low 4 talk of the recent summit on SLT and Alignment, so feel free to ignore this shortform and simply watch the video.
The story is this: we have a prior $\varphi(w)$, a model $p(x \mid w)$, and there is an unknown true distribution $q(x)$. For model selection, we are interested in the evidence of our model for a data set $D_n = (X_1, \dots, X_n)$, which is given by
$$Z_n = \int_W \varphi(w)\, e^{-n K_n(w)}\, dw,$$
where $K_n(w) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(X_i)}{p(X_i \mid w)}$ is the empirical KL divergence. In fact, we are interested in selecting the model that maximizes the average of this quantity over all data sets. The average (obtained by replacing $K_n$ with its mean) is then given by
$$\bar{Z}_n = \int_W \varphi(w)\, e^{-n K(w)}\, dw,$$
where $K(w) = \int q(x) \log \frac{q(x)}{p(x \mid w)}\, dx$ is the Kullback-Leibler divergence.
But now we have a problem: how do we compute this integral? Computing this integral is what the free energy formula is about.
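(As far as I understand, the upshot of Watanabe’s free energy formula for this quantity is the asymptotic expansion
$$-\log \bar{Z}_n = \lambda \log n - (m - 1) \log \log n + O(1), \quad n \to \infty,$$
where $\lambda$ is the real log canonical threshold (RLCT) and $m$ its multiplicity.)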
The answer: by computing a different integral. So now, I’ll explain the connection to different integrals we can draw.
Let
$$v(t) = \int_W \varphi(w)\, \delta(t - K(w))\, dw,$$
which is called the state density function. Here, $\delta$ is the Dirac delta function. For different $t$, it measures the density of states (= parameter vectors) that have $K(w) = t$. It is thus a measure for the “size” of different level sets. This state density function is connected to two different things.
Laplace Transform to the Evidence
First of all, it is connected to the evidence above. Namely, let $\mathcal{L}[v]$ be the Laplace transform of $v$. It is a function given by
$$\mathcal{L}[v](n) = \int_0^{\infty} v(t)\, e^{-nt}\, dt = \int_W \varphi(w) \int_0^{\infty} \delta(t - K(w))\, e^{-nt}\, dt\, dw = \int_W \varphi(w)\, e^{-n K(w)}\, dw.$$
In the first step, we changed the order of integration, and in the second step we used the defining property of the Dirac delta. Great, so this tells us that $\mathcal{L}[v](n) = \bar{Z}_n$! So this means we essentially just need to understand $v$.
Mellin Transform to the Zeta Function
But how do we compute $v$? By using another transform. Let $\zeta$ be the Mellin transform of $v$. It is a function $\zeta: \mathbb{C} \to \mathbb{C}$ (or maybe only defined on part of $\mathbb{C}$?) given by
$$\zeta(z) = \int_0^{\infty} v(t)\, t^z\, dt = \int_W \varphi(w) \int_0^{\infty} \delta(t - K(w))\, t^z\, dt\, dw = \int_W \varphi(w)\, K(w)^z\, dw.$$
Again, we used a change in the order of integration and then the defining property of the Dirac delta. This is called a zeta function.
What’s this useful for?
The Mellin transform has an inverse. Thus, if we can compute the zeta function, we can also compute the original evidence as
$$\bar{Z}_n = \mathcal{L}\big[\mathcal{M}^{-1}[\zeta]\big](n).$$
Thus, we essentially changed our problem to the problem of studying the zeta function $\zeta$. To compute the integral in the zeta function, it is then useful to perform blowups to resolve the singularities in the set of minima of $K$, which is where algebraic geometry enters the picture. For more on all of this, I refer, again, to the excellent SLT low 4 talk of the recent summit on singular learning theory.
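A tiny worked example (my own, not from the talk): take one parameter, the uniform prior $\varphi(w) = \tfrac{1}{2}$ on $W = [-1, 1]$, and $K(w) = w^2$. Then
$$v(t) = \int_{-1}^{1} \tfrac{1}{2}\, \delta(t - w^2)\, dw = \frac{1}{2\sqrt{t}} \ \ (0 < t < 1), \qquad \zeta(z) = \int_0^1 \frac{t^z}{2\sqrt{t}}\, dt = \frac{1}{2z + 1}.$$
The single pole of $\zeta$ at $z = -\tfrac{1}{2}$ gives $\lambda = \tfrac{1}{2}$ with multiplicity $1$, and indeed $\bar{Z}_n = \int_{-1}^{1} \tfrac{1}{2} e^{-n w^2}\, dw \approx \tfrac{1}{2}\sqrt{\pi/n} \propto n^{-1/2}$.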
Thanks for the answer! I think my first question was confused because I didn’t realize you were talking about local free energies instead of the global one :)
As discussed in the comment in your DSLT1 question, they are both singularities of $K$ since they are both critical points (local minima).
Oh, I actually may have missed that aspect of your answer back then. I’m confused by that: in algebraic geometry, the zeros of a set of polynomials are not necessarily already singularities. E.g., for the polynomial $xy$ on $\mathbb{R}^2$, the zero set consists of the two axes, which form an algebraic variety, but only at the origin is there a singularity, because the derivative vanishes there.
Now, for the KL divergence, the situation seems more extreme: the zeros are also, at the same time, the minima of $K$, and thus the derivative vanishes at every point in the set $W_0 = \{w \mid K(w) = 0\}$. This suggests every point in $W_0$ is singular. Is this correct?

So far, I thought “being singular” means that the effective number of parameters around the singularity is lower than the full number of parameters. Also, I thought that it’s about the rank of the Hessian, not the vanishing of the derivative. Both perspectives contradict the interpretation in the preceding paragraph, which leaves me confused.
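To illustrate with a toy example (my own): for $K(w_1, w_2) = w_1^2 w_2^2$, we have
$$\nabla K = (2 w_1 w_2^2,\ 2 w_1^2 w_2), \qquad W_0 = \{w_1 = 0\} \cup \{w_2 = 0\},$$
so the gradient vanishes at every point of $W_0$, while the Hessian has rank $1$ at a generic point of an axis and rank $0$ at the origin. This is the kind of situation where the two perspectives seem to give different answers.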
The uninteresting answer is that SLT doesn’t care about the prior (other than its regularity conditions) since it is irrelevant in the limit.
I vaguely remember that there is a part in the MDL book by Grünwald where he explains how using a good prior such as Jeffreys prior somewhat changes the asymptotic behavior, but I’m not certain of that.
Thanks also for this post! I enjoy reading the sequence and look forward to post 5 on the connections to alignment :)
At some critical value, we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error.
“Discontinuity” might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these “sudden changes” happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren’t related to the phase transitions predicted by SLT?
There is, however, one fundamentally different kind of “phase transition” that we cannot explain easily with SLT: a phase transition of SGD in time, i.e. the number of gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time—the closest quantity is the number of datapoints $n$, but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice.
As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?
In general, it seems to me that we’re probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?
Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?
Which altered the posterior geometry, but not that of since (up to a normalisation factor).
I didn’t understand this footnote.
but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.
Hmm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the “true” vector or not.
Are you maybe trying to say the following? The truth determines which parameter vectors are preferred by the free energy, e.g. those close to the truth. For some truths, we will have more symmetries around the truth, and thus lower RLCT for regions preferred by the posterior.

We will use the label weight annihilation phase to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another.
It seems to me that in the other phase, the weights also annihilate each other, so the “non-weight annihilation phase” is a somewhat weird terminology. Or did I miss something?
The weight annihilation phase is never preferred by the posterior
I think there is a typo and you meant .
Thanks Liam also for this nice post! The explanations were quite clear.
The property of being singular is specific to a model class, regardless of the underlying truth.
This holds for singularities that come from symmetries where the model doesn’t change. However, is it correct that we need the “underlying truth” to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.
Both configurations, non-weight-annihilation (left) and weight-annihilation (right)
What do you mean with non-weight-annihilation here? Don’t the weights annihilate in both pictures?
The news is not very old yet. Lots of potential for people to start freaking out.