I’m a final-year PhD student at the University of Amsterdam working on AI Safety and Alignment, specifically the safety risks of Reinforcement Learning from Human Feedback (RLHF). Previously, I also worked on abstract multivariate information theory and equivariant deep learning. https://langleon.github.io/
Leon Lang
I think it would be valuable if someone would write a post that does (parts of) the following:
summarize the landscape of work on getting LLMs to reason.
sketch out the tree of possibilities for how o1 was trained and how it works at inference time.
select a “most likely” path in that tree and describe in detail a possibility for how o1 works.
I would find this valuable because it seems important for external safety work to know how frontier models work; otherwise it is impossible to point out theoretical or conceptual flaws in the labs’ alignment approaches.
One caveat: writing such a post could be considered an infohazard. I’m personally not too worried about this, since I guess that every big lab is already doing the same analysis internally and independently, so the post would not speed up innovation at any of the labs.
Thanks for the post, I agree with the main points.
There is another claim about causality one could make: LLMs cannot reliably act in the world as robust agents, since by acting in the world they change the world, leading to a distributional shift away from the correlational data the LLM encountered during training.
I think that argument is correct, but it misses an obvious solution: once you let your LLM act in the world, simply let it predict and learn from the tokens it receives in response. Then the LLM no longer models merely correlational relationships, but actual causal ones.
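To make this concrete, here is a toy example (my own, not specific to LLMs): a predictor fit to purely observational data learns a confounded relationship, while a predictor that learns from the outcomes of its own actions recovers the causal effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observational ("training") data: a confounder Z drives both X and Y.
Z = rng.normal(size=n)
X_obs = Z + 0.5 * rng.normal(size=n)
Y_obs = 0.5 * X_obs + 2.0 * Z + 0.1 * rng.normal(size=n)

# Fit Y ~ X on observational data: the coefficient absorbs the confounder.
coef_obs = np.polyfit(X_obs, Y_obs, 1)[0]

# "Acting in the world": the agent sets X itself (an intervention), so X is no
# longer correlated with Z. It then learns from the tokens/outcomes it gets back.
X_int = rng.normal(size=n)
Y_int = 0.5 * X_int + 2.0 * Z + 0.1 * rng.normal(size=n)
coef_int = np.polyfit(X_int, Y_int, 1)[0]

print(f"coefficient learned from observation: {coef_obs:.2f}")  # ~2.1, confounded
print(f"coefficient learned from own actions: {coef_int:.2f}")  # ~0.5, the causal effect
```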
Agreed.
I think the most interesting part was her comment that one way to predict a mind is to be a mind, and that that mind will not necessarily have the best interests of all of humanity as its goal. So she seems to take inner misalignment seriously.
A 40-minute podcast with Anca Dragan, who leads safety and alignment at Google DeepMind: https://youtu.be/ZXA2dmFxXmg?si=Tk0Hgh2RCCC0-C7q
To clarify: are you saying that since you perceive Chris Olah as mostly intrinsically caring about understanding neural networks (instead of mostly caring about alignment), you conclude that his work is irrelevant to alignment?
I can see that research into proof assistants might lead to better techniques for combining foundation models with RL. Is there anything more specific that you imagine? Outside of math, the problems are very different because there is no easy way to synthetically generate a lot of labeled data (as opposed to formally verifiable proofs).
Not much more specific! I guess that from a certain level of capabilities onward, one could create labels with foundation models that evaluate reasoning steps. This is much fuzzier than math, but I still guess that a person who created a groundbreaking proof assistant would be extremely valuable for any effort to make foundation models reason reliably. And if they worked at a company like Google, then I think their ideas would likely diffuse even if they didn’t want to work on foundation models.
Thanks for your details on how someone could act responsibly in this space! That makes sense. I think one caveat is that proof assistant research might need enormous amounts of compute, and so it’s unclear how to work on it productively outside of a company where the ideas would likely diffuse.
I think the main way that proof assistant research feeds into capabilities research is not through the assistants themselves, but through the transfer of proof assistant research to creating foundation models with better reasoning capabilities. I think researching better proof assistants can shorten timelines.
See also Demis Hassabis’ recent tweet. Admittedly, it’s unclear whether he refers to AlphaProof itself being accessible from Gemini, or to the research behind AlphaProof feeding into improvements of Gemini.
See also an important paragraph in the blogpost for AlphaProof: “As part of our IMO work, we also experimented with a natural language reasoning system, built upon Gemini and our latest research to enable advanced problem-solving skills. This system doesn’t require the problems to be translated into a formal language and could be combined with other AI systems. We also tested this approach on this year’s IMO problems and the results showed great promise.”
Not sure if this was discussed at LW before. This is an opinion piece by Sam Altman, which sounds like a toned down version of “situational awareness” to me.
The news is not very old yet. Lots of potential for people to start freaking out.
One question: Do you think the Chinchilla scaling laws are still correct today, or not? I would assume these scaling laws depend on the dataset used in training, so if OpenAI found or created a better dataset, this might change the scaling laws.
Do you agree with this, or do you think it’s false?
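For reference, the Chinchilla paper (Hoffmann et al., 2022) fits a parametric loss of roughly this form:

$$L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where N is the parameter count, D the number of training tokens, E the irreducible loss, and A, B, α, β are constants fitted to their particular training distribution (the fitted values give the well-known compute-optimal recipe of roughly 20 tokens per parameter). Since those constants are fitted to a specific data distribution, a substantially better dataset could plausibly shift them.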
https://x.com/sama/status/1813984927622549881
According to Sam Altman, GPT-4o mini is much better than text-davinci-003 was in 2022, but 100 times cheaper. In general, we see increasing competition to produce smaller-sized models with great performance (e.g., Claude Haiku and Sonnet, Gemini 1.5 Flash and Pro, maybe even the full-sized GPT-4o itself). I think this trend is worth discussing. Some comments (mostly just quick takes) and questions I’d like to have answers to:
Should we expect this trend to continue? How much further efficiency gain is still possible? Can we expect another 100x efficiency gain in the coming years? Andrej Karpathy expects that we might see a GPT-2-sized “smart” model.
What’s the technical driver behind these advancements? Andrej Karpathy thinks it is based on synthetic data: Larger models curate new, better training data for the next generation of small models. Might there also be architectural changes? Inference tricks? Which of these advancements can continue?
Why are companies pushing into small models? I think in hindsight, this seems easy to answer, but I’m curious what others think: If you have a GPT-4 level model that is much, much cheaper, then you can sell the service to many more people and deeply integrate your model into lots of software on phones, computers, etc. I think this has many desirable effects for AI developers:
Increase revenue, motivating investments into the next generation of LLMs
Increase market share. Some integrations are probably “sticky”, such that if you’re first, you secure revenue for a long time.
Make many people “aware” of potential use cases of even smarter AI so that they’re motivated to sign up for the next generation of more expensive AI.
The company’s inference compute is probably limited (especially for OpenAI, as the market leader), and not many people are convinced to pay a large amount for very intelligent models, so these reasons outweigh the reasons to publish larger models instead of, or in addition to, small ones.
What does all this mean for the next generation of large models?
Should we expect that efficiency gains in small models translate into efficiency gains in large models, such that a future model with the cost of text-davinci-003 is massively more capable than today’s SOTA? If Andrej Karpathy is right that small models’ capabilities come from synthetic data generated by larger, smarter models, then it’s unclear to me whether one can train SOTA models with these techniques, as this might require an even larger model to already exist.
At what point does it become worthwhile for e.g. OpenAI to publish a next-gen model? I’d guess you can still do a lot of “penetration of small-model use cases” in the next 1-2 years, leading to massive revenue increases without necessarily releasing a next-gen model.
Do the strategies differ for different companies? OpenAI is the clear market leader, so possibly they can penetrate the market further without first making a “bigger name for themselves”. In contrast, I could imagine that for a company like Anthropic, it’s much more important to get out a clear SOTA model that impresses people and makes them aware of Claude. I thus currently (weakly) expect Anthropic to more strongly push in the direction of SOTA than OpenAI.
I went to this event in 2022 and it was lovely. Will come again this year. I recommend coming!
Thanks for the answer!
But basically, by “simple goals” I mean “goals which are simple to represent”, i.e. goals which have highly compressed representations
It seems to me you are using “compressed” in two very different meanings in part 1 and 2. Or, to be fairer, I interpret the meanings very differently.
Let me try to make my view of things more concrete:
Compressed representations: A representation is a function f from observations of the world state (or sequences of such observations) into a representation space of “features”. That f is “compressed” means (a) that in f(x), only a small number of features are active at any given time, and (b) that this small number of features is still sufficient to predict/act in the world.
Goals building on compressed representations: A goal is a (maybe linear) function g from the representation space into the real numbers. The goal “likes” some features and “dislikes” others. (Or if it is not entirely linear, then it may like/dislike some simple combinations/compositions of features.)
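As a minimal sketch of these two definitions (my own toy formalization, with made-up details, just to fix notation):

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 1_000   # size of the representation space
k_active = 5         # "compressed": only a few features are active per input

def f(observation: np.ndarray) -> np.ndarray:
    """Toy representation: maps an observation to a sparse feature vector.

    Stands in for a learned encoder; here the active features are simply
    chosen by hashing the observation, to keep the sketch self-contained.
    """
    seed = abs(hash(observation.tobytes())) % (2**32)
    idx = np.random.default_rng(seed).choice(n_features, size=k_active, replace=False)
    features = np.zeros(n_features)
    features[idx] = 1.0
    return features

# A goal g is a (here linear) function on the representation space:
# it "likes" some features and "dislikes" others.
w = rng.normal(size=n_features)

def g(features: np.ndarray) -> float:
    return float(w @ features)

x = rng.normal(size=64)  # some observation of the world state
print(f"active features: {int(f(x).sum())} out of {n_features}")
print(f"goal value g(f(x)): {g(f(x)):.2f}")
```

The point of confusion below is about the composition g ∘ f: the sparsity of f(x) on any given input says nothing obvious about how simple g ∘ f is as a function from observations to the reals.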
It seems to me that in part 2 of your post, you view goals as compositions g ∘ f. Part 1 says that f is highly compressed. But it’s totally unclear to me why the composition g ∘ f should then have the simplicity properties you claim in part 2, which in my mind don’t connect with the compression properties of f as I just defined them.
A few more thoughts:
The notion of “simplicity” in part 2 seems to be about how easy it is to represent a function—i.e., the space of parameters with which the function is represented is simple in your part 2.
The notion of “compression” in part 1 seems to be about how easy it is to represent an input—i.e., is there a small number of features such that their activation tells you the important things about the input?
These notions of simplicity and compression are very different. Indeed, if you have a highly compressed representation f as in part 1, I’d guess that f necessarily lives in a highly complex space of possible functions with many parameters, thus the opposite of what seems to be going on in part 2.
This is largely my fault since I haven’t really defined “representation” very clearly, but I would say that the representation of the concept of a dog should be considered to include e.g. the neurons representing “fur”, “mouth”, “nose”, “barks”, etc. Otherwise if we just count “dog” as being encoded in a single neuron, then every concept encoded in any neuron is equally simple, which doesn’t seem like a useful definition.
(To put it another way: the representation is the information you need to actually do stuff with the concept.)
I’m confused. Most of the time, when seeing a dog, most of what I need is actually just to know that it is a “dog”, so this is totally sufficient to do something with the concept. E.g., if I walk on the street and wonder “will this thing bark?”, then knowing “my dog neuron activates” is almost enough.
I’m confused for a second reason: It seems like here you want to claim that the “dog” representation is NOT simple (since it contains “fur”, “mouth”, etc.). However, forming the “dog” representation requires lots of intelligence and should thus come along with compression, and if you equate compression and simplicity, then it seems to me like you’re not being consistent. (I feel a bit awkward saying “you’re not consistent”, but I think it’s probably good if I state my honest state of mind at this moment.)
To clarify my own position, in line with my definition of compression further above: I think that whether a representation f is simple/compressed is NOT a property of a single input-output relation (like “pixels of a dog get mapped to the dog-neuron being activated”), but instead a property of the whole FUNCTION that maps inputs to representations. This function is compressed if for any given input, only a small number of neurons in the last layer activate, and if these can be used (ideally in a linear way) for further predictions and for evaluating goal-achievement.
I agree that most people who say they are hedonic utilitarians are not 100% committed to hedonic utilitarianism. But I still think it’s very striking that they at least somewhat care about making hedonium. I claim this provides an intuition pump for how AIs might care about squiggles too.
Okay, I agree with this, fwiw. :) (Though I may not necessarily agree with claims about how this connects to the rest of the post)
Thanks for the post!
a. How exactly do 1 and 2 interact to produce 3?
I think the claim is along the lines of “highly compressed representations imply simple goals”, but the connection between compressed representations and simple goals has not been argued, unless I missed it. There’s also a chance that I simply misunderstand your post entirely.

b. I don’t agree with the following argument:
Decomposability over space. A goal is decomposable over space if it can be evaluated separately in each given volume of space. All else equal, a goal is more decomposable if it’s defined over smaller-scale subcomponents, so the most decomposable goals will be defined over very small slices of space—hence why we’re talking about molecular squiggles. (By contrast, you can’t evaluate the amount of higher-level goals like “freedom” or “justice” in a nanoscale volume, even in principle.)
The classical ML architecture that evaluates features separately across space is a CNN. That doesn’t mean that features in CNNs look for tiny structures, though: the deeper into the CNN you are, the more complex the features get. Actually, deep CNNs are an example of what you describe in argument 1: the features in later layers of CNNs are highly compressed and may tell you binary information such as “is there a dog”, but they apply to large parts of the input image.
Therefore, I’d also expect that what an AGI would care about are ultimately larger-scale structures, since the AGI’s features will nontrivially depend on the interaction of larger parts of space (and possibly time, e.g. if the AGI evaluates music, movies, etc.).
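To make the receptive-field point concrete, here is the standard recurrence for how far a deep feature “sees” into the input (my own sketch, with a made-up layer stack):

```python
# Receptive field of a stack of conv layers, via the standard recurrence:
#   rf <- rf + (kernel - 1) * jump;   jump <- jump * stride
def receptive_field(layers: list[tuple[int, int]]) -> int:
    """layers: (kernel_size, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# A made-up but VGG/ResNet-flavored stack: ten 3x3 convs, every other one with stride 2.
layers = [(3, 2) if i % 2 == 0 else (3, 1) for i in range(10)]
print(receptive_field(layers))  # 187: each late-layer feature sees a ~187x187 patch of the input
```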
c. I think this leaves open the confusion of why philosophers end up favoring the analog of squiggles when they become hedonic utilitarians. I’d argue that the premise may be false, since it’s unclear to me how what philosophers say they care about (“hedonium”) connects with what they actually care about (e.g., maybe they still listen to complex music, build a family, build status through philosophical argumentation, etc.).
You should all be using the “Google Scholar PDF reader extension” for Chrome.
Features I like:
References are linked and clickable
You get a table of contents
You can move back after clicking a link with Alt+left
I guess (but don’t know) that most people who downvote Garrett’s comment overupdated on intuitive explanations of singular learning theory, not realizing that entire books with novel and nontrivial mathematical theory have been written on it.
I do all of these except 3, and implementing a system like 3 is among the deprioritized items on my to-do list. Maybe I should prioritize it.
Yes the first! Thanks for the link!
I really enjoyed reading this post! It’s quite well-written. Thanks for writing it.
The only critique is that I would have appreciated more details on how the linear regression parameters are trained and what exactly the projection is doing. John’s thread is a bit clarifying on this.

One question: If you optimize the representation in the residual stream such that it corresponds to a particular chosen belief state, does the transformer then predict the next token as if in that belief state? I.e., does the transformer use the belief state for making predictions?
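For concreteness, here is roughly the kind of probe setup I imagine (my guess, with made-up shapes and random stand-in data, not necessarily what the authors did): record residual-stream activations, regress them linearly onto the ground-truth belief states, and then project activations through the fitted map onto the belief-state simplex.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical shapes: n_tokens residual-stream vectors of width d_model,
# each paired with the ground-truth belief state (a distribution over k hidden states).
n_tokens, d_model, k = 10_000, 512, 3
rng = np.random.default_rng(0)

resid = rng.normal(size=(n_tokens, d_model))        # stand-in for recorded activations
beliefs = rng.dirichlet(np.ones(k), size=n_tokens)  # stand-in for ground-truth belief states

# Fit one affine map from the residual stream to belief-state coordinates.
probe = LinearRegression().fit(resid, beliefs)

# "Projection": push activations through the fitted map to get predicted belief states,
# which is what would get plotted on the belief-state simplex.
predicted = probe.predict(resid)
print(predicted.shape)  # (n_tokens, k)
```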
OpenAI would have mentioned if they had reached gold on the IMO.