I think the IMO results strongly suggest that AGI-worthiness of LLMs at current or similar scale will no longer be possible to rule out (with human efforts). Currently, the absence of continual learning makes them clearly non-AGI, and in-context learning doesn’t necessarily get them there with feasible levels of scaling. But some sort of post-training-based continual learning likely won’t need more scale, and the difficulty of figuring it out remains unknown, as it only got into the water supply as an important obstruction this year.
The key things about solving IMO-level problems (it doesn’t matter whether it’s a proper gold or not) are difficulty reasonably close to the limit of human ability in a somewhat general domain, and correctness grading that is somewhat vague (natural-language proofs, not just answers). That describes most technical problems, so it’s evidence that for most technical problems of various other kinds, similar methods of training are not far off from making LLMs capable of solving them, and that LLMs don’t need much more scale to make that happen. (Perhaps they need a little more scale to solve such problems efficiently, without wasting a lot of parallel compute on failed attempts.)
More difficult problems that take a lot of time to solve (and depend on learning novel specialized ideas) need continual learning to tackle them. Currently, in-context learning is the only straightforward way of getting there, by using contexts with millions or tens of millions of tokens of tool-using reasoning traces, equivalent to years of working on a problem for a human. This doesn’t work very well, and it’s unclear whether it will work well enough within the remaining near-term scaling, with 5 GW training systems and the subsequent slowdown. But it’s not ruled out that continual learning can be implemented in some other way, by automatically post-training the model, in which case it’s not obvious that there is anything at all left to figure out before LLMs at a scale similar to today’s become AGIs.
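As a rough sanity check on the tokens-to-human-years equivalence (the constants below are my own illustrative assumptions, not figures from the comment above), tens of millions of tokens of reasoning traces does land in the range of years of focused human work:

```python
# Back-of-envelope conversion of reasoning-trace length to human work-years.
# Both constants are assumptions chosen only to illustrate the order of magnitude.
TOKENS_PER_HUMAN_WORKDAY = 30_000  # assumed tokens of deliberate reasoning/notes per day
WORKDAYS_PER_YEAR = 250            # assumed working days per year

def trace_tokens_to_human_years(trace_tokens: int) -> float:
    """Equivalent human work-years for a reasoning trace of the given length."""
    return trace_tokens / (TOKENS_PER_HUMAN_WORKDAY * WORKDAYS_PER_YEAR)

for tokens in (1_000_000, 10_000_000, 50_000_000):
    print(f"{tokens:>12,} tokens ~ {trace_tokens_to_human_years(tokens):.1f} human work-years")
# Under these assumptions: roughly 0.1, 1.3, and 6.7 work-years respectively.
```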
The way you’re using this concept is poisoning your mind. Generality of a domain does imply that if you can do all the stuff in that domain, then you are generally capable (and, depending, that could imply general intelligence; e.g. if you’ve ruled out GLUT-like things). But if you can do half of the things in the domain and not the other half, then you have to ask whether you’re exhibiting general competence in that domain, vs. competence in some sub-domain and incompetence in the general domain. Making this inference enthymemically is poisoning your mind.

Cf. https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce#_We_just_need_X__intuitions :
For example, suppose that X is “self-play”. One important thing about self-play is that it’s an infinite source of data, provided in a sort of curriculum of increasing difficulty and complexity. Since we have the idea of self-play, and we have some examples of self-play that are successful (e.g. AlphaZero), aren’t we most of the way to having the full power of self-play? And isn’t the full power of self-play quite powerful, since it’s how evolution made AGI? I would say “doubtful”. The self-play that evolution uses (and the self-play that human children use) is much richer, containing more structural ideas, than the idea of having an agent play a game against a copy of itself.
Most instances of a category are not the most powerful, most general instances of that category. So just because we have, or will soon have, some useful instances of a category, doesn’t strongly imply that we can or will soon be able to harness most of the power of stuff in that category. I’m reminded of the politician’s syllogism: “We must do something. This is something. Therefore, we must do this.”
What I meant by general domain is that it’s not overly weird in the mental moves that are relevant there, so training methods that can create something that wins IMO are probably not very different from training methods that can create things that solve many other kinds of problems. It’s still a bit weird: high school math with olympiad add-ons is still a somewhat narrow toolkit, but for technical problems of many other kinds the mental move toolkits are not qualitatively different, even if they are larger. The claim is that solving IMO is a qualitatively new milestone from the point of view of this framing; it’s evidence about the AGI potential of LLMs at the near-current scale in a way that previous results were not.
I agree that there could still be gaps, and the “generality” of IMO isn’t a totalizing magic that prevents the existence of crucial remaining gaps. I’m not strongly claiming there aren’t any crucial gaps, just that with IMO as an example it’s no longer obvious there are any, at least as long as the training methods used for IMO can be adapted to those other areas, which isn’t always obviously the case. And of course continual learning could prove extremely hard. But there also isn’t strong evidence yet that it’s extremely hard, because it wasn’t a focus for very long while LLMs at current levels of capability were already available. And the capabilities of in-context learning with 50M-token contexts and even larger LLMs haven’t been observed yet.
So it’s a question of calibration. There could always be substantial obstructions such that it’s no longer obvious that they are there even though they are. But also, at some point, there actually aren’t any. So always suspecting currently unobservable crucial obstructions is not the right heuristic either; the prediction of when the problem could actually be solved needs to be allowed to respond to some sort of observable evidence.
What I meant by general domain is that it’s not overly weird in the mental moves that are relevant there, so training methods that can create something that wins IMO are probably not very different from training methods that can create things that solve many other kinds of problems.
I took you to be saying:
1. math is a general domain
2. IMO is fairly hard math
3. LLMs did the IMO
4. therefore LLMs can do well in a general domain
5. therefore probably maybe LLMs are generally intelligent.
But maybe you instead meant
working out math problems applying known methods is a general domain
?
Anyway, “general domain” still does not make sense here. The step from 4 to 5 is not supported by this concept of “general domain” as you’re applying it here.
It’s certainly possible. But “efficient continual learning” sounds a lot like AGI! So, to say that this is the thing missing for AGI is not such a strong statement about the distance left, is it?
I don’t think this is moving goalposts on the current paradigm. The word “continual” seems to have basically replaced “online” since the rise of LLMs, perhaps because they manage a bit of in-context learning, which is sort-of-online but not-quite-continual and makes a distinction necessary. However, “a system that learns efficiently over the course of its lifetime” is basically what we always expected from AGI; e.g., this is roughly what Hofstadter claimed was missing in “Fluid Concepts and Creative Analogies” as far back as 1995.
I agree that we can’t rule out LLMs at roughly the current scale reaching AGI. I just want to guard against the implication (which others may read into your words) that this is some kind of default expectation.
The question for this subthread is the scale of LLMs necessary for the first AGIs, and what the IMO results say about that. Continual learning through post-training doesn’t obviously require more scale, and IMO is an argument that the current scale is almost sufficient. It could be very difficult conceptually/algorithmically to figure out how to actually do continual learning with automated post-training, but that still doesn’t need to depend on more scale for the underlying LLM; that’s my point about the implications of the IMO results. Before those results, it was far less clear whether the current (or near-term feasible) scale would be sufficient for the neural net cognitive engine part of the AGI puzzle.
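To make the kind of loop being gestured at here concrete, here is a minimal sketch of continual learning via automated post-training, under my own assumptions; every name below is a hypothetical placeholder, not a description of any existing system or of the specific method the comment has in mind.

```python
from typing import List, Protocol

class TrainableModel(Protocol):
    # Hypothetical interface, assumed only for this sketch.
    def generate_reasoning_traces(self, problem: str) -> List[str]: ...
    def judge_useful(self, trace: str) -> bool: ...
    def post_train(self, experience: List[str]) -> "TrainableModel": ...

def continual_learning_loop(model: TrainableModel, problem: str, rounds: int = 10) -> TrainableModel:
    """Alternate between working on a problem and folding the experience back into the weights."""
    experience: List[str] = []
    for _ in range(rounds):
        traces = model.generate_reasoning_traces(problem)              # long tool-using rollouts
        experience.extend(t for t in traces if model.judge_useful(t))  # keep only curated traces
        model = model.post_train(experience)                           # update the weights, not just the context
    return model
```

The point of the sketch is only that the outer loop itself adds no new scale requirement for the underlying model; all of the open difficulty is in how the curation and post-training steps would actually have to work.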
It could be that LLMs can’t get there at the current scale because LLMs can’t get there at any (potentially physical) scale with the current architecture.
So in some sense, yes, that wouldn’t be a prototypical example of a scale bottleneck.
I think the IMO results strongly suggest that AGI-worthiness of LLMs at current or similar scale will no longer be possible to rule out (with human efforts).
I’m not so confident about this. It seems to me that IMO problems are not so representative of the real-world tasks faced by human-level agents.