I’m surprised to see zero mentions of AlphaEvolve. AlphaEvolve generated novel solutions to math problems, “novel” in the “there are no records of any human ever proposing those specific solutions” sense. Of course, LLMs didn’t generate them unprompted; humans had to do a lot of scaffolding. And it was for problems where it’s easy to verify that the solution is correct; “low messiness” problems if you will. Still, this means that LLMs can generate novel solutions, which seems like a crux for “Can we get to AGI just by incrementally improving LLMs?”.
Please provide more detail about this example. What did the system invent? How did the system work? What makes you think it’s novel? Would it have worked without the LLM?
(Every previous time someone has said something of the form “actually XYZ was evidence of generality / creativity / deep learning being awesome / etc.” and I’ve spent time looking into the details, it has turned out that they were giving a quite poor summary of the result, in favor of making the thing sound more scary / impressive. Or maybe they were using a much lower bar for lots of descriptor words. But anyway, please be specific.)
https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/

https://arxiv.org/pdf/2506.13131
Example: matrix multiplication using fewer multiplication operations.
There were also combinatorics problems, “packing” problems (like multiple hexagons inside a bigger hexagon), and others. All of that is in the paper.
Also, “This automated approach enables AlphaEvolve to discover a heuristic that yields an average 23% kernel speedup across all kernels over the existing expert-designed heuristic, and a corresponding 1% reduction in Gemini’s overall training time.”
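For concreteness on “fewer multiplication operations”: the classic prior art here is Strassen’s algorithm, which multiplies two 2×2 matrices with 7 scalar multiplications instead of the naive 8; AlphaEvolve’s matrix-multiplication results are improvements in this same genre, for larger sizes. A minimal sketch of the 2×2 case:

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications
    (Strassen, 1969) instead of the naive 8."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    # The seven products; each costs exactly one multiplication.
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    # The output entries are assembled with additions/subtractions only.
    return [[p5 + p4 - p2 + p6, p1 + p2],
            [p3 + p4, p1 + p5 - p3 - p7]]
```

The point is that the multiplication count drops from 8 to 7 because the outputs are reassembled from the seven products using only additions, which are cheaper; searching for decompositions like this (for bigger matrices) is the kind of problem AlphaEvolve was pointed at.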
How did the system work?
It’s essentially an evolutionary/genetic algorithm, with LLMs providing “mutations” for the code. Then the code is automatically evaluated, bad solutions are discarded, and good solutions are kept.
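That loop can be sketched as follows. This is not AlphaEvolve’s actual implementation (which layers a program database, prompt construction, and multiple Gemini models on top); `llm_mutate` and `evaluate` are hypothetical stand-ins for the LLM mutation step and the automatic evaluator.

```python
import random

def evolve(seed_program, llm_mutate, evaluate, population_size=20, generations=50):
    """Minimal evolutionary loop in the style described above: an LLM
    proposes "mutations" of existing candidates, an automatic evaluator
    scores them, and only the best candidates survive each generation."""
    population = [seed_program]
    for _ in range(generations):
        # Ask the LLM to mutate randomly chosen parents.
        children = [llm_mutate(random.choice(population))
                    for _ in range(population_size)]
        # Score everything with the automatic evaluator; keeping the old
        # population in the pool means good solutions are never lost.
        scored = sorted(population + children, key=evaluate, reverse=True)
        # Discard bad solutions, keep good ones.
        population = scored[:population_size]
    return population[0]
```

The automatic-verifiability requirement shows up in the `evaluate` argument: the whole scheme only works on problems where a candidate can be scored mechanically.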
What makes you think it’s novel?
These solutions hadn’t previously been discovered by humans (unless the authors just couldn’t find the right references, but I assume they were diligent).
Would it have worked without the LLM?
You mean, “could humans have discovered them, given enough time and effort?”. Yes, most likely.
Um, ok, were any of the examples impressive? For example, did any of the examples derive their improvement in some way other than chewing through bits of algebraicness? (The answer could easily be yes without being impressive, for example by applying some obvious known idea to some problem that simply hadn’t happened to have that idea applied to it before, but that’s a good search criterion.)
I don’t think so.
Ok gotcha, thanks. In that case it doesn’t seem super relevant to me. I would expect there to be lots of gains in any areas where there’s algebraicness to chew through; and I don’t think this indicates much about whether we’re getting AGI. Being able to “unlock” domains, so that you can now chew through algebraicness there, does weakly indicate something, but it’s a very fuzzy signal IMO.
(For contrast, a behavior such as originarily producing math concepts has a large non-algebraic component, and would IMO be a fairly strong indicator of general intelligence.)
A bit of a necrocomment, but I’d like to know if LLMs solving unsolved math problems has changed your mind.

Erdős problems 205 and 1051: see the “AI contributions to Erdős problems” page on the teorth/erdosproblems wiki. Note: I don’t know what LLM Aristotle is based on, but Aletheia is based on Gemini.

Also this paper: “Extremal descendant integrals on moduli spaces of curves: An inequality discovered and proved in collaboration with AI” [2512.14575]
Very little / hard to evaluate. I have been doing my best to carefully avoid saying things like “do math/science research”, unless speaking really loosely, because I believe that’s quite a poor category. It’s like “programming”; sure, there’s a lot in common between writing a CRUD app and tweaking a UI, but neither is really the same thing as “think of a genuinely novel algorithm and implement it effectively in context”. Quoting from https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce#_We_just_need_X__intuitions :
Most instances of a category are not the most powerful, most general instances of that category. So just because we have, or will soon have, some useful instances of a category, doesn’t strongly imply that we can or will soon be able to harness most of the power of stuff in that category. I’m reminded of the politician’s syllogism: “We must do something. This is something. Therefore, we must do this.”.
So the outcome of “think of a novel math concept that mathematicians then find interesting” is nontrivially more narrow than “prove something nontrivial that hasn’t been proved before”. I haven’t evaluated the actual results though and don’t know how they work. If mathematicians start reading off lots of concepts directly from the LLMs’ reasonings and find them interesting and start using them in further theory, that would be surprising / alarming / confusing, yeah—though it also wouldn’t be mindblowing if it turns out that “a novel math concept that mathematicians then find interesting” is also a somewhat poor category in the same way, and actually there are such things to be found without it being much of a step toward general intelligence.
I took it as obvious that this sort of thing wouldn’t meet Tsvi’s bar. AlphaEvolve seems quite unsurprising to me. We have seen other examples of using LLMs to guide program search. Tsvi and I do have disagreements about how far that sort of thing can take us, but I don’t think AlphaEvolve provides clear evidence on that question. Of course LLMs can concentrate the probability mass moderately well, improving brute-force search. Not clear how far that can take us.