Obligatory “Thane compulsively digs into the details of this week’s LLM Innovators paper instead of doing anything productive”:
This was still a super impressive result for Claude Sonnet 3.5. Its ideas were almost as good as the human ideas, despite all the Goodhart’s Law issues.
Ehhhh, I’m not particularly interested in grading that on a curve. What was the human performance in this study, in absolute terms? Was the generative process that produced human ideas for this study capable of innovation?
Keep in mind that these ideas weren’t generated as part of an active research loop where the humans improved on them in tandem with trying to execute them, discussed those ideas with other researchers, thought about them over days/weeks, etc. It’s not necessarily the case that the average human performance here was anything impressive.[1][2]
And my interpretation is that yes, the human ideas here were, on average, pretty bad. The human performance was: effectiveness 4.8, excitement 4.5, novelty 4.9, soundness 5.4, overall 4.0. Extracting the meaning of those from the appendix (page 18):
Score Interpretations
(Effectiveness 5) “Mixed results: The method provides mixed results. It works better than baselines on some datasets or metrics, but not consistently across all of them. The gains tend to be very small and not significant.”
(Novelty 5) “Somewhat novel: The idea has some differences from existing work, but the variations are very incremental rather than substantial. It might refine or extend previous ideas but lacks enough originality to justify a new paper on its own.”
(Excitement 4.5) The meaning of 4⁄10 is defined only indirectly, as something between 3⁄10 and 5⁄10:
(Excitement 3) “Mediocre: This work makes marginal contributions and is very incremental.”
(Excitement 5) “Leaning negative: It has interesting bits but overall not exciting enough.”
So we get something between the 5⁄10 description and the average of those two (see the quick arithmetic after these score interpretations). Eh, let’s just round it off to 5.
(Soundness 5.4) Between:
(Soundness 5) “The methodology is mostly reasonable, but there are a few notable gaps in correctness, experimental design, or justification.”
(Soundness 6) “Reasonably sound: The methodology is generally correct and the experiments are reasonable, but some minor technical choices, suboptimal experimental design, or missing details could be further improved.”
(Overall 4) “Ok but not good enough, rejection for major AI conferences.”
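(A quick way to see the Excitement interpolation, in my own shorthand: write anchor(n) for the rubric description attached to score n. The rubric implicitly defines anchor(4) as roughly halfway between anchor(3) and anchor(5), so the 4.5 average reads as roughly
$$\tfrac{1}{2}\,\mathrm{anchor}(4)+\tfrac{1}{2}\,\mathrm{anchor}(5)\;\approx\;\tfrac{1}{4}\,\mathrm{anchor}(3)+\tfrac{3}{4}\,\mathrm{anchor}(5),$$
i.e., three-quarters of the way from “Mediocre” towards “Leaning negative”, which is why rounding to 5 seems fair.)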
I’d say that’s some floor-level performance from the humans. I don’t think it’d be too controversial to say that the average human-published paper is pretty terrible in absolute terms: it does ~nothing to actually push the frontier, it does not constitute useful research, it’s just something researchers do to survive publish-or-perish. Is matching that performance actually indicative of any ability to innovate? I’m doubtful.
And Novelty of ~5 for humans, 4.7 for LLMs? This is not beating the “LLMs can’t innovate” allegations; in fact, it’s evidence for them[3]. A machine that goes into the arXiv database, extracts the idea from some random paper, makes some random-but-not-outright-nonsensical changes to it, and outputs that, would’ve likely attained the same score.
Especially since humans don’t have actual perfect recall of all papers ever published or all research ideas ever suggested on the Internet, and LLMs more or less do. This is biasing math benchmarks towards overestimating LLM capabilities; why assume this isn’t happening here?
Now, the above does have a huge potential hole: maybe we should be looking at the top 5% of AI-generated ideas, instead of the average value? After all, if LLMs could produce a 10⁄10 idea 5% of the time, that would be more than sufficient for transformative impact.
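To make “more than sufficient” concrete (my own arithmetic, not the paper’s): if each generated idea independently had a 5% chance of being a 10⁄10, then sampling $n$ ideas and filtering would surface at least one such idea with probability
$$1-0.95^{\,n},\qquad\text{e.g. }1-0.95^{100}\approx 99.4\%,$$
and generating a hundred candidate ideas is cheap for an LLM.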
The paper does provide a way to eyeball this: page 25 features plots of score distributions. (Subtlety: each AI idea got 4-5 reviews, and in the bar plot, each dot is a review score, not an average score for a given project across all reviews it got. Also, for perspective: there were 96 reviews in total.)
We can see the following regarding AI performance (rough fractions of the 96 reviews worked out below):
Novelty: 3 scores at 8⁄10, a few at 7⁄10, quite a lot at 6⁄10.
Excitement (expected impact): 1 at 8⁄10, 4 at 7⁄10, a bunch at 6⁄10.
Overall score: 3 at 7⁄10, a bunch at 6⁄10.
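For scale, my own back-of-the-envelope fractions from those counts, treating each of the 96 reviews as one sample (the 6⁄10 buckets aren’t precisely countable from the plots, so I skip them):
$$\frac{3}{96}\approx 3.1\%\ \text{(novelty at }8/10\text{)},\qquad\frac{1+4}{96}\approx 5.2\%\ \text{(excitement at }\geq 7/10\text{)},\qquad\frac{3}{96}\approx 3.1\%\ \text{(overall at }7/10\text{)}.$$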
Score Interpretations
Novelty:
8/10: “Clearly novel: The idea presents major differences from all known existing works. It introduces fresh insights or approaches that significantly advance the topic in a meaningful way.”
7/10: (defined only indirectly, as something between 6⁄10 and 8⁄10).
6/10: “Reasonably novel: The idea introduces notable differences compared to prior work and likely has enough originality to justify a new paper. However, it still builds significantly on existing ideas rather than breaking new ground.”
Excitement:
8/10: “Exciting: It would deepen the community’s understanding or make major progress in this research direction.”
7/10: (defined only indirectly, as something between 6⁄10 and 8⁄10).
6/10: “Leaning positive: It is exciting enough to be accepted at a major AI conference, but still has some weaknesses or somewhat incremental.”
Overall:
7/10: “Good idea, would be accepted by major AI conferences.”
6/10: “Marginally above the acceptance threshold of major AI conferences.”
(For perspective: Opus 4 and o3 estimate that major AI conferences accept top 10-15% of papers worldwide. I don’t know any better than that.)
Now, that actually does incrementally update me towards “LLMs may be capable of innovation”. The higher-end values may be noise (in the sense discussed below), but the 6⁄10 tier is pretty solidly reached.
That said: keep in mind that those are still individual experts’ estimates of novelty/excitement, not ground-truth values. For example, that 8⁄10 idea may be an outlier in the sense of “this one reviewer liked this one idea disproportionately well”, not in the sense of “AI came up with a really good idea”.
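To see why that distinction matters, here’s a minimal sketch with made-up numbers (hypothetical scores, not the paper’s data): counting individual reviews, as the page-25 plots do per the subtlety noted above, can put dots in the high buckets even when no idea’s averaged score comes anywhere close.

```python
# Hypothetical data, for illustration only -- not the paper's actual scores.
# Each inner list is one idea's 4-5 novelty reviews (out of 10).
import statistics

ideas = [
    [8, 5, 4, 5],      # one enthusiastic reviewer, lukewarm rest
    [6, 6, 5, 6, 5],
    [4, 5, 4, 4],
    [7, 4, 5, 5],
]

all_reviews = [score for idea in ideas for score in idea]
idea_means = [statistics.mean(idea) for idea in ideas]

# Counting individual review scores: two land at 7 or above.
print(sum(score >= 7 for score in all_reviews))  # -> 2

# Averaging each idea's reviews first: no idea averages 7 or above.
print(sum(mean >= 7 for mean in idea_means))     # -> 0
```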
Also, uhh… Turns out the ideas initially evaluated and the ideas actually executed-on were not exactly the same ideas. See Section 5.1 on page 8:
We show the counts of all types of changes made to human and AI ideas in Table 6, where we see that human ideas and AI ideas involve an average of 2.9 and 3.1 changes, respectively. This indicates that only a moderate number of changes are made to both human and AI ideas. Moreover, all of the changes focus on experiment details rather than altering any algorithms proposed in the original ideas. Examples of these changes include switching to benchmarks that are appropriate for the given tasks, updating the backbone models to more recent ones, adding more comprehensive evaluation metrics, specifying any missing hyper-parameters and prompt details, adding stronger baselines, and adding more analysis or ablation studies.
Details on page 27. Some excerpts:
In another example, the AI-generated idea “Sociolinguistic Role-Play Prompting” proposed experiments on OpenSubtitles and XNLI, which were both removed because they don’t contain the sociolinguistic metadata necessary for the proposed experiments. In the AI-generated idea “Adaptive Semantic Masking”, the executor added more datasets, including Jailbreak-Bench and DAN-Forbidden-Questions, apart from AdvBench mentioned in the original idea.
This refers to adding or changing baseline methods in the proposed experiments. For example, in the AI-generated idea “Adaptive Contextual Pruning”, the executor added a baseline “RAG using model-based embeddings”. In the AI-generated idea “EntropyGuided Prompt Mutation”, the proposed baseline Monte Carlo Dropout was dropped since it’s infeasible on black-box LLMs.
In multiple projects, executors decided the temperature and top_p values when sampling responses from LLMs, the number of iterations for applying the proposed method, the number of demo examples for in-context learning, and the number of runs when reporting performance.
For fairness’ sake, humans’ ideas were “tweaked” in this manner as well...
But no, I think this nontrivially poisons any conclusions we should draw from this paper. First, such “small tweaks” only sound small; they might have pretty significant impact, and making them may require good taste/research intuitions.
Second, this is something where I’d expect humans to perform significantly better: if the original idea-generators had actually sat down to execute their projects, they would’ve likely made these tweaks themselves. LLMs, on the other hand, are pretty bad at recovering their stride in this manner (see, e.g., LLMs Play Pokémon performance).
So, summing up:
Average performance of both humans and LLMs here seems abysmal in the absolute sense. According to the metrics the evaluators used, this is pretty much “below useful research”.
Humans likely underperformed or even deliberately sandbagged, compared to the actual in-the-wild performance.
Top 5%-10% performance of LLMs is more interesting. Even discarding potential noisy outliers, we attain “reasonably novel”, “leaning-positive”, and “bottom tier of papers published at major AI conferences”.
That said, there are three factors likely biasing the estimates upwards:
Those scores are individual experts’ evaluations, not actual ground-truth values, and AIs’ tendency to “optimize for what looks good/exciting” may still be at play here.
LLMs have excellent recall of every idea published on the internet; the experts don’t. This biases their performance on math benchmarks upwards, and it’s likely at play here as well.
The actual ideas executed-on were slightly improved from the ideas submitted. The human idea-suggesters would’ve likely been able to do this on their own, whereas LLMs likely wouldn’t have, so this correction step advantaged the LLMs.
Overall, this is a better LLM Innovators paper than the usual LLM Innovators papers, methodology-wise. It, like, actually tries to measure what it says it’s trying to measure, instead of playing shell games. I’d be interested in whether LLMs’ performance on this “benchmark” improves as capabilities increase, and if yes, that may be concerning.
But, like, the actual average performance it currently reports is abysmal. And inasmuch as the incrementally better performance of the top decile of AI ideas is incremental evidence towards LLMs-can-innovate, all the aforementioned biases are evidence that their actual independent ground-truth performance should be expected to be incrementally worse. The whole thing then approximately cancels out.
Ad absurdum, if the study had generated the ideas by breaking into the researchers’ houses at night, waking them up, and forcing them to immediately generate an idea at gunpoint, that generative process would probably not have been very capable of producing innovations, right? Obviously the paper did something much more reasonable, but whatever it did still erred in the same direction: towards human underperformance.
For example: would the researchers have actually contributed the best ideas they could come up with? If they arrived at some brilliant idea that would score at Excitement 9⁄10, why would they waste it on this study, instead of keeping it to themselves and then pursuing that research thread on their own?
This is IMO a major problem with all “let’s hire a bunch of experts to design AI benchmarks” projects. Why would the mathematicians you hired to invent novel problems give you the actually novel problem-solution pairs if they came up with some, instead of keeping them for personal publication and giving you some boring rehash of a known problem? (Which the LLM then solves by retrieving an obscure memorized StackExchange answer to basically the same problem.)
Indeed, they probably wouldn’t even attempt to produce something actually novel: that’s hard research work which you can’t reliably churn out for a deadline.
Now, you might counter that LLMs could also be arranged into scaffolds and told to iterate on ideas for a while, and that this might significantly improve their performance. I don’t think this has been shown to be true; IIRC, Google did something like this and the performance barely improved over just asking LLMs on the spot.
Inasmuch as we would’ve updated towards LLM innovators if they scored higher here.