UPDATE: the blog post I referred to from Scott Aaronson has since been updated to note that the trick from GPT-5 “should have been obvious,” and, more importantly, that a human has already come up with a better trick (directly replacing GPT-5’s) which resolves an open problem left in the paper. To me, this opens the possibility that GPT-5 may have net slowed down the overall research process (do we actually want papers to be written faster, but of possibly lower quality?), though I would guess it was still a productivity booster. Still, I am less impressed with this result than I was at first.
A while back, I claimed that LLMs had not produced original insights, resulting in this question: https://www.lesswrong.com/posts/GADJFwHzNZKg2Ndti/have-llms-generated-novel-insights
To be clear, I wasn’t talking about weak original insights: things that are technically novel and true but totally trivial or uninteresting (like, say, calculating the value of an expression that had maybe never been calculated before but could be with standard techniques). Obviously this is a blurry line, but I don’t think the claim is empty at all: an LLM could falsify it by outputting the proof of a conjecture that mathematicians were interested in.
At the time, IMO no one could come up with a convincing counterexample. Now, I think the situation is a lot less clear, and it’s very possible that this will in retrospect be the LAST time I can reasonably claim that what I said holds up. For instance, GPT-5 apparently helped Scott Aaronson prove a significant result: https://scottaaronson.blog/?p=9183#comments
This required some back-and-forth iteration in which it made confident mistakes he had to correct. And it’s possible that this tiny part of the problem didn’t require original thinking on its own.
However, it’s also possible that I am actually just on copium and should admit I was wrong (or at least, what I said then is wrong now). I’m not sure. Anything slightly more convincing than this would be enough to change my mind.
I’m aware of various small improvements to combinatorial bounds, usually either from specialized systems or not hard enough to be interesting, or (usually) both. Has anyone seen anything beyond this (and beyond Aaronson’s example)?
For my part, I now (somewhat newly) find LLMs useful as a sort of fuzzy search engine, used before the real search engine to figure out what to search for. That includes usefulness for research, but it certainly does not include DOING research.
Some signal: Daniel Litt, the mathematician who seems most clued-in regarding LLM use, still doesn’t think there have been any instances of LLMs coming up with new ideas.
I’m currently watching this space closely, but I don’t think anything so far has violated my model. LLMs may end up useful for math in the “prove/disprove this conjecture” way, but not in the “come up with new math concepts (/ideas)” way.
Ah, though perhaps our cruxes there differed from the beginning, if you count “prove a new useful conjecture” as a “novel insight”. IMO, that’d only make them good interactive theorem provers, and wouldn’t bear much on the question of “can they close the loop on R&D/power the Singularity”.
Meta: I find the tone of your LW posts and comments to be really good in some way and I want to give positive feedback and try to articulate what I like about the vibe.
I’d describe it as something like: independent thinking/somewhat original takes expressed in a chill, friendly manner, without deliberate contrarianism, and scout mindset but not in a performative way. Also no dunking on stuff. But still pushing back on arguments you disagree with.
To my tastes this is basically optimal. Hope this comment doesn’t make you overthink it in the future. And maybe this can provide some value to others who are thinking about what style to aim for/promote.
Edit: maybe a short name for it is being intellectually disagreeable while being socially agreeable? Idk that’s probably an oversimplification.
I think it might be worthwhile to distinguish cases where LLMs came up with a novel insight on their own vs. were involved, but not solely responsible.
You wouldn’t credit Google for the breakthrough of a researcher who used Google when making a discovery, even if the discovery wouldn’t have happened without the Google searches. The discovery maybe also wouldn’t have happened without the eggs and toast the researcher had for breakfast.
“LLMs supply ample shallow thinking and memory while the humans supply the deep thinking” is a different and currently much more believable claim than “LLMs can do deep thinking to come up with novel insights on their own.”
I agree; I just want to flag that the goalposts may be moving from “no novel insights” to “no deep thinking.”
In my view, you don’t get novel insights without deep thinking except extremely rarely by chance, but you’re right to make sure the topic doesn’t shift without anyone noticing.
Full Scott Aaronson quote in case anyone else is interested:

This is the first paper I’ve ever put out for which a key technical step in the proof of the main result came from AI—specifically, from GPT5-Thinking. Here was the situation: we had an N×N Hermitian matrix E(θ) (where, say, N=2^n), each of whose entries was a poly(n)-degree trigonometric polynomial in a real parameter θ. We needed to study the largest eigenvalue of E(θ), as θ varied from 0 to 1, to show that this λ_max(E(θ)) couldn’t start out close to 0 but then spend a long time “hanging out” ridiculously close to 1, like 1/exp(exp(exp(n))) close for example.

Given a week or two to try out ideas and search the literature, I’m pretty sure that Freek and I could’ve solved this problem ourselves. Instead, though, I simply asked GPT5-Thinking. After five minutes, it gave me something confident, plausible-looking, and (I could tell) wrong. But rather than laughing at the silly AI like a skeptic might do, I told GPT5 how I knew it was wrong. It thought some more, apologized, and tried again, and gave me something better. So it went for a few iterations, much like interacting with a grad student or colleague. Within a half hour, it had suggested to look at the function

(the expression doesn’t copy-paste properly)

It pointed out, correctly, that this was a rational function in θ of controllable degree, that happened to encode the relevant information about how close the largest eigenvalue λ_max(E(θ)) is to 1. And this … worked, as we could easily check ourselves with no AI assistance. And I mean, maybe GPT5 had seen this or a similar construction somewhere in its training data. But there’s not the slightest doubt that, if a student had given it to me, I would’ve called it clever. Obvious with hindsight, but many such ideas are.

I had tried similar problems a year ago, with the then-new GPT reasoning models, but I didn’t get results that were nearly as good. Now, in September 2025, I’m here to tell you that AI has finally come for what my experience tells me is the most quintessentially human of all human intellectual activities: namely, proving oracle separations between quantum complexity classes.

(couldn’t resist including that last sentence)
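For anyone who wants a concrete feel for the kind of object being described (the expression GPT-5 suggested is omitted above, and I won’t try to reconstruct it), here is a minimal toy sketch in Python/NumPy. The matrix E(θ) below is an arbitrary construction of my own, not the one from the paper; it just shows what “tracking λ_max of a Hermitian matrix whose entries are trigonometric polynomials in θ” looks like computationally.

```python
# Toy illustration only: NOT the construction from Aaronson's paper.
# Build a small Hermitian matrix E(theta) with trigonometric-polynomial entries
# and watch its largest eigenvalue as theta sweeps from 0 to 1.
import numpy as np

def E(theta: float, n: int = 3) -> np.ndarray:
    """A toy N x N Hermitian matrix (N = 2**n) whose entries are trig functions of theta."""
    N = 2 ** n
    k = np.arange(N)
    A = np.exp(2j * np.pi * theta * np.add.outer(k, k) / N)  # entry (j, k) = e^{2*pi*i*theta*(j+k)/N}
    return (A + A.conj().T) / (2 * N)  # Hermitian, normalized so all eigenvalues lie in [-1, 1]

thetas = np.linspace(0.0, 1.0, 201)
lam_max = np.array([np.linalg.eigvalsh(E(t)).max() for t in thetas])
print(f"lambda_max(E(theta)) ranges over [{lam_max.min():.3f}, {lam_max.max():.3f}]")
```

Numerics like this can only suggest behavior, of course; the point of the rational-function trick described in the quote is that its degree can be controlled analytically.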
A couple more (recent) results that may be relevant pieces of evidence for this update:
A multimodal robotic platform for multi-element electrocatalyst discovery
“Here we present Copilot for Real-world Experimental Scientists (CRESt), a platform that integrates large multimodal models (LMMs, incorporating chemical compositions, text embeddings, and microstructural images) with Knowledge-Assisted Bayesian Optimization (KABO) and robotic automation. [...] CRESt explored over 900 catalyst chemistries and 3500 electrochemical tests within 3 months, identifying a state-of-the-art catalyst in the octonary chemical space (Pd–Pt–Cu–Au–Ir–Ce–Nb–Cr) which exhibits a 9.3-fold improvement in cost-specific performance.”
Generative design of novel bacteriophages with genome language models
“We leveraged frontier genome language models, Evo 1 and Evo 2, to generate whole-genome sequences with realistic genetic architectures and desirable host tropism [...] Experimental testing of AI-generated genomes yielded 16 viable phages with substantial evolutionary novelty. [...] This work provides a blueprint for the design of diverse synthetic bacteriophages and, more broadly, lays a foundation for the generative design of useful living systems at the genome scale.”
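I can’t speak to the chemistry in the first result, but for readers who haven’t seen the optimization component before: “Bayesian optimization over candidate compositions” is conceptually a loop like the toy sketch below. Everything in it (the objective, the composition space, the acquisition rule) is invented for illustration; it is not CRESt’s KABO, which presumably also incorporates knowledge from the multimodal models.

```python
# Purely illustrative Bayesian-optimization loop over candidate compositions.
# The objective, search space, and kernel choices are invented; NOT CRESt/KABO.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
N_ELEMENTS = 8  # e.g. an 8-element space like Pd-Pt-Cu-Au-Ir-Ce-Nb-Cr

def sample_compositions(n: int) -> np.ndarray:
    """Random composition vectors (element fractions summing to 1)."""
    return rng.dirichlet(np.ones(N_ELEMENTS), size=n)

def run_experiment(x: np.ndarray) -> float:
    """Stand-in for a real electrochemical test: a made-up noisy objective."""
    return float(-np.sum((x - 1.0 / N_ELEMENTS) ** 2) + 0.01 * rng.normal())

# Initial "experiments"
X = sample_compositions(10)
y = np.array([run_experiment(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(20):  # each iteration proposes one new candidate and "tests" it
    gp.fit(X, y)
    candidates = sample_compositions(500)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Expected improvement over the best result so far
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, run_experiment(x_next))

print(f"best objective after {len(y)} experiments: {y.max():.4f}")
```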
I don’t feel equipped to assess this.
FWIW, my understanding is that Evo 2 is not a generic language model that is able to produce innovations; it’s a transformer model trained on a mountain of genetic data, which gave it the ability to produce new functional genomes. The distinction is important; see the very similar case of GPT-4b.
This may help with the second one:
https://www.lesswrong.com/posts/k5JEA4yFyDzgffqaL/guess-i-was-wrong-about-aixbio-risks
How about this one?
https://scottaaronson.blog/?p=9183
That appears to be the same one I linked.
Though possibly you grabbed the link in a superior way (not to comments).