The question is IMO not “has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?” (obviously yes), but “is there any way to get an LLM to autonomously generate and recognize genuine scientific insights at least at the same rate as human scientists?”. A stopped clock is right twice a day, a random-word generator can eventually produce an insight, and talking to a rubber duck can let you work through a problem. That doesn’t mean the clock is useful for telling the time or that the RWG has the property of being insightful.
And my current impression is that no, there’s no way to do that. If there were, we would’ve probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.
This aligns with my experience. Yes, LLMs have sometimes directly outputted insights useful for my research in agent foundations. But it’s very rare, and it only happens when I’ve already done 90% of the work setting up the problem. Mostly they’re useful as rubber ducks or primers on existing knowledge, not as idea-generators.
Yeah, I agree with this. If you feed an LLM enough hints about the solution you believe is right, and it generates ten solutions, one of them will sound to you like the right solution.
For me, this is significantly different from the position I understood you to be taking. My push-back was essentially the same as
“has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?” (obviously yes),
& I created the question to see if we could substantiate the “yes” here with evidence.
It makes somewhat more sense to me for your timeline crux to be “can we do this reliably” as opposed to “has this literally ever happened”—but the claim in your post was quite explicit about the “this has literally never happened” version. I took your position to be that this-literally-ever-happening would be significant evidence towards it happening more reliably soon, on your model of what’s going on with LLMs, since (I took it) your current model strongly predicts that it has literally never happened.
This strong position even makes some sense to me; it isn’t totally obvious whether it has literally ever happened. The chemistry story I referenced seemed surprising to me when I heard about it, even considering selection effects on what stories would get passed around.
There is a specific type of thinking, which I tried to gesture at in my original post, which I think LLMs are literally incapable of. It’s possible to unpack the phrase “scientific insight” in more than one way, and some interpretations fall on either side of the line.

Yeah, that makes sense.
Current LLMs are capable of solving novel problems when the user does most of the work: when the user lays the groundwork and poses the right question for the LLM to answer.
So, if we can get LLMs to lay the groundwork and pose the right questions, then we’ll have autonomous scientists in whatever fields LLMs are OK at problem-solving.
This seems like something LLMs will learn to do as inference-time compute is scaled up. Reasoners benefit from coming up with sub-problems whose solutions can be built on to solve the problem posed by the user.
LLMs will learn that in order to solve difficult questions, they must pose and solve novel sub-questions.
So, once given an interesting research problem, the LLM will hum away for days doing good, often-novel work.
I think the argument you’re making is that since LLMs can make eps > 0 progress, they can repeat it N times to make unbounded progress. But this is not the structure of conceptual insight as a general rule. Concretely, it fails for the architectural reasons I explained in the original post.
Here’s an attempt at a clearer explanation of my argument:
I think the ability to autonomously find novel problems to solve will emerge as reasoning models scale up. It will emerge because it is instrumental to solving difficult problems.
Imagine an RL environment in which the LLM being trained is tasked with solving somewhat difficult open math problems (solutions verified using autonomous proof verification). It fails and fails at most of them until it learns to focus on making marginal progress: tackling simpler cases, working on tangentially-related problems, etc. These instrumental solutions are themselves often novel, meaning that the LLM will have become able to pose novel, interesting, somewhat important problems autonomously. And this will scale to something like a fully autonomous, very much superhuman researcher.
This is how it often works in humans: we work on a difficult problem and find novel results along the way. The LLM would likely be uncertain whether these results are truly novel, but this is how it works with humans too. The system can do some DeepResearch / check with relevant experts if it’s important.
Of course, I’m working from my parent-comment’s position that LLMs are in fact already capable of solving novel problems, just not posing them and doing the requisite groundwork.
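To make the kind of environment I have in mind a bit more concrete, here’s a minimal sketch of where the reward signal could come from. All names here are hypothetical, and `verify` is a stand-in for a real automated proof checker (e.g. a Lean kernel); the point is only that partial credit for verified sub-results is what would push the model toward posing its own sub-problems.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Attempt:
    problem: str         # the open problem the model was asked to solve
    subgoals: list[str]  # sub-problems the model posed for itself
    proofs: dict[str, str] = field(default_factory=dict)  # candidate proofs keyed by statement

def verify(statement: str, proof: str) -> bool:
    """Stand-in for an automated proof checker; a real setup would call out to
    something like a Lean kernel and return its verdict."""
    return random.random() < 0.05  # placeholder: most attempts fail

def reward(attempt: Attempt) -> float:
    """Full credit for solving the posed problem, partial credit for verified subgoals."""
    main = 1.0 if verify(attempt.problem, attempt.proofs.get(attempt.problem, "")) else 0.0
    partial = 0.1 * sum(
        verify(g, attempt.proofs.get(g, "")) for g in attempt.subgoals
    )
    return main + partial
```

Obviously a real environment would need a curriculum, safeguards against reward hacking, and so on; this is just the shape of the signal.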
I think the ability to autonomously find novel problems to solve will emerge as reasoning models scale up. It will emerge because it is instrumental to solving difficult problems.
This of course is not a sufficient reason. (Demonstration: telepathy will emerge [as evolution improves organisms] because it is instrumental to navigating social situations.) It being instrumental means that there is an incentive—or to be more precise, a downward slope in the loss function toward areas of model space with that property—which is one required piece, but it also must be feasible. E.g., if the parameter space doesn’t have any elements that are good at this ability, then it doesn’t matter whether there’s a downward slope.
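As an illustration of “incentive without feasibility” (my own analogy, not from the original post): gradient descent on a linear classifier for XOR has a downward slope in the loss from almost any starting point, but no point in that parameter space can represent the ability, so it never emerges no matter how long you train.

```python
import numpy as np

# XOR: the canonical task that no linear decision boundary can solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0

for _ in range(5000):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # sigmoid of a linear score
    grad = p - y                         # gradient of the cross-entropy loss
    w -= 0.1 * (X.T @ grad)              # the loss does go downhill...
    b -= 0.1 * grad.sum()

preds = (1 / (1 + np.exp(-(X @ w + b)))) > 0.5
print("accuracy:", (preds == y.astype(bool)).mean())  # ...but no linear model gets past 0.75 on XOR
```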
Fwiw I agree with this:
Current LLMs are capable of solving novel problems when the user does most of the work: when the user lays the groundwork and poses the right question for the LLM to answer.
… though like you I think posing the right question is the hard part, so imo this is not very informative.
I don’t agree with the underlying assumption then—I don’t think LLMs are capable of solving difficult novel problems, unless you include a nearly-complete solution as part of the groundwork.
If there were, we would’ve probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.
I have been seeing a bit of this, mostly uses of o1-pro and OpenAI Deep Research in chem/bio/medicine, and mostly via Twitter hype so far. But it might be the start of something.
It seems suspicious to me that this hype is coming from fields where it seems hard to verify (is the LLM actually coming up with original ideas, or is it just fusing standard procedures? Are the ideas the bottleneck, or is the experimental time the bottleneck? Are the ideas actually working, or do they just sound impressive?). And of course this is Twitter.
Why not progress on hard (or even easy but open) math problems? Are LLMs afraid of proof verifiers? On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal.
On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal.
Is there? It’s one thing to verify whether a proof is correct; whether an expression (posed by a human!) is tautologous to a different expression (also posed by a human!). But what’s the ground-truth signal for “the framework of Bayesian probability/category theory is genuinely practically useful”?
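To illustrate the contrast: a proof assistant gives a crisp machine-checkable verdict on a formal statement like the (deliberately trivial) Lean example below, but there is no analogous checker for “this framework is genuinely useful”.

```lean
-- A proof checker can verify a formal statement like this and return a clear
-- pass/fail verdict (deliberately trivial example):
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- There is no analogous machine-checkable statement for a claim like
-- "Bayesian probability is a practically useful framework".
```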
This is the reason I’m bearish on the reasoning models even for math. The realistic benefits of them seem to be:
1. Much faster feedback loops on mathematical conjectures.
2. Solving long-standing mathematical challenges such as the Riemann hypothesis or P vs. NP.
3. Mathematicians might be able to find hints of whole new math paradigms in the proofs the models generate for those long-standing challenges.
Of those:
(1) still requires mathematicians to figure out which conjectures are useful. It compresses hours, days, weeks, or months (depending on how well it scales) of a very specific and niche type of work into minutes, which is cool, but not Singularity-tier.
(2) is very speculative. It’s basically “compresses decades of work into minutes”, while the current crop of reasoning models can barely solve problems that ought to be pretty “shallow” from their perspective. Maybe Altman is right, the paradigm is in its GPT-2 stage, and we’re all about to be blown away by what they’re capable of. Or maybe it doesn’t scale past the frontier of human mathematical knowledge very well at all, and the parallels with AlphaZero are overstated. We’ll see.
(3) is dependent on (2) working out.
(The reasoning-model hype is so confusing for me. Superficially there’s a ton of potential, but I don’t think there’s been any real indication they’re up to the real challenges still ahead.)
That’s a reasonable suspicion, but as a counterpoint, there might be more low-hanging fruit in biomedicine than in math, precisely because it’s harder to test ideas in the former. Without the need for expensive experiments, math has already been driven much deeper than other fields, and therefore requires a deeper understanding to have any hope of making novel progress.
edit: Also, if I recall correctly, the average IQ of mathematicians is higher than that of biologists, which is consistent with it being harder to make progress in math.
On the other hand, frontier math (pun intended) is much more poorly funded than biomedicine, because most PhD-level math has barely any practical applications worth spending many man-hours of high-IQ mathematicians on (which often makes them switch careers). So, I would argue, if the productivity of math postdocs armed with future LLMs rises by, let’s say, an order of magnitude, they will be able to attack more laborious problems.
Not that I expect it to make much difference to the general populace, or even the scientific community at large, though.