I think this essay leaves out an important factor. To contribute to a scientific discourse, you not only need to say something that’s correct and novel, but you also need to tackle problems that the scientific discourse finds important.
If you are working on a problem that nobody finds important, it’s a lot easier to make correct and novel findings than if you are working on a problem that an existing scientific field invests a lot into solving. As a result, I would expect it to happen frequently that someone who feels they’ve made a breakthrough has found something novel and correct, but something that interests nobody.
If I go through the lists of rejected posts, plenty of them present an idea that the author thinks is clever without trying to establish that the problem they’re solving is actually considered a problem by other people.
I like Larry McEnerney’s talks about scientific writing. Instead of asking the LLM “To what extent is this project scientifically valid?” it’s probably better to ask something like “Is this project solving problems any scientific field considers useful to solve?” Further queries: What field? Who are the experts in the field working on this problem? What would those experts say about my project? (one query per expert)
One key aspect of LLMs is that instead of mailing famous scientists with your ideas and asking them for opinions, the LLM can simulate the scientists. While that doesn’t give you perfect results, you can get a lot of straightforward objections to your project that way.
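A minimal sketch of how those queries could be chained, assuming the OpenAI Python client and a placeholder model name (any frontier chat model would do):

```python
# Hypothetical sketch: walking a project summary through the suggested queries.
# The client and model name are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROJECT_SUMMARY = "..."  # paste your project description here

QUERIES = [
    "Is this project solving problems any scientific field considers useful to solve?",
    "Which field is that, specifically?",
    "Who are the experts in that field working on this problem?",
    # Then one query per expert the previous answer names, e.g.:
    # "What would <expert name> say about my project?"
]

history = [{"role": "user", "content": PROJECT_SUMMARY}]
for query in QUERIES:
    history.append({"role": "user", "content": query})
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; substitute whichever frontier model you use
        messages=history,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(f"Q: {query}\nA: {answer}\n")
```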
Visit a frontier LLM that you haven’t talked to about this breakthrough (at present: GPT-5-Thinking, Claude Opus, Gemini-2.5-Pro).
It’s unclear to me why you don’t list Grok in there. It’s at the top of the benchmarks and it’s less focused on sugarcoating people’s feelings. Grok4 gives you two queries every two hours for free.
Instead of precommitting to how to react to any LLM answer, I would expect it’s better to engage with the actual arguments the LLM makes. If an LLM criticizes a part of a project, thinking about how to fix that aspect is a better idea than just trying to take the outside view.
If you ask such a question, asking all of GPT-5-Thinking, Claude Opus, Gemini-2.5-Pro, and Grok4 might be better than just asking one of them.
Thanks for the input!

I think this essay leaves out an important factor. To contribute to a scientific discourse, you not only need to say something that’s correct and novel, but you also need to tackle problems that the scientific discourse finds important.
I agree, but I think it’s out of scope for what I’m doing here — the validity and novelty of an attempted contribution can at least in principle be analyzed fairly objectively, but the importance seems much fuzzier and more subjective.
It’s unclear to me why you don’t list Grok in there. It’s at the top of the benchmarks and it’s less focused on sugarcoating people’s feelings.
Partly just my own lack of experience with it. I don’t put much stock in benchmarks these days because they’re gameable and often gamed, so I’m mostly relying on my own experience, and experiments with how good different models are at catching these sorts of problems. I’m actually planning to drop Gemini off the list, because I’ve been able to try the prompt on more cases over the past few years, and Gemini is too willing to drop into sycophantic mode even with the prompt.
Instead of precommitting to how to react to any LLM answer, I would expect it’s better to engage with the actual arguments the LLM makes. If an LLM criticizes a part of a project, thinking about how to fix that aspect is a better idea than just trying to take the outside view.
The trouble, in cases where someone has been fooled, is that a) they’ve already gotten feedback on the specifics; what they’re missing is analysis of the overall validity. And b) without some level of precommitment, it’s really easy to just dismiss a response if it doesn’t say what people are hoping to hear.
In a sense (although I didn’t think of this when I wrote it) it’s like the scientific method in miniature. You precommit to an experiment and decide how different experimental results will affect your hypothesis (in accordance with conservation of expected evidence), then do the experiment and update your beliefs accordingly. It’s no good if you change your mind about the meaning of the experiment after you run it :)
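(For reference, conservation of expected evidence is just the constraint that the prior must equal the probability-weighted average of the possible posteriors:)

$$P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E)$$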
I agree, but I think it’s out of scope for what I’m doing here — the validity and novelty of an attempted contribution can at least in principle be analyzed fairly objectively, but the importance seems much fuzzier and more subjective.
The idea of seeking objectivity here is not helpful if you want to contribute to the scientific project. I think that Larry McEnerney is good at explaining why that’s the case, but you can also read plenty of Philosophy and History of Science on why that is.
If you want to contribute to the scientific project, thinking about how what you’re doing relates to the scientific project is essential.
I’m not sure what you mean by “validity” and whether it’s a sensible thing to talk about. If you try to optimize for some notion of validity instead of optimizing for doing something that’s valuable to scientists, you’re doing something like trying to guess the teacher’s password. You are optimizing for form instead of optimizing for actually creating something valuable.
If you innovate in the method you are using in a way that violates some idea of conventional “validity” but you are providing value, you are doing well. Against Method wasn’t accidentally chosen as a title. When Feynman was doing his drawings, the first reaction of his fellow scientists was that they weren’t “real science”. He ended up getting his Nobel Prize for them.
As far as novelty goes, the query you are proposing isn’t really a good way to determine novelty. A better way to check novelty is not to ask “Is this novel?” but “Is there prior art here?” Today, a good way to check that is to run deep research reports. If your deep research request comes back with “I didn’t find anything”, that’s a better signal for novelty than a question about whether something is novel being answered with “yes”. LLMs don’t like to answer “I didn’t find anything” when you let them run a deep research request; they are much more willing to say something is novel when you simply ask them whether it’s novel.
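A rough sketch of the difference between the two framings, again assuming the OpenAI Python client and a placeholder model name; a real deep-research product would do the actual literature search, but the prompt shape is the point here:

```python
# Hypothetical sketch contrasting a "is it novel?" prompt with a prior-art prompt.
from openai import OpenAI

client = OpenAI()

PROJECT_SUMMARY = "..."  # your claimed result, stated as specifically as possible

# Framing that invites a cheap "yes":
novelty_prompt = f"Is the following result novel?\n\n{PROJECT_SUMMARY}"

# Framing that makes absence of prior art the informative answer:
prior_art_prompt = (
    "List concrete prior art (papers, textbooks, well-known results) that already "
    "covers the following result. If you cannot point to specific prior work, "
    "say exactly: 'I didn't find anything.'\n\n" + PROJECT_SUMMARY
)

for prompt in (novelty_prompt, prior_art_prompt):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; a deep-research-capable model is preferable
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content, "\n---")
```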
It’s no good if you change your mind about the meaning of the experiment after you run it :)
Actually, a lot of scientific progress happens that way. You run experiments that have results that surprise you. You think about how to explain the results you got, and that brings you a better understanding of the problem domain you are interacting with.
If you want to create something intellectually valuable, you need to go through the intellectual work of engaging with counterarguments to what you are doing. If an LLM provides a criticism of your work, that criticism might or might not be valid. If what you are doing is highly complex, the LLM might not understand what you are doing, and that doesn’t mean that your idea is doomed. Maybe you can flesh out your idea more clearly. Even if you can’t, if the idea provides value, it’s still a good idea.
Thanks! I agree with most of what you’re saying to one extent or another, but relative to the fairly narrow thing I’m trying to do, I still maintain it’s out of scope.
It seems possible that we’re imagining very different typical readers. When I look at rejected LW posts that were co-written with LLMs, or posts on r/LLMPhysics, I see problems like values of totally different units being added together (‘to the current level of meaningness we add the number of seconds since the Big Bang’). While it’s difficult to settle on a fully satisfying notion of validity, I think most people who have done any work in the sciences are likely to agree that something like that is invalid. My main goal here is to provide a first-pass way of helping people identify whether they’re doing something that just doesn’t qualify as science under any reasonable notion of that. The idea of discouraging a future Feynman is horrifying, but my experience has been that with my suggested prompt, LLMs still do their best to give projects the benefit of the doubt.
Similarly, while my step 2 uses a simplified and limited sense of the scientific method, I think it’s really important that people who feel they’ve made a breakthrough should be thinking hard about whether their ideas are able to make falsifiable predictions that existing theories don’t. While there may be some cases around the edges where that’s not exactly true — e.g. as Charlie Steiner suggests, developing a simpler theory that makes the same predictions — the author ought to have at least given the issue serious consideration, whereas in many of the instances I’ve seen that’s not the case.
I do strongly encourage people to write better posts on this topic and/or better prompts, and I’ll gladly replace this post with a pointer to those when they exist. But currently there’s nothing (that I could find), and researchers are flooded with claimed breakthroughs, and so this is my time-bounded effort to improve on the situation as it stood.
LLMs still do their best to give projects the benefit of the doubt.
There’s a saying that the key to a successful startup is to find an idea that looks stupid but isn’t. A startup is successful when it pursues a path that other people decline to pursue but that’s valuable.
In many cases it’s probably the same for scientific breakthroughs. The ideas behind them are not pursued because the experts in the field believe that the ideas are not promising on the surface.
A lot of the posts you find on r/LLMPhysics and among the rejected LW posts sound smart on the surface to some lay people (the person interacting with the LLM) but don’t work. LLMs might give the benefit of the doubt to the kind of idea that sounds smart to lay people on the surface, while giving no benefit of the doubt to the kind of idea that sounds stupid to everyone on a surface evaluation.
I think it’s really important that people who feel they’ve made a breakthrough should be thinking hard about whether their ideas are able to make falsifiable predictions that existing theories don’t.
You can get a PhD in theoretical physics without developing ideas that allow you to make falsifiable predictions.
Making falsifiable predictions is one way to create value for other scientists, but it’s not the only one. Larry brings up the example of “There are 20 people in this classroom” as a theory that can be novel (nobody in the literature said anything about the number of people in this classroom) and makes falsifiable predictions (everyone who counts will count 20 people) but is completely worthless.
Your standard has both the problem that people to whom the physics community gives PhDs don’t meet it, and the problem that plenty of work that does meet it is worthless.
I think the general principle should be that before you try to contact a researcher with your claimed breakthrough, you should let an LLM simulate that researcher’s answer beforehand and iterate based on the objections the LLM predicts the researcher would make.
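A minimal sketch of that loop, assuming the OpenAI Python client and a placeholder model name; the researcher’s name and the revision step are things you supply yourself:

```python
# Hypothetical sketch: get a simulated reviewer's objections before contacting anyone.
from openai import OpenAI

client = OpenAI()

RESEARCHER = "a named expert in the relevant field"  # fill in the actual person
draft = "..."  # your current write-up

prompt = (
    f"Act as {RESEARCHER} reviewing the following write-up. "
    "List the objections they would most likely raise, strongest first.\n\n" + draft
)
review = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whichever frontier model you use
    messages=[{"role": "user", "content": prompt}],
)
print(review.choices[0].message.content)
# Revise the draft to address the objections, then rerun with the new version;
# repeat until the simulated objections stop surprising you.
```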