if an LLM could evaluate whether an idea were good or not in new domains, then we could have LLMs generate millions of random policy ideas in response to climate change, pandemic control, AI safety, etc., and deliver the select best few to our inbox every morning.
seems to me that the bottleneck then is the LLM’s judgment of good ideas in new domains. is that right? the ability to generate high-quality ideas consistently wouldn’t matter, because it’s so cheap to generate ideas now.
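a minimal sketch of the pipeline being imagined here, with both LLM calls stubbed out as hypothetical placeholders (the function names and scoring are assumptions, not a real API); the point is that everything hinges on `judge_idea`:

```python
import random

def generate_idea(domain: str) -> str:
    """Cheaply sample one candidate policy idea (stub for an LLM call)."""
    return f"policy idea #{random.randint(0, 10**6)} for {domain}"

def judge_idea(idea: str) -> float:
    """Score an idea's promise on 0..1 (stub for an LLM evaluator).
    This is the contested step: the scheme only works if this is reliable."""
    return random.random()

def morning_inbox(domain: str, n_candidates: int = 1000, top_k: int = 3) -> list[str]:
    """Generate cheaply, judge every candidate, deliver the best few."""
    candidates = [generate_idea(domain) for _ in range(n_candidates)]
    return sorted(candidates, key=judge_idea, reverse=True)[:top_k]

print(morning_inbox("pandemic control"))
```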
coming up with good ideas is very difficult as well
(and it requires good judgment, too)
even if you’re mediocre at coming up with ideas, as long as it’s cheap and you can come up with thousands, one of them is bound to be promising. The question of whether you as an LLM can find a good idea is not whether most of your ideas are good, but whether you can find the one good idea in a stack of 1000.
“Thousands” is probably not enough.
Imagine trying to generate a poem by having one algorithm create thousands of random combinations of words, and another algorithm choose the most poetic among them. No matter how good the second algorithm is, it seems quite likely that the first one simply didn’t generate anything valuable.
As the hypothesis gets more complex, the number of options grows exponentially. Imagine a pattern such as “what if X increases/decreases Y by mechanism Z”. If you propose 10 different values for each of X, Y, Z, you already have 1000 hypotheses.
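A quick sketch of that arithmetic (toy values, nothing domain-specific): each extra slot in the template multiplies the hypothesis count.

```python
from itertools import product

# Toy illustration of the blow-up: 10 candidate values per slot in the
# template "X increases Y by mechanism Z" already yields 10^3 hypotheses.
xs = [f"X{i}" for i in range(10)]  # candidate causes
ys = [f"Y{i}" for i in range(10)]  # candidate effects
zs = [f"Z{i}" for i in range(10)]  # candidate mechanisms

hypotheses = [f"{x} increases {y} by {z}" for x, y, z in product(xs, ys, zs)]
print(len(hypotheses))  # 1000; every extra slot multiplies the count by 10
```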
I can imagine finding some low-hanging fruit if we increase the number of hypotheses to millions. But even there, we will probably be limited by a lack of experimental data. (Could a diet consisting only of broccoli and peanut butter cure cancer? Maybe, but how is the LLM supposed to find out?) So we would need to find a hypothesis where we happen to have already run all the necessary experiments and even described the intermediate findings (because LLMs are good at words, but probably suck at analyzing primary data), yet somehow failed to connect the dots. Not impossible, but it requires a lot of luck.
To get further, we need some new insight. Maybe collecting tons of data in a relatively uniform format, and teaching the LLM to translate its hypotheses into SQL queries it could then verify automatically.
(Even with hypothetical ubiquitous surveillance, you would probably need an extra step where the raw video recordings are transcribed to textual/numeric data, so that you could run queries on them later.)
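To make the SQL idea concrete, here is a minimal sketch; the schema, table, and data are invented for illustration, and the hand-written query stands in for what the LLM would generate:

```python
import sqlite3

# Invented schema: observations collected in one uniform table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (subject TEXT, variable TEXT, value REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [("s1", "broccoli_intake", 3.0), ("s1", "tumor_marker", 0.4),
     ("s2", "broccoli_intake", 0.0), ("s2", "tumor_marker", 0.9)],
)

# In the imagined system, the LLM would emit a query like this to check
# one of its hypotheses against the data; here it is hand-written.
hypothesis_query = """
    SELECT a.value AS intake, b.value AS marker
    FROM observations a JOIN observations b ON a.subject = b.subject
    WHERE a.variable = 'broccoli_intake' AND b.variable = 'tumor_marker'
"""
for intake, marker in conn.execute(hypothesis_query):
    print(intake, marker)  # downstream code would test for correlation
```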
“So we would need to find a hypothesis where we happen to have already run all the necessary experiments and even described the intermediate findings (because LLMs are good at words, but probably suck at analyzing primary data), yet somehow failed to connect the dots. Not impossible, but it requires a lot of luck.”
Exactly: untested hypotheses that LLMs already have enough data to test. I wonder how rare such hypotheses are.
It strikes me as wild that LLMs have ingested enormous swathes of the internet, across thousands of domains, and haven’t yet produced genius connections between those domains (e.g. between psychoanalysis and tree root growth). Cross-domain analogies seem like just one example of a ripe category of hypotheses that could be tested with existing LLM knowledge.
Re poetry: I actually wonder if thousands of random phrase combinations might be enough for a tactful amalgamator to weave a good poem.
And LLMs do better than random. They aren’t trained well on scientific creativity (interesting hypothesis formation), but they do learn some notion of “good idea,” and reasoners tend to do even better at generating smart novelty when prompted well.
for ideas which are “big enough”, this is just false, right? for example, so far, no LLM has generated a proof of an interesting conjecture in math
i’m not sure. the question would be, if an LLM comes up with 1000 approaches to an interesting math conjecture, how would we find out if one approach were promising?
one out of the 1000 random ideas would need to be promising, but just as importantly, an LLM would need to be able to surface the promising one.
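one way the surfacing step could work, as a rough sketch: a single-elimination tournament needs only ~1000 pairwise judgments to pull one winner out of 1000. the judge below is a stub (a coin flip); whether a real LLM judge beats a coin flip on novel math is exactly the open question.

```python
import random

def judge_pair(a: str, b: str) -> str:
    """Hypothetical LLM judge: pick the more promising of two approaches.
    Stubbed as a coin flip; reliability of this call is the open question."""
    return a if random.random() < 0.5 else b

def surface_best(approaches: list[str]) -> str:
    """Single-elimination tournament: ~N pairwise comparisons to pull one
    winner out of N candidates, instead of scoring each in isolation."""
    pool = list(approaches)
    while len(pool) > 1:
        winners = [judge_pair(pool[i], pool[i + 1])
                   for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:          # odd one out gets a bye to the next round
            winners.append(pool[-1])
        pool = winners
    return pool[0]

print(surface_best([f"approach {i} to the conjecture" for i in range(1000)]))
```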
which seems the more likely bottleneck?