Another quick way to get at my skepticism about LLM capability: the ability to differentiate good from bad ideas. IME the only way LLMs can do this (or at least seem to do this) is when the prevailing views differ depending on context: if there's an answer popular among lay people and a better answer popular among academics, and your tone makes it clear that you belong in the second category, then it will give you the second one, and that can make it seem like it can tell good and bad takes apart. But this clearly doesn't count. If the most popular view is the same across contexts, it'll echo that view no matter how dumb it is. I'd guess most people here have rolled their eyes before at GPT regurgitating bad popular takes.
If there's an update to LLMs where you can ask it a question and it'll just outright tell you that the prevailing view on this is dumb (and it is in fact dumb), then I shall be properly terrified. Of course this is another non-objective benchmark, but as I said in my post, I think being non-objective may just be a property that all the actually important milestones have.
I had somehow missed your linked post (≤10-year Timelines Remain Unlikely Despite DeepSeek and o3) when you posted it a few months ago. It’s great!
There's too much to cover here; it touches on a lot of important issues. I think you're pointing to real gaps that probably will slow things down somewhat: those are "thought assessment," which has also been called taste or evaluation, and having adequate skill with sequential thinking or System 2 thinking.
Unfortunately, the better term for that is Type 2 thinking, because it’s not a separate system. Similarly, I think assessment also doesn’t require a separate system, just a different use of the same system. For complex reasons centering on that, I’m afraid there are shortcuts that might work all too well. The recent successes of scaffolding for that type of thinking indicate that those gaps might be filled all too easily without breakthroughs.
Here are the concerning examples of progress along those lines with no breakthroughs or even new model training, just careful scaffolding of the sort I envisioned in 2023, which hasn't really been seen so far outside of these focused use cases. Perplexity nearly matched OpenAI's o3-powered Deep Research with two weeks of work scaffolding a lesser model. More elaborately and impressively, Google's AI co-scientist also used a last-gen model to match cutting-edge medical researchers at hypothesis generation and pruning (summarized near the start of this excellent podcast). This addresses your "assessor" function. LLMs are indeed bad at it, until they're scaffolded with structured prompts to think carefully, iteratively, and critically, and then apparently they can tell good ideas from bad. But they still might have a hard time matching human performance in this area; this is one thing that's giving me hope for longer timelines.
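To make the scaffolding idea concrete, here is a minimal sketch of the sort of assessor loop I have in mind: one model proposes candidate answers, plays its own harsh reviewer, and keeps whatever survives. The llm() wrapper, the prompts, and the 0-10 scoring are hypothetical placeholders for illustration, not the actual Perplexity or AI co-scientist pipelines.

```python
# A hypothetical "assessor" scaffold: the same model proposes ideas, then
# critiques and scores each one in separate passes, and the best survivor wins.
# llm() is a stand-in for whatever chat-completion client you use.

def llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; plug in your model client here."""
    raise NotImplementedError

def propose_ideas(question: str, n: int = 5) -> list[str]:
    """Ask the model for n distinct candidate answers, one per line."""
    raw = llm(f"List {n} distinct candidate answers to:\n{question}\nOne per line.")
    return [line.strip() for line in raw.splitlines() if line.strip()]

def critique(question: str, idea: str) -> str:
    """Elicit the strongest objection to a candidate, popular or not."""
    return llm(
        "Act as a harsh reviewer. State the strongest objection to this answer, "
        f"even if the answer is the popular view.\nQuestion: {question}\nAnswer: {idea}"
    )

def score(question: str, idea: str, objection: str) -> float:
    """Ask the model how well the candidate survives the objection (0-10)."""
    raw = llm(
        "On a scale of 0 to 10, how well does the answer survive the objection? "
        f"Reply with a number only.\nQuestion: {question}\nAnswer: {idea}\n"
        f"Objection: {objection}"
    )
    try:
        return float(raw.strip())
    except ValueError:
        return 0.0

def assess(question: str) -> str:
    """Propose, critique, and score candidates; return the highest-scoring one."""
    ideas = propose_ideas(question)
    scored = [(score(question, idea, critique(question, idea)), idea) for idea in ideas]
    return max(scored, key=lambda pair: pair[0])[1]
```

A fuller version would iterate, feeding the critiques back into another proposal round rather than stopping after one pass; the point is just that the "assessment" step is the same model used differently, not a separate system.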
I have more thoughts on that excellent article. Thinking more about the gaps you focus on has pushed my timelines out to almost ten years, still on the short end of yours. Talking to identify more detailed cruxes might be worthwhile!
A related take I heard a while ago: LLMs have strongly superhuman declarative knowledge across countless subject areas. Any human with that much knowledge would be able to come up with many new theories or insights by combining knowledge from different fields. But LLMs apparently can't do this. They don't seem to synthesize, integrate, and systematize their knowledge much.
Though maybe they have some latent ability to do this, and they only need some special sort of fine-tuning to unlock it, similar to how reasoning training seems to elicit abilities the base models already have.