I had somehow missed your linked post (≤10-year Timelines Remain Unlikely Despite DeepSeek and o3) when you posted it a few months ago. It’s great!
There’s too much to cover here; it touches on a lot of important issues. I think you’re pointing to real gaps that probably will slow things down somewhat: “thought assessment,” which has also been called taste or evaluation, and adequate skill with sequential thinking, or System 2 thinking.
Unfortunately, the better term for that is Type 2 thinking, because it’s not a separate system. Similarly, I think assessment also doesn’t require a separate system, just a different use of the same one. For complex reasons centering on that, I’m afraid there are shortcuts that might work all too well. The recent successes of scaffolding for that type of thinking suggest that those gaps might be filled all too easily, without breakthroughs.
Here are some concerning examples of progress along those lines that required no breakthroughs or even new model training, just careful scaffolding of the sort I envisioned in 2023 but that hasn’t really been seen so far outside of these focused use cases. Perplexity nearly matched OpenAI’s o3-powered Deep Research with two weeks of work scaffolding a lesser model. More elaborately and impressively, Google’s AI co-scientist also used a last-gen model to match cutting-edge medical research in hypothesis generation and pruning (summarized near the start of this excellent podcast). This addresses your “assessor” function. LLMs are indeed bad at it, until they’re scaffolded with structured prompts to think carefully, iteratively, and critically; then, apparently, they can tell good ideas from bad. But they still might have a hard time matching human performance in this area; that’s one thing giving me hope for longer timelines.
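To make concrete what I mean by “scaffolding the assessor,” here’s a minimal sketch of a generate-and-critique loop of this kind. It is not Perplexity’s or Google’s actual pipeline; `call_llm`, `propose`, and `assess` are hypothetical stand-ins for whatever model API and prompts you’d plug in.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; swap in your API of choice."""
    raise NotImplementedError

def propose(topic: str, n: int = 5) -> list[str]:
    # Generator role: ask the model for candidate hypotheses, one per line.
    out = call_llm(f"Propose {n} distinct research hypotheses about: {topic}")
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]

def assess(hypothesis: str) -> float:
    # Assessor role: the *same* model, prompted to reason through explicit
    # criteria before committing to a score. This structured critique step
    # is the part that seems to matter.
    out = call_llm(
        "Evaluate this hypothesis for novelty, plausibility, and testability.\n"
        "Write one sentence per criterion, then a final line with a 0-10 score.\n"
        f"Hypothesis: {hypothesis}"
    )
    return float(out.strip().splitlines()[-1].split()[-1])  # crude score parse

def refine(topic: str, rounds: int = 3, keep: int = 2) -> list[str]:
    # Iterate: critique prunes the pool, survivors seed another generation pass.
    pool = propose(topic)
    for _ in range(rounds):
        pool.sort(key=assess, reverse=True)
        pool = pool[:keep] + propose(topic)
    return sorted(pool, key=assess, reverse=True)[:keep]
```

The point is just that the “taste” lives in the critique prompt and the iteration, not in any new capability of the underlying model.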
I have more thoughts on that excellent article. Thinking further about the gaps you focus on has expanded my timelines to almost ten years, still on the short end of yours. Talking to identify more detailed cruxes might be worthwhile!