Taylor G. Lunt comments on My AI Predictions for 2027

Taylor G. Lunt 1 Sep 2025 14:40 UTC
2 points
I explained in my post that I believe the benchmarks are mainly measuring shallow thinking. The benchmarks include things like completing a single word of code or solving arithmetic problems. These unambiguously fall within what I described as shallow thinking. They measure existing judgement/knowledge, not the ability to form new insights.

Deep thinking has not progressed hyper-exponentially. LLMs are essentially shrimp-level when it comes to deep thinking, in my opinion. LLMs still make extremely basic mistakes that a human 5-year-old would never make. This is undeniable if you actually use them for solving problems.

“One can’t simply point out the ways in which the things that LLMs cannot currently do are hard in a way in which the things that LLMs currently can do are not”

The distinction between deep and shallow thinking is real and fundamental. Deep thinking is non-polynomial in its time complexity. I’m not moving the goalposts to include only whatever LLMs happen to be bad at right now. They have always been bad at deep thinking, and continue to be. All the gains measured by the benchmarks are gains in shallow thinking.

“To be convincing, you have to make an argument that fundamentally differentiates your objection from past failed objections.”

I believe I have done so, by claiming deep thinking is of a fundamentally different nature than shallow thinking, and denying any significant progress has been made on this front.

If you disagree, fine. Like I said, I can’t prove anything, I’m just putting forward a hypothesis. But you don’t get to say I’ve been proven wrong. If you want to come up with some way of measuring deep thinking and prove LLMs are or are not good at it, go ahead. Until that work has been done, I haven’t been proven wrong, and we can’t say either way.

(Certain things are easy to measure/benchmark, and these tend to require only shallow thinking. Things that require deep thinking are hard to measure for the same reason they require deep thinking, so they don’t make it into benchmarks. The only way I know to measure deep thinking is personal judgement, which obviously isn’t convincing. But the fact that this work is hard to do doesn’t mean we just conclude that I’m wrong and you’re right.)