Except that LLMs do possess at least the general knowledge that any model acquires from the texts it has read. And their CoT is already likely a way to do some reasoning. To assess whether SOTA models have actually learned to reason, it might be useful to consider benchmarks like ARC-AGI. As I detailed in my quick take, nearly every big model has been tested on that benchmark, and it looks as if there are scaling laws under which progress slows significantly once cost rises beyond a threshold.
Not sure I understand whether you agree or disagree with something. The point of the post above was that LLMs might stop showing the growth we see now (Kwa et al., ‘Measuring AI Ability to Complete Long Tasks’), not that there is no LLM reasoning at all, general or otherwise.
I guess that you should have done something like my attempt to analyse three benchmarks and their potential scaling laws. The SOTA Pareto frontier of LLMs on the ARC-AGI-1 benchmark, and the straight line observed on the ARC-AGI-2 benchmark in high-cost mode, could imply that the CoT architecture has limitations which would likely make CoT-based LLMs impractical for working with large codebases or generating novel ideas.
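To make the "straight line" claim concrete: a minimal sketch, using entirely hypothetical (cost, score) frontier points rather than the real ARC-AGI numbers, of what a log-linear scaling law looks like. If score grows linearly in log cost, each further score point requires multiplicatively more compute, which is exactly the kind of slowdown described above.

```python
# A minimal sketch, with HYPOTHETICAL (cost, score) points, of checking whether
# a benchmark's Pareto frontier follows a log-linear scaling law:
# score = a * log10(cost) + b, i.e. diminishing returns per dollar.
import math

# Hypothetical frontier points: (inference cost in $ per task, score in %).
frontier = [(0.1, 10.0), (1.0, 25.0), (10.0, 40.0), (100.0, 55.0)]

# Ordinary least-squares fit of score against log10(cost).
xs = [math.log10(cost) for cost, _ in frontier]
ys = [score for _, score in frontier]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Under such a law, each extra score point costs ~10**(1/a) times more compute.
print(f"score ≈ {a:.1f} * log10(cost) + {b:.1f}")  # → score ≈ 15.0 * log10(cost) + 25.0
```

With the made-up points above, going from 40% to 55% costs 10x as much as going from 25% to 40%, which is the pattern a straight line on a log-cost axis implies.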
I made a conjecture that this is due to problems with attention span, into which big ideas are hard to fit. Another conjecture is that researchers might find it easy to give neuralese models a big internal memory.
See also Tomas B.’s experiment asking Claude Sonnet 4.5 to write fiction, and Random Developer’s comment about Claude’s unoriginal plot and ideas.