I think you should have done something like my attempt to analyse three benchmarks and their potential scaling laws. The shape of the SOTA Pareto frontier of LLMs on the ARC-AGI-1 benchmark, together with the straight line that high-cost runs trace on the ARC-AGI-2 benchmark, could imply that the CoT architecture has inherent limitations, which would make using CoT-based LLMs to work with large codebases or to generate novel ideas likely impractical.
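To make the "straight line in high-cost mode" observation concrete: the pattern is that score grows roughly linearly in log(cost), so each extra point of accuracy costs about an order of magnitude more compute. Below is a minimal sketch of that fit, using made-up (cost, score) points rather than real ARC-AGI-2 numbers:

```python
import math

# Hypothetical (cost per task in $, score in %) points on a Pareto frontier.
# These numbers are illustrative only, not actual ARC-AGI-2 results.
points = [(0.1, 2.0), (1.0, 4.5), (10.0, 7.0), (100.0, 9.5)]

# Least-squares fit of: score = a * log10(cost) + b
xs = [math.log10(cost) for cost, _ in points]
ys = [score for _, score in points]
n = len(points)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b = mean_y - a * mean_x

print(f"score ~= {a:.2f} * log10(cost) + {b:.2f}")
# If the fit is good, going from 7% to 9.5% requires moving from $10 to
# $100 per task; straight-line scaling in log-cost means exponentially
# growing spend for linear gains, which is the worrying interpretation.
```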
My conjecture is that this stems from the limits of the attention span, into which large ideas are hard to fit. A second conjecture is that researchers may find it easy to give neuralese models a big internal memory, which would sidestep that bottleneck.
See also Tomas B.’s experiment asking Claude Sonnet 4.5 to write fiction, and Random Developer’s comment on the unoriginality of Claude’s plots and ideas.