A short essay on Illusions

“If there is a severe deficit of language,

there will be severe deficit of thought”

— Noam Chomsky

The Reasoning Promise

Beginning with OpenAI’s o1-preview model in September 2024, the pursuit of artificial reasoning took a major leap forward. This was the first AI system explicitly designed to ‘think’ before responding, using chain-of-thought reasoning, a technique first introduced by Wei et al. in 2022. DeepSeek-R1, released in January 2025, showed that reasoning capabilities in Large Language Models (LLMs) could be improved solely through reinforcement learning, and it was followed by Anthropic’s Claude 3.7 Sonnet, which presented the first hybrid reasoning architecture in February 2025. These releases laid the foundations for what is now the minimum expected of new language models and established what are known as Large Reasoning Models (LRMs): models designed to show their step-by-step thinking process.

Throughout 2025, more models appeared with this new “thinking” feature: AI giants Google and OpenAI presented Gemini 2.5 Pro and o3 respectively, while Alibaba’s QwQ-32B followed DeepSeek’s open-source approach. But this rapid progress hit a major roadblock when Apple, which had been harshly criticized for its weak showing in the AI race, questioned the very foundation of these advances. The central argument of its paper “The Illusion of Thinking,” published in June 2025, was that “despite their sophisticated self-reflection mechanisms learned through reinforcement learning, these models fail to develop generalizable problem-solving capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity.” This sparked a debate in the community and revealed a range of viewpoints on Apple’s claims, even prompting Anthropic to publish a critical response only a few days later, “The Illusion of the Illusion of Thinking.”

Having given this brief overview of one of the biggest dilemmas in the AGI promise (are machines actually capable of thinking on their own?), I thought this could be a good chance to share my own thoughts on the matter, as well as to review the arguments presented by Apple and the rebuttals they provoked.

The Illusion.

With the intent of gaining a deep understanding of the models’ reasoning capabilities, Apple devised a series of, as they described them, innovative experiments aimed at replacing traditional math tests, which are often contaminated by training data, with a set of puzzles that provided a “controlled environment” for the evaluation. These puzzles, such as the Tower of Hanoi or River Crossing, let the research team set the desired difficulty precisely and observe not only whether the models could reach the solution, but also the “thinking process” developed along the way.

Observed complexity issues—Retrieved from original paper

The research findings revealed a complete performance collapse in all the reasoning models after a certain complexity was reached. The study also identified three different “performance regimes”: on simple tasks, non-thinking LLMs outperformed their thinking counterparts; on medium-complexity tasks, thinking models started to show a clear advantage; but on high-complexity tasks, both kinds of models failed completely. An “overthinking phenomenon” was also observed, where LRMs often identified the correct solution early yet continued exploring incorrect alternatives, leading to excessive compute costs; counterintuitively, when approaching the highest complexity, models tended to “stop thinking” even though they still had the resources to keep reasoning.

The Illusion of the Illusion.

Anthropic’s response paper challenged Apple’s conclusions on three fundamental grounds. First, they argued that the observed “collapse” primarily reflected token-limit constraints (the maximum amount of text these models can output at once) rather than reasoning failures. Models actively recognized when they approached their output limits, with evidence showing responses like “The pattern continues, but to avoid making this too long, I’ll stop here.” This suggested the models understood the solution pattern but chose to truncate their output due to practical constraints, not reasoning limitations.
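To get a feel for the scale of that constraint, here is a rough back-of-the-envelope sketch in Python; the tokens-per-move cost and the output budget below are my own illustrative assumptions, not figures from either paper. The point is simply that an exhaustive Tower of Hanoi answer grows exponentially with the number of disks, so any fixed output limit is exhausted at fairly modest sizes.

```python
# Rough sketch: how many output tokens does it take to *print* the full
# move list for an N-disk Tower of Hanoi, and when does that cross a
# hypothetical output budget? Both constants are illustrative assumptions.
TOKENS_PER_MOVE = 10       # assumed average cost of writing out one move
OUTPUT_BUDGET = 64_000     # hypothetical output-token limit

for n in range(8, 16):
    moves = 2**n - 1                   # minimal number of moves for n disks
    tokens = moves * TOKENS_PER_MOVE   # approximate size of an exhaustive answer
    flag = "  <-- exceeds budget" if tokens > OUTPUT_BUDGET else ""
    print(f"N={n:2d}: {moves:6,d} moves ~ {tokens:8,d} tokens{flag}")
```

Under these assumptions the enumerated answer outgrows the budget around N = 13, regardless of whether the model actually understands the puzzle.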

Second, Anthropic identified a critical flaw in Apple’s River Crossing experiments: instances with 6 or more actors using a boat capacity of 3 were mathematically impossible to solve. By automatically scoring these impossible instances as failures, Apple inadvertently penalized models for correctly recognizing unsolvable problems.
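For anyone who wants to sanity-check that kind of impossibility claim, below is a minimal Python sketch of one way to do it: a breadth-first search over a count-based relaxation of the River Crossing puzzle, tracking only how many actors and agents are on the left bank and ignoring individual identities and in-boat interactions. Since the relaxation only adds legal states, if even this version has no path from start to goal, the original instance cannot have one either. The function names, parameters, and exact constraint encoding here are my own, not the papers’.

```python
from collections import deque
from itertools import product

def solvable(n: int, capacity: int) -> bool:
    """Can n actor/agent pairs cross under a count-based relaxation of the rules?"""
    def safe(actors: int, agents: int) -> bool:
        # Count-level shadow of "no actor may face another agent without their own":
        # a bank is fine if it holds no agents, or at least as many agents as actors.
        return agents == 0 or agents >= actors

    start, goal = (n, n, 1), (0, 0, 0)   # (actors left, agents left, boat on left?)
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        actors_l, agents_l, boat_l = state
        # People available on the bank the boat is currently on.
        src_actors = actors_l if boat_l else n - actors_l
        src_agents = agents_l if boat_l else n - agents_l
        for a, g in product(range(src_actors + 1), range(src_agents + 1)):
            if not 1 <= a + g <= capacity:       # boat carries 1..capacity people
                continue
            delta = -1 if boat_l else 1
            na, ng = actors_l + delta * a, agents_l + delta * g
            nxt = (na, ng, 1 - boat_l)
            if safe(na, ng) and safe(n - na, n - ng) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

for n in range(2, 8):
    status = "solvable" if solvable(n, capacity=3) else "unsolvable"
    print(f"{n} pairs, boat capacity 3: {status}")
```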

Solution’s listing format—Retrieved from original paper, with annotations

Third, they demonstrated that alternative evaluation approaches restored high performance on previously “failed” problems. When asked to generate an algorithmic function that finds the solution directly, instead of an exhaustive list of moves, models achieved high accuracy on Tower of Hanoi instances that Apple had reported as complete failures, while using 12 times less output capacity than the exhaustive enumeration required in the original setup.
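To illustrate why that alternative answer format is so much cheaper, here is a sketch of the kind of compact response the rebuttal describes; the function below is my own illustration, not what any particular model produced. A few lines of recursion encode the optimal strategy for any number of disks, whereas spelling the moves out explicitly takes 2^N - 1 lines.

```python
# A constant-size program that generates the full optimal Tower of Hanoi
# solution for any n, instead of enumerating all 2**n - 1 moves by hand.
def hanoi(n, source="A", target="C", spare="B"):
    """Yield the optimal move sequence for n disks as (disk, from, to) tuples."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)   # move the top n-1 disks aside
    yield (n, source, target)                        # move the largest disk
    yield from hanoi(n - 1, spare, target, source)   # restack the n-1 disks on top

# The function itself never grows, yet it enumerates the whole solution:
# 15 disks already means 32,767 individual moves.
assert sum(1 for _ in hanoi(15)) == 2**15 - 1
```

The point is not that this exact function is what the models wrote, but that grading a fixed-size program is a very different test from grading a transcript whose required length doubles with every added disk.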

Is thinking really an illusion?

These results reinforced the central question of both studies: are models actually capable of reasoning, or are they just becoming better at pattern recognition? However, the controversy revealed a more fundamental issue about how we evaluate AI reasoning capabilities. The dispute between Apple and Anthropic highlighted the challenge of distinguishing genuine reasoning failures from experimental artifacts, a critical consideration for future AI research and development.

My aim with this short essay, as I decided to call it, was to walk through the recent events and big ideas concerning the question of machine thought. As for my own take, I feel we’re still some years away from a true thinking system, even though the latest progress has been a huge step in that pursuit. I also believe a crucial next step is to create better ways of measuring a model’s “reasoning.” Apple set a clear direction here, but they may have been a little biased against their competition. So, while we’re definitely paving the way for what will become AGI, my answer right now is no: I don’t think these models can truly reason… yet.