Do reasoning models really “think” or not? Apple research sparks lively debate, response


Apple’s machine-learning group set off a rhetorical firestorm earlier this month with its release of “The Illusion of Thinking,” a 53-page research paper arguing that so-called large reasoning models (LRMs) or reasoning large language models (reasoning LLMs) such as OpenAI’s “o” series and Google’s Gemini-2.5 Pro and Flash Thinking don’t actually engage in independent “thinking” or “reasoning” from generalized first principles learned from their training data.

Instead, the authors contend, these reasoning LLMs are performing a kind of “pattern matching,” and their apparent reasoning ability falls apart once a task becomes too complex. That, they suggest, means this architecture is not a viable path to improving generative AI to the point of artificial general intelligence (AGI), which OpenAI defines as a model that outperforms humans at most economically valuable work, or superintelligence, AI smarter than human beings can comprehend.
Unsurprisingly, the paper immediately circulated widely among the machine learning community on X, and many readers’ initial reaction was to declare that Apple had effectively disproven much of the hype around this class of AI: “Apple just proved AI ‘reasoning’ models like Claude, DeepSeek-R1, and o3-mini don’t actually reason at all,” declared Ruben Hassid, creator of EasyGen, an LLM-driven tool for auto-writing LinkedIn posts. “They just memorize patterns really well.”

Now a new paper has emerged, cheekily titled “The Illusion of The Illusion of Thinking” and, notably, co-authored by a reasoning LLM itself, Anthropic’s Claude Opus 4, alongside Alex Lawsen, a human independent AI researcher and technical writer. It gathers many of the criticisms the larger ML community has leveled at the original paper and argues that the methodologies and experimental designs the Apple research team used in their initial work are fundamentally flawed.

While we here at VentureBeat are not ML researchers ourselves and are not prepared to say the Apple researchers are wrong, the debate has certainly been a lively one, and the question of how the capabilities of LRMs, or reasoner LLMs, compare to human thinking seems far from settled.

How the Apple Research study was designed — and what it found

Using four classic planning problems — Tower of Hanoi, Blocks World, River Crossing and Checker Jumping — Apple’s researchers designed a battery of tasks that forced reasoning models to plan multiple moves ahead and generate complete solutions.

These games were chosen for their long history in cognitive science and AI research and their ability to scale in complexity as more steps or constraints are added. Each puzzle required the models to not just produce a correct final answer, but to explain their thinking along the way using chain-of-thought prompting.

As the puzzles increased in difficulty, the researchers observed a consistent drop in accuracy across multiple leading reasoning models. In the most complex tasks, performance plunged to zero. Notably, the length of the models’ internal reasoning traces—measured by the number of tokens spent thinking through the problem—also began to shrink. Apple’s researchers interpreted this as a sign that the models were abandoning problem-solving altogether once the tasks became too hard, essentially “giving up.”

The timing of the paper’s release, just ahead of Apple’s annual Worldwide Developers Conference (WWDC), added to the impact. It quickly went viral across X, where many interpreted the findings as a high-profile admission that current-generation LLMs are still glorified autocomplete engines, not general-purpose thinkers. This framing, while controversial, drove much of the initial discussion and debate that followed.

Critics take aim on X

Among the most vocal critics of the Apple paper was ML researcher and X user @scaling01 (aka “Lisan al Gaib”), who posted multiple threads dissecting the methodology.

In one widely shared post, Lisan argued that the Apple team conflated token budget failures with reasoning failures, noting that “all models will have 0 accuracy with more than 13 disks simply because they cannot output that much!”

For puzzles like Tower of Hanoi, he emphasized, the output size grows exponentially while LLM context windows remain fixed: “just because Tower of Hanoi requires exponentially more steps than the other ones, that only require quadratically or linearly more steps, doesn’t mean Tower of Hanoi is more difficult,” he wrote. He also showed convincingly that models like Claude 3 Sonnet and DeepSeek-R1 often produced algorithmically correct strategies in plain text or code, yet were still marked wrong.

Another post highlighted that even breaking the task down into smaller, decomposed steps worsened model performance—not because the models failed to understand, but because they lacked memory of previous moves and strategy.

“The LLM needs the history and a grand strategy,” he wrote, suggesting the real problem was context-window size rather than reasoning.

I raised another important caveat myself on X: Apple never benchmarked model performance against human performance on the same tasks. “Am I missing it, or did you not compare LRMs to human perf[ormance] on [the] same tasks?? If not, how do you know this same drop-off in perf doesn’t happen to people, too?” I asked the researchers directly in a thread tagging the paper’s authors. I also emailed them about this and many other questions, but they have yet to respond.

Others echoed that sentiment, noting that human problem solvers also falter on long, multistep logic puzzles, especially without pen-and-paper tools or memory aids. Without that baseline, Apple’s claim of a fundamental “reasoning collapse” feels ungrounded.

Several researchers also questioned the binary framing of the paper’s title and thesis—drawing a hard line between “pattern matching” and “reasoning.”

Alexander Doria, aka Pierre-Carl Langlais, an LLM trainer at the energy-efficient French AI startup Pleias, said the framing misses the nuance, arguing that models might be learning partial heuristics rather than simply matching patterns.

Ethan Mollick, an AI-focused professor at the University of Pennsylvania’s Wharton School of Business, called the idea that LLMs are “hitting a wall” premature, likening it to similar claims about “model collapse” that didn’t pan out.

Meanwhile, critics like @arithmoquine were more cynical, suggesting that Apple, behind the curve on LLMs compared to rivals like OpenAI and Google, might be trying to lower expectations by coming up with research on “how it’s all fake and gay and doesn’t matter anyway,” as they quipped, pointing to Apple’s reputation for now poorly performing AI products like Siri.

In short, while Apple’s study triggered a meaningful conversation about evaluation rigor, it also exposed a deep rift over how much trust to place in metrics when the test itself might be flawed.

A measurement artifact, or a ceiling?

On that reading, the models may have understood the puzzles but simply ran out of “paper” on which to write out the full solution.

“Token limits, not logic, froze the models,” wrote Carnegie Mellon researcher Rohan Paul in a widely shared thread summarizing the follow-up tests.

Yet not everyone is ready to clear LRMs of the charge. Some observers point out that Apple’s study still revealed three performance regimes — simple tasks where added reasoning hurts, mid-range puzzles where it helps, and high-complexity cases where both standard and “thinking” models crater.

Others view the debate as corporate positioning, noting that Apple’s own on-device “Apple Intelligence” models trail rivals on many public leaderboards.

The rebuttal: “The Illusion of the Illusion of Thinking”

In response to Apple’s claims, a new paper titled “The Illusion of the Illusion of Thinking” was released on arXiv by independent researcher and technical writer Alex Lawsen of the nonprofit Open Philanthropy, in collaboration with Anthropic’s Claude Opus 4.

The paper directly challenges the original study’s conclusion that LLMs fail due to an inherent inability to reason at scale. Instead, the rebuttal presents evidence that the observed performance collapse was largely a by-product of the test setup—not a true limit of reasoning capability.

Lawsen and Claude demonstrate that many of the failures in the Apple study stem from token limitations. For example, in tasks like Tower of Hanoi, the models must print exponentially many steps — over 32,000 moves for just 15 disks — leading them to hit output ceilings.
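To put rough numbers on that ceiling, here is a minimal back-of-the-envelope sketch in Python. The per-move token cost and the output budget below are our own illustrative assumptions, not figures from either paper; the point is only that enumerating 2^n − 1 moves outruns any fixed output limit very quickly.

```python
# Back-of-the-envelope: Tower of Hanoi needs 2^n - 1 moves for n disks.
# TOKENS_PER_MOVE and OUTPUT_BUDGET are illustrative assumptions, not
# figures taken from the Apple paper or the rebuttal.

TOKENS_PER_MOVE = 10      # assumed average cost of printing one move
OUTPUT_BUDGET = 64_000    # assumed output-token ceiling for the model

for disks in (5, 10, 12, 15, 20):
    moves = 2 ** disks - 1
    est_tokens = moves * TOKENS_PER_MOVE
    verdict = "fits" if est_tokens <= OUTPUT_BUDGET else "overflows"
    print(f"{disks:>2} disks: {moves:>9,} moves ~ {est_tokens:>10,} tokens -> {verdict}")
```

Under these assumptions, the 15-disk case (32,767 moves) already demands several hundred thousand output tokens, well past what any current model is allowed to emit in one response.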

The rebuttal points out that Apple’s evaluation script penalized these token-overflow outputs as incorrect, even when the models followed a correct solution strategy internally.

The authors also highlight several questionable task constructions in the Apple benchmarks. Some of the River Crossing puzzles, they note, are mathematically unsolvable as posed, and yet model outputs for these cases were still scored. This further calls into question the conclusion that accuracy failures represent cognitive limits rather than structural flaws in the experiments.

To test their theory, Lawsen and Claude ran new experiments allowing models to give compressed, programmatic answers. When asked to output a Lua function that could generate the Tower of Hanoi solution—rather than writing every step line-by-line—models suddenly succeeded on far more complex problems. This shift in format eliminated the collapse entirely, suggesting that the models didn’t fail to reason. They simply failed to conform to an artificial and overly strict rubric.
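For illustration, a compressed answer of the kind the rebuttal solicited might look like the sketch below. The paper asked models for a Lua function; this Python equivalent is ours, shown only to convey the idea that a few lines of code can encode the entire exponential move sequence without printing it step by step.

```python
# A compact, programmatic Tower of Hanoi solution, in the spirit of the
# rebuttal's "output a function" format (the paper used Lua; this Python
# version is an illustrative stand-in, not the authors' code).

def hanoi(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) moves that solve n-disk Tower of Hanoi."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)
    yield (n, source, target)
    yield from hanoi(n - 1, spare, target, source)

# The grader can expand and verify the moves itself rather than forcing the
# model to enumerate all 2^n - 1 of them in its own output.
moves = list(hanoi(15))
print(len(moves))  # 32767
```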

Why it matters for enterprise decision-makers

The back-and-forth underscores a growing consensus: evaluation design is now as important as model design.

Requiring LRMs to enumerate every step may test their printers more than their planners, while compressed formats, programmatic answers or external scratchpads give a cleaner read on actual reasoning ability.

The episode also highlights practical limits developers face as they ship agentic systems—context windows, output budgets and task formulation can make or break user-visible performance.

For enterprise technical decision makers building applications atop reasoning LLMs, this debate is more than academic. It raises critical questions about where, when, and how to trust these models in production workflows—especially when tasks involve long planning chains or require precise step-by-step output.

If a model appears to “fail” on a complex prompt, the problem may not lie in its reasoning ability, but in how the task is framed, how much output is required, or how much memory the model has access to. This is particularly relevant for industries building tools like copilots, autonomous agents, or decision-support systems, where both interpretability and task complexity can be high.

Understanding the constraints of context windows, token budgets, and the scoring rubrics used in evaluation is essential for reliable system design. Developers may need to consider hybrid solutions that externalize memory, chunk reasoning steps, or use compressed outputs like functions or code instead of full verbal explanations.
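As a sketch of what “externalizing memory” and chunking might look like in practice, the hypothetical loop below keeps the move history on the application side and asks the model for one bounded step at a time. The `ask_model` function is a placeholder for whatever LLM client a team actually uses, not a real API, and the puzzle-specific state update is omitted.

```python
# Hypothetical pattern: the application, not the model, holds the history,
# so each model call produces only a small, bounded output.

def ask_model(prompt: str) -> str:
    # Placeholder: wire this to your LLM client of choice.
    raise NotImplementedError

def solve_with_external_memory(puzzle_state: str, max_steps: int = 200) -> list[str]:
    history: list[str] = []               # memory lives outside the model
    for _ in range(max_steps):
        prompt = (
            f"Puzzle state:\n{puzzle_state}\n"
            f"Moves so far:\n{chr(10).join(history) or '(none)'}\n"
            "Reply with the single next move, or DONE if solved."
        )
        move = ask_model(prompt).strip()  # one short answer per call
        if move == "DONE":
            break
        history.append(move)
        # Applying `move` to puzzle_state is puzzle-specific and omitted here.
    return history
```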

Most importantly, the paper’s controversy is a reminder that benchmarking and real-world application are not the same. Enterprise teams should be cautious of over-relying on synthetic benchmarks that don’t reflect practical use cases—or that inadvertently constrain the model’s ability to demonstrate what it knows.

Ultimately, the big takeaway for ML researchers is that before proclaiming an AI milestone—or obituary—make sure the test itself isn’t putting the system in a box too small to think inside.


