Apple: AI Reasoning Has Limits
Source: itpro.com
Apple has indicated that AI reasoning models run into hard limits when solving complex problems, challenging the argument that they are suited to tasks typically done by humans.
Reasoning models are able to tackle more complex problems than standard LLMs because they break them down into smaller problems and solve them individually. Major providers like OpenAI, Anthropic, and Google have promoted the benefits of reasoning models, presenting them as important for enterprise AI.
The paper, titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, specifically mentions OpenAI’s o1 and o3 models, DeepSeek R1, Anthropic’s Claude 3.7 Sonnet, and Google’s latest version of Gemini. The study revealed that while these models handled low-complexity tasks efficiently, beyond a certain complexity level they took longer to respond, wasted tokens, and gave incorrect answers.
Flawed Benchmarks
Apple’s research team stated that the current benchmarks used to evaluate large reasoning models (LRMs) are flawed because they focus on coding and mathematical problems. The researchers claimed that the results of these benchmarks are subject to data contamination and don’t give researchers control over the variables.
In response, they created new puzzles for the LRMs that focused on logical reasoning and required no outside knowledge to solve. The results showed that the models failed completely beyond a certain level of complexity: the researchers wrote that, through experimentation across diverse puzzles, they showed that frontier LRMs face a complete accuracy collapse beyond certain complexities.
Tower of Hanoi
One of the puzzles used was the Tower of Hanoi, in which disks of decreasing size are stacked on the first of three rods. The player must move the entire tower from the first rod to the third, moving one disk per turn and never placing a larger disk on a smaller one. With three disks, the models solved the problem well. However, beyond a certain number of disks, the accuracy of all models dropped to zero.
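For context, the textbook recursive solution to the puzzle is only a few lines long. The Python sketch below illustrates that classic algorithm (not necessarily the exact hint the researchers supplied to the models) and shows why difficulty climbs so quickly: a tower of n disks needs 2^n - 1 moves.

    def hanoi(n, source, target, spare, moves):
        """Classic recursive Tower of Hanoi: move n disks from source to target."""
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, moves)  # move the top n-1 disks out of the way
        moves.append((source, target))              # move the largest remaining disk
        hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top

    moves = []
    hanoi(3, "A", "C", "B", moves)
    print(len(moves))  # 7 moves for 3 disks; 2**n - 1 in general, so over 1,000 moves by 10 disks

That exponential growth in required moves is what lets the researchers dial up complexity simply by adding disks.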
Researchers noted that this was the case even when the solution algorithm was provided, implying a limitation inherent to reasoning models. Models were also found to ‘overthink’ less complex problems, finding the solution early but then wasting time and tokens exploring incorrect paths. Overall, the paper suggests that LRMs generalize less well than providers claim, and that they tend to reduce the computation they spend as problems get harder.
The research was conducted by Apple researchers Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar, experts in efficient LLM usage, AI and machine learning.
Apple’s research casts doubt on the claims made by AI developers regarding the potential of reasoning models. OpenAI claimed its models could spend more time thinking through problems and refine their thinking process based on mistakes. Similarly, Google has promoted how Gemini 2.5’s reasoning helps it handle complicated problems, and Anthropic claimed that Claude 3.7 Sonnet was designed for real-world tasks rather than math and computer science problems. This research paper questions these claims by showing that LRMs become less accurate when presented with harder tasks.
Earlier LLMs struggled with ‘needle in a haystack’ problems because they prioritize information at the beginning and end of prompts. What we’re seeing here could be the reasoning model equivalent, a fatal flaw that causes them to lose focus and fail when working on a problem for too long. The researchers allowed the models to use their full ‘budget’ for thinking, but beyond a certain complexity level, the models stopped spending tokens on reasoning. Customers may question whether the ‘thinking’ models they’ve been sold are truly up to the task.
Rory Bathgate is Features and Multimedia Editor at ITPro.