
Apple: AI Reasoning Has Limits

Source: itpro.com

Published on June 10, 2025

Updated on June 10, 2025

[Image: An illustration of AI reasoning models struggling with complex problems, represented by a Tower of Hanoi puzzle]


Apple's recent research has revealed significant limitations in AI reasoning models, particularly when tackling complex problems. These findings challenge the widely held belief that such models are effective for tasks traditionally performed by humans, raising questions about their reliability in enterprise AI applications.

Reasoning models, which are designed to break down complex problems into smaller, manageable parts, have been promoted by major providers like OpenAI, Anthropic, and Google as essential tools for enterprise AI. However, Apple's study suggests that these models struggle to maintain accuracy as problem complexity increases, leading to longer response times, wasted tokens, and incorrect answers.

Flawed Benchmarks

The research team at Apple identified flaws in the current benchmarks used to evaluate large reasoning models (LRMs). These benchmarks often focus on coding and mathematical problems, which are susceptible to data contamination and fail to provide researchers with control over variables. To address this issue, the team created new puzzles that emphasize logical reasoning and do not require external knowledge to solve.

The results of these new benchmarks were striking. The models performed well on low-complexity tasks but experienced a complete accuracy collapse when faced with more challenging problems. This phenomenon was observed across various models, including OpenAI's o1 and o3 models, DeepSeek R1, Anthropic's Claude 3.7 Sonnet, and Google's latest version of Gemini.

Tower of Hanoi

One of the puzzles used in the study was the Tower of Hanoi, a classic problem-solving game. In this game, disks of different sizes are stacked on a rod, and the objective is to move the entire stack to another rod, following specific rules. While the models performed well with three disks, their accuracy dropped to zero as the number of disks increased, even when provided with the solution algorithm.
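To see why the puzzle gets hard so quickly, it helps to look at the structure of the solution. The sketch below is a minimal Python rendering of the standard recursive algorithm, offered purely for illustration rather than as the evaluation harness Apple used: solving n disks optimally takes 2^n - 1 moves, so every added disk doubles the number of steps a model must execute without a single error.

```python
# Minimal, illustrative Tower of Hanoi solver (not Apple's test harness).
# The optimal solution for n disks takes 2**n - 1 moves, so each extra
# disk doubles the error-free sequence a model has to produce.

def hanoi(n, source, target, spare, moves):
    """Append the optimal move sequence for n disks to `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 disks
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks

if __name__ == "__main__":
    for disks in (3, 7, 10):
        moves = []
        hanoi(disks, "A", "C", "B", moves)
        print(f"{disks} disks: {len(moves)} moves")  # 7, 127, 1023
```

At three disks, that is only seven moves; at ten, it is 1,023, which helps explain how accuracy can hold up on small instances yet collapse entirely on larger ones.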

Researchers noted that the models often 'overthought' simpler problems, finding the correct solution early but then wasting time and tokens exploring incorrect paths. This behavior points to an inherent limitation of reasoning models, whose computational efficiency degrades as problems grow more complex.

Implications for Enterprise AI

Apple's findings have significant implications for the enterprise AI sector. Companies that rely on AI reasoning models for complex decision-making processes may need to reevaluate their approaches. The study suggests that these models are less generalizable than providers claim, casting doubt on their ability to handle real-world tasks effectively.

OpenAI, Google, and Anthropic have all promoted the benefits of their reasoning models, claiming they can spend more time thinking through problems and refine their answers when they make mistakes. Apple's research challenges these claims, however, demonstrating that the models become less accurate as tasks grow more complex.

Ultimately, Apple's research highlights the need for further investigation into the limitations of AI reasoning models. As demand for enterprise AI solutions continues to grow, it will be essential to develop models that can handle complex problems reliably without sacrificing accuracy or efficiency.