AI 'Thinking' Illusion: Apple's Research
Source: mashable.com
New artificial intelligence research from Apple indicates that AI reasoning models might not be as intelligent as believed. A paper released just before Apple's WWDC event reveals that large reasoning models (LRMs), including OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking, falter when confronted with increasingly difficult problems.
The research, conducted by the same team that previously identified reasoning flaws in LLMs, points to real limits on how intelligent reasoning models are. While LRMs outperformed standard LLMs on moderately difficult puzzles, they actually did worse than LLMs on the simplest ones. Apple's research also showed that these models give up prematurely when puzzles become complex, a failure the researchers call "The Illusion of Thinking," and one that shows up in hard problems beyond the math and coding benchmarks these models are usually judged on.
Apple has been cautious about building large language models and integrating AI into its devices. The rollout of its Apple Intelligence features has been widely seen as underwhelming, and this research may help explain why Apple has hesitated to embrace AI as fully as Google and Samsung, which have built it deeply into their devices.
Logic Puzzles and Model Failure
The reasoning models, or LRMs, were evaluated on classic logic puzzles such as the Tower of Hanoi, which involves moving a stack of discs from one peg to another, one disc at a time, without ever placing a larger disc on a smaller one. Other puzzles included checker jumping, the river-crossing problem, and stacking blocks into specified arrangements. These puzzles are commonly used to assess human reasoning and problem-solving skills.
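For context, the Tower of Hanoi has a simple, well-known recursive solution, and the minimal number of moves for n discs is 2^n − 1, so each added disc roughly doubles the length of the solution a model has to produce. The Python sketch below is only an illustration of that classical algorithm, not the code Apple's researchers used to evaluate the models.

```python
# Classic recursive Tower of Hanoi solver (illustrative sketch, not the paper's evaluation code).
# Moving n discs takes 2**n - 1 moves, so every extra disc roughly doubles
# the length of the move sequence a model would have to produce.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the list of (disc, from_peg, to_peg) moves for n discs."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # park the n-1 smaller discs on the spare peg
    moves.append((n, source, target))            # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller discs on top of it
    return moves

if __name__ == "__main__":
    for n in range(3, 8):
        print(f"{n} discs -> {len(hanoi(n))} moves (2^{n} - 1 = {2**n - 1})")
```

Running the sketch shows the exponential growth in solution length (7, 15, 31, 63, 127 moves for 3 to 7 discs), which is the kind of scaling the researchers used to push the models toward their failure point.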
Researchers found that LRMs begin to fail once a puzzle passes a certain level of complexity. According to the paper, all of the reasoning models show the same pattern: accuracy declines as problem complexity increases, then collapses to zero beyond a threshold. Claude 3.7 Sonnet Thinking and DeepSeek-R1, for example, start to fail once a fifth disc is added to the Tower of Hanoi puzzle. Giving the LRMs more computing power did not stop them from failing at the more complex puzzles.
Reasoning Effort and Accuracy
The research also found that reasoning models initially spend more thinking tokens as complexity grows, but eventually give up: as they approach a critical threshold, they reduce their reasoning effort even though the problems are getting harder. And even when the LRMs were given the solution algorithm, their accuracy did not improve.
Despite these findings, it's important to read them in the context of other LLM research. LLMs have weaknesses, but they remain useful for tasks like coding and writing. Gary Marcus noted that humans run into limits similar to the ones the Apple team found. Still, the Apple paper indicates that LLMs are not a substitute for well-specified conventional algorithms.