Study Reveals 'Illusion of Procedural Reasoning' in Large Language Models

Published on November 5, 2025 at 06:36 PM
Large language models (LLMs) often falter when executing multi-step, rule-based computations, according to a new study. The researchers introduced Finite-State Machine (FSM) Execution as a framework for evaluating procedural reasoning in LLMs, and reported systematic degradation as task complexity increases.

The study found that LLMs struggle with long-horizon tasks, exhibiting what the authors call an 'illusion of procedural reasoning': models can mimic algorithmic behavior over short traces, but they fail to sustain coherent execution as procedural depth grows. Task accuracy declined predictably with the number of actions and the input length, and early state-tracking mistakes propagated through subsequent steps, amplifying downstream error rates.

Experiments showed that larger models achieved better local accuracy but remained brittle under multi-step reasoning unless explicitly prompted to externalize intermediate steps. Surprisingly, models performed significantly worse when rule retrieval involved a high branching factor (many actions per state) than when it required a long memory span (many states, few actions). These findings suggest that current LLMs lack stable internal mechanisms for multi-step execution. The researchers propose FSM-based evaluation as a transparent way to diagnose this failure mode and to guide the design of architectures and prompting strategies that improve algorithmic reliability in LLMs.
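To make the evaluation setup concrete, the following is a minimal sketch of what an FSM-execution probe could look like, based only on the description above: a random deterministic transition table and an action sequence are serialized into a text prompt, and the model's answer is checked against the ground-truth final state. The function names (`make_fsm`, `run_fsm`, `build_prompt`, `score`) and parameters such as `num_states`, `num_actions`, and the trace length are illustrative assumptions, not the study's released code.

```python
"""Minimal sketch of an FSM-execution probe (illustrative; not the study's harness)."""
import random

def make_fsm(num_states: int, num_actions: int, seed: int = 0) -> dict:
    """Build a random deterministic transition table mapping (state, action) -> state."""
    rng = random.Random(seed)
    states = [f"S{i}" for i in range(num_states)]
    actions = [f"a{j}" for j in range(num_actions)]
    table = {(s, a): rng.choice(states) for s in states for a in actions}
    return {"states": states, "actions": actions, "table": table}

def run_fsm(fsm: dict, start: str, action_seq: list[str]) -> list[str]:
    """Ground-truth execution: return the full state trace, start state included."""
    trace = [start]
    for a in action_seq:
        trace.append(fsm["table"][(trace[-1], a)])
    return trace

def build_prompt(fsm: dict, start: str, action_seq: list[str]) -> str:
    """Serialize the transition rules and the action sequence as a text task."""
    rules = "\n".join(f"In state {s}, action {a} -> state {t}"
                      for (s, a), t in sorted(fsm["table"].items()))
    return (f"Transition rules:\n{rules}\n\n"
            f"Start state: {start}\n"
            f"Apply the actions in order: {', '.join(action_seq)}\n"
            f"What is the final state?")

def score(model_answer: str, gold_trace: list[str]) -> bool:
    """Exact-match check on the final state."""
    return model_answer.strip() == gold_trace[-1]

if __name__ == "__main__":
    # Varying the trace length probes procedural depth; varying num_actions per
    # state probes the branching factor the article contrasts with memory span.
    fsm = make_fsm(num_states=5, num_actions=3, seed=42)
    rng = random.Random(1)
    actions = [rng.choice(fsm["actions"]) for _ in range(12)]
    gold = run_fsm(fsm, start="S0", action_seq=actions)
    print(build_prompt(fsm, "S0", actions))
    print("Gold final state:", gold[-1])
```

Under this kind of setup, sweeping the action-sequence length while holding the machine fixed would trace the depth-related accuracy decline described above, and sweeping the number of actions per state versus the number of states would reproduce the branching-factor versus memory-span comparison.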