LLMs Struggle with Long-Horizon Procedural Reasoning, Exposing an 'Illusion of Competence'

Published on November 5, 2025 at 11:36 PM
Large language models (LLMs) have demonstrated impressive capabilities across a range of reasoning tasks, but a new study questions their true ability to perform procedural reasoning: executing multi-step, rule-based computations over long horizons. Researchers Mahdi Samiei, Mahdi Mansouri, and Mahdieh Soleymani Baghshah introduce Finite-State Machine (FSM) execution as a controlled framework for evaluating the procedural reasoning capacity of LLMs.

The FSM setup requires models to follow an explicit FSM definition and execute it step by step on a sequence of input actions, maintaining state consistency across multiple turns. The task requires no prior world knowledge, focusing instead on the model's ability to apply deterministic transition rules faithfully. The study measures both Turn Accuracy (local correctness) and Task Accuracy (global long-horizon correctness) to differentiate immediate computation from cumulative state maintenance.

The results reveal systematic degradation in performance as the task horizon or branching complexity increases. Models performed significantly worse with high branching factors (many actions per state) than with long memory spans (many states, few actions). Larger models showed improved local accuracy but remained brittle under multi-step reasoning unless explicitly prompted to externalize intermediate steps.

The findings suggest that current LLMs exhibit an "illusion of procedural reasoning," mimicking algorithmic behavior for short traces but failing to sustain coherent execution as procedural depth grows. The FSM-based evaluation provides a transparent probe for diagnosing this failure mode and for guiding strategies to improve the algorithmic reliability of LLMs.
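To make the setup concrete, the sketch below shows one way such an FSM execution task and the two metrics could be instantiated. It is not the authors' benchmark code: the random construction of the transition table, the function names, and the exact metric definitions (per-step match for Turn Accuracy, all-steps-correct for Task Accuracy) are illustrative assumptions based on the description above.

```python
import random

def make_fsm(num_states, num_actions, seed=0):
    """Build a random deterministic FSM: transitions[state][action] -> next_state."""
    rng = random.Random(seed)
    return {
        s: {a: rng.randrange(num_states) for a in range(num_actions)}
        for s in range(num_states)
    }

def execute(fsm, start_state, actions):
    """Ground-truth execution: apply each action's transition rule in order."""
    trace, state = [], start_state
    for a in actions:
        state = fsm[state][a]
        trace.append(state)
    return trace

def score(reference, predicted):
    """Turn Accuracy: fraction of steps whose predicted state matches the reference.
    Task Accuracy: 1.0 only if the entire trace (and hence the final state) is correct."""
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference), float(correct == len(reference))

# Illustrative run: 8 states, 4 actions per state, a 20-step action sequence.
rng = random.Random(1)
fsm = make_fsm(num_states=8, num_actions=4)
actions = [rng.randrange(4) for _ in range(20)]
reference = execute(fsm, start_state=0, actions=actions)

# `predicted` would normally be parsed from the LLM's step-by-step response;
# the reference trace is reused here only to demonstrate the metric computation.
predicted = list(reference)
print(score(reference, predicted))  # -> (1.0, 1.0)
```

Under a scheme like this, a model can score well on Turn Accuracy while still failing on Task Accuracy, since a single incorrect transition breaks the cumulative trace; that gap is what lets the evaluation separate local computation from long-horizon state maintenance.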