Study Exposes 'Illusion of Procedural Reasoning' in Large Language Models

Published on November 5, 2025 at 11:36 PM
Large language models (LLMs) often struggle with procedural reasoning: executing multi-step, rule-based computations over extended horizons. Although they perform well on tasks framed as reasoning problems, their ability to faithfully execute long sequences of steps remains questionable. To probe this capacity, researchers have introduced Finite-State Machine (FSM) Execution as a framework for evaluating the procedural reasoning of LLMs.

The FSM framework requires a model to execute a defined FSM step by step, maintaining a consistent state across turns in response to input actions. The task isolates the model's ability to apply deterministic transition rules, free of any dependence on world knowledge. The study measures both Turn Accuracy (local, per-step correctness) and Task Accuracy (global, long-horizon correctness) to separate immediate computational ability from cumulative state maintenance.

Empirical results show systematic degradation in performance as the task horizon or branching complexity increases. Models struggle more when rule retrieval involves high branching factors (many actions per state) than when the task demands long memory spans (many states, few actions). Larger models show improved local accuracy, but they remain brittle under multi-step reasoning unless explicitly prompted to externalize intermediate steps.

These findings expose an 'illusion of procedural reasoning': LLMs mimic algorithmic behavior on short traces but fail to sustain coherent execution as procedural depth grows. The researchers suggest that FSM-based evaluation offers a complexity-controlled probe for diagnosing this failure, potentially guiding the design of inductive biases, memory mechanisms, and reasoning scaffolds that enhance algorithmic reliability in LLMs.
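
The study's benchmark code is not reproduced here, but the setup can be illustrated with a minimal Python sketch of an FSM-execution probe and the two metrics. The transition table, the action alphabet, and the `model_predict` placeholder (standing in for an LLM prompted with the rules and the action sequence) are illustrative assumptions, not the paper's actual protocol.

```python
import random

# Illustrative deterministic transition table: transitions[state][action] -> next state.
# Two actions per state keeps the branching factor small; the study varies
# both the number of states and the number of actions.
transitions = {
    "S0": {"a": "S1", "b": "S0"},
    "S1": {"a": "S2", "b": "S0"},
    "S2": {"a": "S2", "b": "S1"},
}

def run_fsm(start, actions):
    """Ground-truth execution: apply each action's transition rule in order."""
    state = start
    trace = []
    for act in actions:
        state = transitions[state][act]
        trace.append(state)
    return trace

def score(reference, predicted):
    """Turn Accuracy: fraction of individual steps predicted correctly.
    Task Accuracy: 1 only if every step over the full horizon is correct."""
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference), float(correct == len(reference))

def model_predict(start, actions):
    """Placeholder for an LLM executor: occasionally drops a transition,
    mimicking the cumulative state-tracking errors the study reports."""
    state = start
    out = []
    for act in actions:
        if random.random() > 0.1:  # 90% of turns apply the rule correctly
            state = transitions[state][act]
        out.append(state)
    return out

# A 10-step horizon over a random action sequence.
actions = [random.choice("ab") for _ in range(10)]
reference = run_fsm("S0", actions)
turn_acc, task_acc = score(reference, model_predict("S0", actions))
print(f"Turn Accuracy: {turn_acc:.2f}  Task Accuracy: {task_acc:.0f}")
```

Under this kind of per-turn error model, Task Accuracy decays roughly geometrically with horizon length (for example, 0.9^10 ≈ 0.35 even when 90% of individual turns are correct), which illustrates why high local accuracy does not guarantee long-horizon faithfulness.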