LLMs Struggle with Long-Horizon Procedural Reasoning: FSM Execution Exposes Algorithmic Unreliability
Published on November 5, 2025 at 11:36 PM
Large language models (LLMs), despite excelling in tasks framed as reasoning problems, struggle with genuine procedural reasoning involving multi-step, rule-based computations. Researchers have introduced Finite-State Machine (FSM) Execution as a controlled framework to evaluate this capability. The FSM setup requires models to execute step-by-step transitions based on input actions, maintaining state consistency throughout.
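To make the task concrete, here is a minimal sketch of what such an FSM execution problem might look like. The states, actions, and transition table below are illustrative placeholders, not drawn from the paper:

```python
# Illustrative FSM execution task: given a transition table and an action
# sequence, the model must report the state reached after every step.
# All states, actions, and transitions here are made-up examples.

def run_fsm(transitions, start_state, actions):
    """Ground-truth executor: apply each action and record the state trace."""
    state = start_state
    trace = [state]
    for action in actions:
        state = transitions[(state, action)]
        trace.append(state)
    return trace

# A tiny 3-state, 2-action machine.
transitions = {
    ("S0", "a"): "S1", ("S0", "b"): "S0",
    ("S1", "a"): "S2", ("S1", "b"): "S0",
    ("S2", "a"): "S2", ("S2", "b"): "S1",
}

actions = ["a", "a", "b", "a", "b", "b"]
print(run_fsm(transitions, "S0", actions))
# ['S0', 'S1', 'S2', 'S1', 'S2', 'S1', 'S0']
```

The model's job is to reproduce that trace (or its final state) from the rules and the action sequence alone, which is what makes execution fidelity directly checkable.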
Empirical results show a systematic decline in performance as task complexity and execution horizon increase. Models falter more when rule retrieval requires handling high branching factors (many actions per state) than when it requires long memory spans (many states, few actions). Larger models achieve better local, single-step accuracy but remain brittle over multi-step execution unless prompted to externalize intermediate steps.
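One way to probe this asymmetry is to hold the total number of rules fixed while varying the state/action split. The construction below is a hypothetical sketch of that idea, not the paper's exact setup:

```python
import random

def make_fsm(num_states, num_actions, seed=0):
    """Build a random FSM; the total rule count is num_states * num_actions."""
    rng = random.Random(seed)
    states = [f"S{i}" for i in range(num_states)]
    actions = [f"a{i}" for i in range(num_actions)]
    transitions = {(s, a): rng.choice(states) for s in states for a in actions}
    return states, actions, transitions

# Same number of rules (64), different shape:
wide_fsm = make_fsm(num_states=4,  num_actions=16)  # high branching factor
deep_fsm = make_fsm(num_states=16, num_actions=4)   # long memory span, few actions
```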
These findings expose an "illusion of procedural reasoning": LLMs mimic algorithmic behavior briefly but fail to sustain coherent execution over longer procedures. FSM-based evaluation offers a means to diagnose these failures and guide the design of inductive biases, memory mechanisms, and reasoning scaffolds for genuine long-horizon procedural competence. The study emphasizes grounding "reasoning" in measurable execution fidelity to establish a rigorous foundation for improving LLM algorithmic reliability.
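Execution fidelity lends itself to simple trace-level metrics, such as per-step state accuracy and the length of the longest correctly executed prefix. The metric below is one illustrative choice, not necessarily the one used in the study:

```python
def execution_fidelity(predicted_trace, true_trace):
    """Per-step state accuracy plus the length of the longest correct prefix."""
    n = len(true_trace)
    matches = [p == t for p, t in zip(predicted_trace, true_trace)]
    per_step_accuracy = sum(matches) / n  # missing steps count as wrong
    correct_prefix = 0
    for ok in matches:
        if not ok:
            break
        correct_prefix += 1
    return per_step_accuracy, correct_prefix

acc, prefix = execution_fidelity(
    ["S0", "S1", "S2", "S1", "S0", "S0", "S0"],  # model's predicted trace
    ["S0", "S1", "S2", "S1", "S2", "S1", "S0"],  # ground-truth trace
)
print(acc, prefix)  # 0.714..., 4
```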
The experiments revealed that model scale improves short-horizon execution but does not eliminate long-horizon instability. The performance asymmetry between wide FSMs (few states, many actions) and deep FSMs (many states, few actions) indicates that failures often stem from rule retrieval under high branching factors. Instruction complexity experiments showed that LLMs struggle to perform multiple sequential steps internally, though this weakness can be mitigated by prompting the model to externalize intermediate steps.
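In practice, externalizing intermediate steps amounts to asking for the full state trace rather than only the final state. A hypothetical prompt builder along those lines (the wording and format are assumptions, not the paper's prompts):

```python
def build_prompt(transitions, start_state, actions, externalize=True):
    """Format an FSM execution query; 'externalize' asks for every intermediate state."""
    rules = "\n".join(
        f"  In state {s}, action {a} -> {t}"
        for (s, a), t in sorted(transitions.items())
    )
    task = (
        "List the state after each action, one per line."
        if externalize
        else "Report only the final state."
    )
    return (
        f"You are executing a finite-state machine.\nRules:\n{rules}\n"
        f"Start state: {start_state}\nActions: {' '.join(actions)}\n{task}"
    )
```

Comparing model accuracy with `externalize=True` versus `externalize=False` on the same machines is one straightforward way to reproduce the externalization effect the study describes.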