LLMs Struggle with Long-Horizon Reasoning: FSM Execution Exposes Algorithmic Unreliability
Published on November 5, 2025 at 11:36 PM
Large language models (LLMs) have shown remarkable results on reasoning tasks, but their ability to perform procedural reasoning—executing multi-step, rule-based computations—remains questionable. Unlike algorithmic systems that deterministically execute long procedures, LLMs often falter under extended reasoning chains.
Researchers have introduced Finite-State Machine (FSM) Execution as a framework for evaluating the procedural reasoning capacity of LLMs. In this setup, the model receives an explicit FSM definition and must execute it step by step on a sequence of input actions, tracking the current state throughout. Because every transition is determined by an explicit rule, the task directly probes the model's procedural fidelity.
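To make the setup concrete, here is a minimal sketch of the task in Python: a dict-based transition table serves as the deterministic ground truth, and the same definition is rendered as a text prompt. The encoding and prompt wording are illustrative assumptions, not the paper's exact format.

```python
# Illustrative FSM Execution task: transitions[state][action] -> next state.
transitions = {
    "S0": {"a": "S1", "b": "S0"},
    "S1": {"a": "S2", "b": "S0"},
    "S2": {"a": "S2", "b": "S1"},
}

def execute_fsm(start: str, actions: list[str]) -> list[str]:
    """Ground-truth trace: apply each action's transition rule in order."""
    state, trace = start, []
    for action in actions:
        state = transitions[state][action]
        trace.append(state)
    return trace

def build_prompt(start: str, actions: list[str]) -> str:
    """Render the FSM definition and action sequence as a text task."""
    rules = "\n".join(
        f"In state {s}, action '{a}' moves to state {t}."
        for s, acts in transitions.items() for a, t in acts.items()
    )
    return (
        f"{rules}\n\nStart in state {start}. "
        f"Apply the actions {actions} one at a time and "
        "report the state after each action."
    )

actions = ["a", "a", "b", "a", "b", "b"]
print(execute_fsm("S0", actions))  # ['S1', 'S2', 'S1', 'S2', 'S1', 'S0']
```

The model's reported trace is then compared against the deterministic trace, so any single misapplied rule is detectable.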
Empirical results show systematic degradation as task horizon or branching complexity increases. Models perform worse when rule retrieval involves a high branching factor (many actions per state) than when it requires a long memory span (many states, few actions). Larger models improve local accuracy but remain brittle in multi-step reasoning unless prompted to externalize intermediate steps.
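These two complexity axes can be varied independently. The sketch below shows one way such FSMs could be sampled; random complete transition tables are an assumption about the generation procedure, not the paper's exact method.

```python
# Sample FSMs along the two complexity axes: branching factor
# (actions per state) vs. memory span (number of states).
import random

def random_fsm(num_states: int, num_actions: int, seed: int = 0):
    """Sample a complete FSM: every (state, action) pair gets a successor."""
    rng = random.Random(seed)
    states = [f"S{i}" for i in range(num_states)]
    acts = [f"a{j}" for j in range(num_actions)]
    return {s: {a: rng.choice(states) for a in acts} for s in states}

# High branching factor: few states, many actions per state.
wide = random_fsm(num_states=4, num_actions=16)
# Long memory span: many states, few actions per state.
deep = random_fsm(num_states=16, num_actions=4)
```

Holding the total number of transition rules roughly constant while trading states against actions isolates which kind of rule retrieval the model finds harder.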
These findings expose a consistent "illusion of procedural reasoning": LLMs can mimic algorithmic behavior for short traces but fail to sustain coherent execution as procedural depth grows. Even the largest model tested (Qwen3-235B) reached only about 50% overall task accuracy, confirming persistent long-horizon degradation. FSM-based evaluation thus offers a complexity-controlled probe for diagnosing this failure mode and for guiding designs that improve the algorithmic reliability of LLMs.
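The distinction between local and overall accuracy is worth pinning down. A minimal sketch of the two natural metrics, assuming the model's output has been parsed into a state trace (these definitions are assumptions, not necessarily the paper's exact ones):

```python
# Per-step accuracy: fraction of individual transitions predicted correctly.
# Task accuracy: the entire trace must match; one slip fails the whole task.
def per_step_accuracy(pred: list[str], gold: list[str]) -> float:
    # zip truncates, so any missing predictions simply score as incorrect.
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def task_success(pred: list[str], gold: list[str]) -> bool:
    return pred == gold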
The research also carries a design lesson: when building LLM-based systems, workflows with many simple decision points (many states, few actions) are preferable to workflows with a few complex decision points (few states, many actions). Instruction complexity matters as well, with models struggling to process multiple sequential actions in a single turn. This can be mitigated by enabling explicit reasoning and prompting the model to externalize intermediate steps, as in the sketch below.
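The following contrasts two illustrative prompt styles for the same action sequence; asking for the state after every action is the externalization strategy described above, though the exact wording here is an assumption.

```python
actions = ["a", "a", "b"]

# Compressed: all actions in one instruction, answer only the final state.
compressed = (
    f"Starting in S0, apply {actions} and reply with the final state only."
)

# Externalized: force one explicit transition per line before answering.
externalized = (
    "Starting in S0, apply the actions one at a time.\n"
    + "\n".join(f"Step {i + 1}: apply '{a}', then state the new state."
                for i, a in enumerate(actions))
    + "\nFinally, report the state after the last action."
)
```

The externalized form trades tokens for reliability: each intermediate state is written into the context, so later transitions condition on explicit text rather than on an implicitly tracked state.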