LLMs Struggle with Long-Horizon Procedural Reasoning: FSM Execution Reveals Limitations

Published on November 5, 2025 at 11:36 PM
Large language models (LLMs), while impressive in many areas, often struggle with long-horizon procedural reasoning, according to a new study. The researchers introduce Finite-State Machine (FSM) execution as a framework for evaluating this capacity: the model must follow step-by-step transition rules while keeping its state consistent across many turns. The study, detailed in a paper titled "The Illusion of Procedural Reasoning: Measuring Long-Horizon FSM Execution in LLMs," finds that LLMs degrade systematically as the task horizon or branching complexity increases. Performance drops most sharply when the transition rules have high branching factors, pointing to a failure of rule retrieval rather than of state memory. Larger models achieve better local accuracy but remain brittle under multi-step reasoning unless explicitly prompted to externalize their intermediate steps.

Experiments tested a range of Qwen models and Gemini-2.5-flash on various FSM configurations. The results show a positive correlation between model scale and accuracy, yet even the largest models failed to maintain fidelity over long horizons. Increasing the instruction complexity per turn also degraded performance significantly, a limitation that prompting the model to reason explicitly only partially mitigates.

The researchers conclude that current LLMs exhibit an "illusion of procedural reasoning": they imitate algorithmic control in short contexts but lack stable internal mechanisms for multi-step execution. They suggest that future research focus on architectures and prompting strategies that incorporate compositional inductive bias, explicit memory, or modular execution scaffolds to enable genuinely systematic reasoning.
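
To make the FSM-execution setup concrete, here is a minimal sketch of how such an evaluation could be scored. The specific machine, alphabet, horizon, and per-step accuracy metric below are illustrative assumptions, not the paper's actual benchmark configuration; the model-querying step is left as a placeholder.

```python
import random

# Illustrative 3-state FSM over the alphabet {a, b}. The real benchmark
# presumably varies state count and branching factor; this is an assumption.
FSM = {
    "S0": {"a": "S1", "b": "S2"},
    "S1": {"a": "S2", "b": "S0"},
    "S2": {"a": "S0", "b": "S1"},
}

def ground_truth_trajectory(start, symbols):
    """Execute the FSM exactly, returning the state after each input symbol."""
    state, states = start, []
    for sym in symbols:
        state = FSM[state][sym]
        states.append(state)
    return states

def score_model(model_states, start, symbols):
    """Per-step accuracy: fraction of turns where the model's reported state
    matches the true FSM state."""
    truth = ground_truth_trajectory(start, symbols)
    correct = sum(m == t for m, t in zip(model_states, truth))
    return correct / len(truth)

if __name__ == "__main__":
    random.seed(0)
    horizon = 20                                    # task horizon (number of turns)
    symbols = [random.choice("ab") for _ in range(horizon)]

    # In the actual evaluation, the model would be prompted turn by turn
    # ("current state is X, input is y, what is the next state?") and its
    # answers collected here; the ground truth stands in as a placeholder.
    model_states = ground_truth_trajectory("S0", symbols)
    print("per-step accuracy:", score_model(model_states, "S0", symbols))
```

Under this kind of setup, lengthening the input sequence probes the task horizon, while adding states or input symbols probes branching complexity, which matches the two axes of degradation the study reports.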