Study Reveals LLMs Struggle with Long-Horizon Procedural Reasoning
Published on November 5, 2025
Large language models (LLMs) have demonstrated impressive results on various reasoning tasks, but a new study questions their ability to perform genuine procedural reasoning. Researchers introduced Finite-State Machine (FSM) Execution as a framework for evaluating the procedural reasoning capacity of LLMs: given a defined FSM, a model must apply the correct transition rule at every step and maintain a consistent state across the whole input sequence.
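The paper's exact task format is not reproduced here, but a minimal sketch of what FSM execution demands of a model might look like the following: given a transition table, a start state, and an input sequence, produce the resulting state at every step. The `run_fsm` helper and the toy three-state machine are illustrative assumptions, not the study's benchmark code.

```python
# Minimal sketch of an FSM-execution task, assuming a deterministic FSM
# given as a transition table. All names and the example machine are
# illustrative, not taken from the study.

def run_fsm(transitions: dict, start_state: str, inputs: list) -> list:
    """Execute the FSM on an input sequence and return the full state trace."""
    state = start_state
    trace = [state]
    for symbol in inputs:
        state = transitions[(state, symbol)]  # deterministic transition lookup
        trace.append(state)
    return trace

# A toy three-state machine over the alphabet {"a", "b"}.
transitions = {
    ("S0", "a"): "S1", ("S0", "b"): "S0",
    ("S1", "a"): "S2", ("S1", "b"): "S0",
    ("S2", "a"): "S2", ("S2", "b"): "S1",
}

inputs = ["a", "a", "b", "a", "b", "b"]
reference_trace = run_fsm(transitions, "S0", inputs)
print(reference_trace)  # ['S0', 'S1', 'S2', 'S1', 'S2', 'S1', 'S0']
```

An LLM under evaluation receives the same transition rules and input sequence in its prompt and must report the state after each step; the ground-truth trace above is what its answers are checked against.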
Findings indicate that LLMs struggle with long-horizon tasks and branching complexity. While larger models show improved local accuracy, they remain brittle under multi-step reasoning unless explicitly prompted to externalize intermediate steps. This exposes a consistent "illusion of procedural reasoning," where LLMs mimic algorithmic behavior for short traces but fail to sustain coherent execution as procedural depth grows.
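The "externalize intermediate steps" condition can be made concrete with a pair of prompt templates: one asking for the final state only, and one asking the model to write out the state after every input. The wording below is an invented illustration, not the study's actual prompts; the rules reuse the toy machine from the sketch above.

```python
# Hedged sketch of the prompting contrast described above. The wording of
# both templates is invented; the study's actual prompts may differ.

FSM_RULES = (
    "States: S0, S1, S2. On 'a': S0->S1, S1->S2, S2->S2. "
    "On 'b': S0->S0, S1->S0, S2->S1."
)
INPUTS = "a a b a b b"

# Direct answer: the model must track state silently across all steps.
direct_prompt = (
    f"{FSM_RULES}\nStart in S0 and read the inputs: {INPUTS}.\n"
    "Reply with the final state only."
)

# Externalized steps: the model writes the state after every input symbol.
externalized_prompt = (
    f"{FSM_RULES}\nStart in S0 and read the inputs: {INPUTS}.\n"
    "After each input symbol, write the current state on its own line, "
    "then report the final state."
)
```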
The study highlights the importance of distinguishing immediate processing ability (Turn Accuracy) from long-term state-holding capacity (Task Accuracy). Models can score high turn accuracy yet low task accuracy, because even occasional per-step errors compound over the length of a trace and derail the final result.
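Under that framing, the two metrics can be sketched roughly as follows: turn accuracy scores each predicted state on its own, while task accuracy credits only a fully correct trace. The definitions and example traces below are assumptions based on the article's description, not the paper's evaluation code.

```python
# Hedged sketch of the two metrics: turn accuracy scores each step
# individually, task accuracy requires the entire trace to be correct.
# The paper's exact definitions may differ.

def turn_accuracy(predicted: list, reference: list) -> float:
    """Fraction of steps where the predicted state matches the reference."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

def task_accuracy(predicted: list, reference: list) -> float:
    """1.0 only if every step of the trace is correct, else 0.0."""
    return float(predicted == reference)

reference = ["S1", "S2", "S1", "S2", "S1", "S0"]
predicted = ["S1", "S2", "S1", "S1", "S1", "S0"]  # a single slip mid-trace

print(turn_accuracy(predicted, reference))  # ~0.83: high local accuracy
print(task_accuracy(predicted, reference))  # 0.0: the task still fails
```

A model that slips only occasionally still looks strong on turn accuracy, yet its task accuracy collapses as traces grow longer, which is exactly the gap the study reports.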
Experiments also revealed that rule retrieval with a high branching factor degrades performance more than a long memory span does. This suggests that LLM-based systems should be designed around many simple decision points rather than a few complex ones, as sketched below. The authors argue that future work should explore methods to equip models with coherent, rule-based computation across extended horizons.
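The structural idea can be sketched without committing to any particular LLM stack: the same routing decision expressed as one eight-way lookup versus a chain of yes/no checks. Plain Python functions stand in for what would be model-driven decision points in a real system, and the routing labels and logic are invented for illustration.

```python
from itertools import product

# Illustrative sketch of the design takeaway: one complex decision point
# (an 8-way lookup over three flags) versus a chain of simple binary
# decisions that implements the same mapping. Labels are invented.

def route_flat(urgent: bool, paying: bool, technical: bool) -> str:
    """A single decision point with branching factor 8."""
    table = {
        (True,  True,  True):  "senior-engineer",
        (True,  True,  False): "account-manager",
        (True,  False, True):  "engineer",
        (True,  False, False): "support-lead",
        (False, True,  True):  "engineer",
        (False, True,  False): "account-manager",
        (False, False, True):  "support-queue",
        (False, False, False): "support-queue",
    }
    return table[(urgent, paying, technical)]

def route_stepwise(urgent: bool, paying: bool, technical: bool) -> str:
    """The same mapping decomposed into chained binary decisions."""
    if urgent:
        if paying:
            return "senior-engineer" if technical else "account-manager"
        return "engineer" if technical else "support-lead"
    if technical:
        return "engineer" if paying else "support-queue"
    return "account-manager" if paying else "support-queue"

# Both formulations agree on every input.
assert all(
    route_flat(*flags) == route_stepwise(*flags)
    for flags in product([True, False], repeat=3)
)
```

In an LLM-based pipeline, the stepwise version corresponds to asking the model several narrow questions in sequence rather than one many-way classification, which is the shape the results favor.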