LLMs Struggle with Long-Horizon Procedural Reasoning, Exposing an 'Illusion of Competence'

Published on November 5, 2025

LLMs Face Challenges in Long-Horizon Procedural Reasoning

Large language models (LLMs) have demonstrated impressive capabilities across a range of reasoning tasks, but a new study reveals significant limitations in their ability to perform long-horizon procedural reasoning. The researchers used finite-state machine (FSM) execution as an evaluation framework, testing whether models can carry out multi-step, rule-based state transitions.
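
The paper's exact task construction is not reproduced here, but a minimal sketch of what FSM-execution evaluation might look like is shown below: sample a random transition table, generate an action sequence of a given horizon, and compare the model's predicted final state against ground truth. The function and parameter names (build_fsm, run_fsm, make_task, n_states, n_actions) are illustrative assumptions, not the authors' code.

```python
import random

def build_fsm(n_states: int, n_actions: int, seed: int = 0) -> dict:
    """Sample a random deterministic FSM: transition[state][action] -> next state.
    (Illustrative assumption; the paper's exact construction may differ.)"""
    rng = random.Random(seed)
    return {
        s: {a: rng.randrange(n_states) for a in range(n_actions)}
        for s in range(n_states)
    }

def run_fsm(fsm: dict, start: int, actions: list[int]) -> int:
    """Ground-truth execution: apply each action to the current state."""
    state = start
    for a in actions:
        state = fsm[state][a]
    return state

def make_task(fsm: dict, horizon: int, seed: int = 0) -> tuple[list[int], int]:
    """Sample an action sequence of a given horizon and its correct final state."""
    rng = random.Random(seed)
    n_actions = len(fsm[0])
    actions = [rng.randrange(n_actions) for _ in range(horizon)]
    return actions, run_fsm(fsm, start=0, actions=actions)

# A "wide" FSM (few states, many actions) versus a "deep" one
# (many states, few actions), probed at increasing horizons.
for n_states, n_actions in [(4, 16), (16, 4)]:
    fsm = build_fsm(n_states, n_actions)
    for horizon in (8, 32, 128):
        actions, target = make_task(fsm, horizon)
        # In the real benchmark, the transition table and action sequence would be
        # serialized into a prompt and the LLM's answer compared against `target`.
        print(n_states, n_actions, horizon, target)
```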

The study found that LLMs struggle with extended reasoning chains, with performance degrading as task complexity increases. Degradation is especially severe when tasks involve high branching factors (many possible actions per state), more so than when they demand long memory spans (many states but few actions). Even larger models, such as Qwen3-235B, reached only about 50% accuracy on these tasks, highlighting persistent challenges.

Key Findings

  • LLMs mimic algorithmic behavior for short tasks but fail in longer, more complex procedures.
  • Performance degrades as branching complexity and the length of the action sequence increase.
  • Larger models show improved local accuracy but remain brittle in multi-step reasoning.

The researchers introduced the concept of "negative self-conditioning," where errors compound and degrade the model's ability to perform subsequent steps correctly. This phenomenon was observed across various Qwen models, with a positive correlation between model scale and accuracy, though the improvements were limited.
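
To build intuition for why errors compound, a simple back-of-the-envelope model (an assumption of this article, not the paper's analysis) treats each transition as an independent trial with per-step accuracy p, so the chance of completing an n-step execution correctly is roughly p^n, which collapses quickly even when p is high. Negative self-conditioning makes matters worse than this independence assumption, since an early error also lowers the accuracy of the steps that follow.

```python
# Rough intuition (not the paper's model): if each transition is executed
# correctly with probability p, full-trace accuracy falls off roughly like p ** n.
for p in (0.99, 0.95, 0.90):
    for n in (10, 50, 100):
        print(f"per-step p={p:.2f}, horizon n={n:3d} -> trace accuracy ~ {p**n:.3f}")
```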

Implications and Solutions

The study concludes that FSM-based evaluation is a valuable tool for understanding and improving the algorithmic reliability of LLMs. By incorporating compositional inductive biases, explicit memory mechanisms, and modular execution scaffolds, future architectures and prompting strategies can address these limitations.

The findings also suggest that enabling LLMs to externalize intermediate steps can mitigate some of these challenges, providing a path forward for enhancing their procedural reasoning capabilities.
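
One way to externalize intermediate steps is a step-by-step scaffold: instead of asking for the final state in one shot, a driver asks the model for one transition at a time and writes the updated state back into the next prompt. The sketch below is illustrative rather than the authors' implementation; it reuses the fsm dictionary shape from the earlier sketch and assumes a hypothetical llm_complete(prompt) helper standing in for any chat/completions API.

```python
def llm_complete(prompt: str) -> str:
    """Placeholder for a call to whatever LLM API is in use (assumption)."""
    raise NotImplementedError

def format_rules(fsm: dict) -> str:
    """Render the transition table as explicit text rules for the prompt."""
    lines = []
    for state, row in fsm.items():
        for action, nxt in row.items():
            lines.append(f"In state {state}, action {action} -> state {nxt}")
    return "\n".join(lines)

def scaffolded_execute(fsm: dict, start: int, actions: list[int]) -> int:
    """Externalize the computation: one transition per model call, with the
    current state written out explicitly between calls."""
    state = start
    rules = format_rules(fsm)
    for a in actions:
        prompt = (
            f"{rules}\n\n"
            f"Current state: {state}\n"
            f"Action: {a}\n"
            f"Reply with only the next state number."
        )
        state = int(llm_complete(prompt).strip())
    return state
```

The design point is that the model never has to hold a long chain of transitions in its head: each call is a single lookup against explicit rules, shifting the long-horizon bookkeeping from the model to the scaffold.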