Study Exposes 'Illusion of Procedural Reasoning' in Large Language Models

Published on November 5, 2025 at 11:36 PM

Large Language Models Face Challenges in Procedural Reasoning

A recent study has exposed significant limitations in the procedural reasoning capabilities of large language models (LLMs). These models, while proficient in many reasoning tasks, struggle to execute multi-step, rule-based computations over extended horizons. The research introduces the Finite-State Machine (FSM) Execution framework as a method to evaluate LLMs' ability to maintain state consistency across steps, highlighting a systematic degradation in performance as task complexity increases.

The Finite-State Machine Framework

The FSM framework requires LLMs to execute a defined finite-state machine step by step, tracking the current state as a sequence of input actions arrives. This approach isolates the models' ability to apply deterministic transition rules, independent of world knowledge. By measuring both Turn Accuracy (local, per-step correctness) and Task Accuracy (global, long-horizon correctness), the researchers separate immediate computational ability from cumulative state maintenance.
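
To make the setup concrete, here is a minimal sketch of what such an FSM-execution evaluation could look like in Python. The `make_fsm` helper and the `query_model` interface are illustrative assumptions, not the study's actual harness, and the metric computations are simplified readings of Turn and Task Accuracy rather than the paper's exact definitions.

```python
# Minimal sketch of an FSM-execution evaluation.
# Assumption: query_model(state, action) returns the model's predicted next
# state as a string; the study's real prompting protocol may differ.
import random

def make_fsm(num_states, num_actions, seed=0):
    """Build a random deterministic transition table: (state, action) -> state."""
    rng = random.Random(seed)
    states = [f"S{i}" for i in range(num_states)]
    actions = [f"a{j}" for j in range(num_actions)]
    table = {(s, a): rng.choice(states) for s in states for a in actions}
    return states, actions, table

def evaluate(query_model, table, start_state, action_seq):
    """Run the model step by step over the action sequence and score it
    against the ground-truth trajectory defined by the transition table."""
    true_state = start_state
    model_state = start_state
    correct_turns = 0
    for action in action_seq:
        true_state = table[(true_state, action)]        # ground-truth transition
        model_state = query_model(model_state, action)  # model's own running state
        if model_state == true_state:
            correct_turns += 1
    turn_accuracy = correct_turns / len(action_seq)          # per-step (local) correctness
    task_accuracy = 1.0 if model_state == true_state else 0.0  # final-state (long-horizon) correctness
    return turn_accuracy, task_accuracy
```

In this reading, Turn Accuracy counts how often the model's reported state matches the ground-truth trajectory at each step, while Task Accuracy checks only the final state, so a single early slip can sink the long-horizon score even when most individual transitions are correct.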

Key Findings

Empirical results show that LLMs' performance degrades as the task horizon or branching complexity increases. Models falter when rule retrieval involves high branching factors (many actions per state) but perform better in scenarios requiring long memory spans (many states, few actions). Larger models demonstrate improved local accuracy but remain vulnerable under multi-step reasoning unless explicitly prompted to externalize intermediate steps.
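
The two regimes described above can be mimicked by varying the shape of the transition table. The snippet below reuses the hypothetical `make_fsm` helper from the earlier sketch; the specific sizes are arbitrary illustrations, not the study's configurations.

```python
# High branching factor: few states, many actions per state (harder rule retrieval).
states_hb, actions_hb, table_hb = make_fsm(num_states=4, num_actions=32)

# Long memory span: many states, few actions per state (reported to be easier).
states_lm, actions_lm, table_lm = make_fsm(num_states=32, num_actions=4)

# Sweeping the length of the action sequence probes the task horizon, while
# comparing the two tables probes branching complexity, mirroring the
# degradation pattern the study reports.
```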

Implications for LLM Design

These findings suggest an 'illusion of procedural reasoning,' where LLMs mimic algorithmic behavior for short traces but fail to sustain coherent execution as procedural depth grows. The research proposes that FSM-based evaluation could guide the development of inductive biases, memory mechanisms, and reasoning scaffolds to enhance LLMs' algorithmic reliability.

Conclusion

The study underscores the need for targeted improvements in LLM design to address procedural reasoning challenges. By leveraging frameworks like FSM Execution, researchers can better understand and mitigate the limitations of these models in complex, rule-based tasks.