Re-Schedule Algorithm Improves LLM Reinforcement Learning with Reasoning Trees
Published on October 30, 2025 at 04:00 AM
A new study introduces the Reasoning Tree Schedule (Re-Schedule), a data-scheduling algorithm designed to improve the efficiency and accuracy of Large Language Model (LLM) training under Reinforcement Learning with Verifiable Rewards (RLVR). Existing scheduling methods rely primarily on path-based difficulty metrics; Re-Schedule instead relies on a novel metric, the Reasoning Score (r-score), which measures a query's learning difficulty from the structure of its reasoning tree. This allows the algorithm to construct a curriculum that progresses from structurally simple to structurally complex queries.
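As a rough illustration of the curriculum idea, the sketch below orders training queries by an r-score so that structurally simple items come before complex ones. The `Query` class, field names, and example scores are assumptions made for illustration, not the paper's implementation.

```python
# Hypothetical sketch: ordering training queries by an r-score to form a
# simple-to-complex curriculum. Names and scores are illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class Query:
    text: str
    r_score: float  # assumed convention: lower = structurally simpler reasoning tree

def build_curriculum(queries: List[Query]) -> List[Query]:
    """Sort queries so training sees structurally simple ones first."""
    return sorted(queries, key=lambda q: q.r_score)

queries = [
    Query("Prove the triangle inequality.", r_score=0.72),
    Query("Compute 17 * 24.", r_score=0.08),
    Query("Solve x^2 - 5x + 6 = 0.", r_score=0.31),
]
for q in build_curriculum(queries):
    print(f"{q.r_score:.2f}  {q.text}")
```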
Re-Schedule operates in three main stages: reasoning tree construction, r-score calculation, and dynamic weighting. For each query, the algorithm builds an offline approximation of the reasoning tree by sampling multiple solution trajectories. It then computes the query's r-score by simulating the editing process on that tree, and folds the score into the RLVR loss function as a dynamic per-query weight.
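The minimal sketch below shows only the shape of this three-stage pipeline, under several assumptions: trajectories are lists of reasoning-step strings, the reasoning tree is approximated by a prefix trie over sampled trajectories, the r-score stand-in is average branching factor, and the weighting rule is a simple multiplicative factor on a per-query loss. None of these specifics come from the paper.

```python
# Sketch of the three Re-Schedule stages with placeholder choices (prefix-trie
# tree, branching-based score, multiplicative weight); assumptions, not the
# paper's actual formulas.
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def build_tree(trajectories: List[List[str]]) -> Dict[Tuple[str, ...], Set[str]]:
    """Stage 1: approximate the reasoning tree as a prefix trie of sampled steps."""
    children: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for traj in trajectories:
        for depth, step in enumerate(traj):
            children[tuple(traj[:depth])].add(step)
    return children

def r_score(children: Dict[Tuple[str, ...], Set[str]]) -> float:
    """Stage 2: placeholder structural difficulty score (average branching factor)."""
    if not children:
        return 0.0
    branching = [len(kids) for kids in children.values()]
    return sum(branching) / len(branching)

def weighted_rlvr_loss(base_loss: float, score: float, alpha: float = 0.1) -> float:
    """Stage 3: fold the r-score into the per-query loss as a dynamic weight
    (this weighting rule is an assumption, not the paper's formula)."""
    return (1.0 + alpha * score) * base_loss

trajs = [["factor", "set roots", "answer"], ["quadratic formula", "answer"]]
tree = build_tree(trajs)
print(weighted_rlvr_loss(base_loss=1.3, score=r_score(tree)))
```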
Experiments conducted on six math-reasoning benchmarks demonstrate that Re-Schedule significantly improves average accuracy, achieving gains of up to 3.2%. These results validate the approach and highlight that a structural understanding of the reasoning tree provides a more powerful foundation for RLVR data scheduling.
The paper's main contributions are the r-score metric, the Re-Schedule algorithm built on it, and empirical results showing improved average accuracy on complex reasoning tasks.