Re-Schedule Algorithm Improves LLM Reinforcement Learning with Reasoning Trees

Published on October 30, 2025 at 04:00 AM

A new study introduces the Re-Schedule algorithm, a data scheduling method designed to improve the efficiency and accuracy of Large Language Model (LLM) training through Reinforcement Learning with Verifiable Rewards (RLVR). The algorithm addresses limitations in existing methods by incorporating a novel metric called the Reasoning Score (r-score), which measures a query's learning difficulty based on the structure of its reasoning tree.

Re-Schedule consists of three stages: reasoning tree construction, r-score calculation, and dynamic weighting. By constructing an offline approximation of each query's reasoning tree and integrating the r-score into the RLVR loss function, the algorithm creates a curriculum that progresses from simple to complex queries, improving both training efficiency and final accuracy.
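The article does not give the paper's exact r-score formula, so the sketch below assumes a simple structural proxy: the fraction of leaves in an offline-approximated reasoning tree (built from sampled rollouts) that a verifier marks correct. The `Node` class, `leaves` helper, and this particular score definition are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    step: str                      # partial reasoning text at this node
    correct: bool = False          # verifier outcome; meaningful only at leaves
    children: list = field(default_factory=list)

def leaves(node):
    """Collect all leaf nodes of the reasoning tree."""
    if not node.children:
        return [node]
    out = []
    for child in node.children:
        out.extend(leaves(child))
    return out

def r_score(root):
    """Hypothetical difficulty proxy: share of rollout leaves verified correct.
    A high score suggests an easy query (most reasoning paths succeed)."""
    ls = leaves(root)
    return sum(leaf.correct for leaf in ls) / len(ls)

# Toy tree for one query: two branches, three leaves, one verified correct.
root = Node("query", children=[
    Node("step a", children=[Node("a1", correct=True), Node("a2")]),
    Node("step b", children=[Node("b1")]),
])
print(r_score(root))  # 1 correct leaf out of 3
```

In practice the tree would be approximated offline by sampling multiple rollouts per query and merging shared reasoning prefixes, so the score can be computed once before RL training begins.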

Key Features of Re-Schedule

  • Reasoning tree construction for structural understanding
  • r-score calculation for measuring learning difficulty
  • Dynamic weighting of queries in the RLVR loss function

Experimental Results

Experiments on six math-reasoning benchmarks demonstrated that Re-Schedule significantly improves average accuracy, with gains of up to 3.2%. These results validate the approach and highlight the importance of a structural understanding of reasoning trees in RLVR data scheduling.

Implications for LLM Training

The Re-Schedule algorithm offers a promising approach to enhancing LLM training efficiency. By focusing on the structural complexity of queries, it provides a more effective framework for curriculum learning, potentially leading to better performance on complex reasoning tasks.