CATArena: A New Benchmark for Evaluating Learning Ability in LLM Agents Through Iterative Tournament Competitions
Published on November 4, 2025 at 05:00 AM
Shanghai Jiao Tong University, AGI-Eval, and Meituan researchers have unveiled CATArena, a new benchmark for evaluating the learning ability of Large Language Model (LLM) agents. CATArena addresses the limitations of current benchmarks that primarily focus on end-to-end performance in fixed scenarios, leading to score saturation and increased reliance on expert annotation.
CATArena's iterative, competitive peer-learning framework allows agents to refine their strategies through repeated interactions and feedback. The benchmark features four diverse board and card games with open-ended scoring, enabling continuous and dynamic evaluation of rapidly advancing agent capabilities. The four games are Gomoku, Texas Hold'em, Chess, and Bridge, each with variants designed to encourage strategy generalization and discourage rote memorization.
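To make the loop concrete, the following is a minimal sketch of how such an iterative peer-learning tournament could be organized. The agent interface (write_strategy, revise_strategy) and the game.play call are hypothetical illustrations, not CATArena's actual API.

```python
from itertools import combinations

def iterative_peer_learning(agents, game, rounds=3):
    """Sketch of an iterative peer-learning tournament (hypothetical interface).

    Each round: every agent submits strategy code, all pairs play a
    round-robin match, and each agent receives its own match outcomes
    as feedback before revising its strategy for the next round.
    """
    strategies = {a.name: a.write_strategy(game) for a in agents}
    history = []

    for _ in range(rounds):
        results = {}
        # Round-robin: each unordered pair of agents plays one match per round.
        for a, b in combinations(agents, 2):
            results[(a.name, b.name)] = game.play(strategies[a.name], strategies[b.name])
        history.append(results)

        # Feedback: each agent sees the outcomes it was involved in and revises.
        for a in agents:
            feedback = {pair: out for pair, out in results.items() if a.name in pair}
            strategies[a.name] = a.revise_strategy(game, strategies[a.name], feedback)

    return strategies, history
```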
Experimental results involving both minimal and commercial code agents demonstrate that CATArena provides reliable, stable, and scalable benchmarking of core agent abilities, particularly learning ability and strategy coding. The framework uses a scoring matrix to quantitatively assess strategy coding, learning, and generalizability. Experiments showed that the performance gap among LLMs is more pronounced with minimal agents than with commercial agents, suggesting that the agent framework significantly influences how effectively an LLM's capabilities are utilized. The analysis also revealed distinct performance distributions across tasks, highlighting the varied nature and difficulty of the challenges CATArena presents.
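As an illustration of how pairwise results could be aggregated into such a scoring matrix, here is a small sketch that assumes each match yields a numeric score for the first-listed agent; the exact aggregation and metric definitions in CATArena may differ.

```python
import numpy as np

def score_matrix(match_results, agent_names):
    """Aggregate pairwise match outcomes into an n x n score matrix (sketch).

    match_results maps (agent_i, agent_j) -> points agent_i earned against
    agent_j (e.g., game wins or chips). Entry [i, j] is agent i's average
    score against agent j; row means give an overall score per agent.
    """
    n = len(agent_names)
    idx = {name: k for k, name in enumerate(agent_names)}
    totals = np.zeros((n, n))
    counts = np.zeros((n, n))
    for (a, b), points in match_results.items():
        totals[idx[a], idx[b]] += points
        counts[idx[a], idx[b]] += 1
    matrix = np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)
    overall = matrix.sum(axis=1) / max(n - 1, 1)  # mean score over all opponents
    return matrix, dict(zip(agent_names, overall))
```

Under this reading, learning ability could be proxied by the change in an agent's overall score across tournament rounds, and generalizability by comparing scores on the game variants; these particular metrics are illustrative rather than the paper's definitions.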
The researchers emphasize that CATArena's iterative peer-learning framework is easily extensible to new tasks, paving the way for evaluating other fundamental agent abilities such as machine learning and multilingual adaptability. While the current evaluation is limited to four games, future work will introduce a wider variety of more complex tasks to assess agents' learning and other abilities from different perspectives.