CATArena: A New Tournament-Style Benchmark for Evaluating LLM Agent Learning Abilities

Published on November 3, 2025 at 05:00 AM
Researchers from Shanghai Jiao Tong University, AGI-Eval, and Meituan have unveiled CATArena, a benchmark designed to evaluate the learning abilities of Large Language Model (LLM) agents. Addressing the limitations of current benchmarks that focus on end-to-end performance in fixed scenarios, CATArena treats learning ability, including self-improvement and peer learning, as a key driver of agent evolution toward human-level intelligence. Its iterative, competitive peer-learning framework lets agents refine and optimize their strategies through repeated interactions and feedback.

CATArena features four diverse board and card games with open-ended scoring, mitigating score-saturation issues, and its tournament-style evaluation platform enables continuous, dynamic assessment of rapidly advancing agent capabilities. Experiments and analyses with both minimal and commercial code agents demonstrate CATArena's reliability, stability, and scalability in benchmarking core agent abilities, particularly learning ability and strategy coding. Key findings include that commercial agents exhibit stronger learning capabilities than minimal agents, and that the performance gap among LLMs is more pronounced with minimal agents.

The framework also extends to new tasks, demonstrated through the introduction of a Machine Learning (ML) track and a multilingual track, where results indicate substantial room for improvement in current agents. CATArena's open and flexible architecture is poised to support ongoing research and benchmarking for future intelligent agents.
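To make the iterative, tournament-style peer-learning loop concrete, here is a minimal sketch of how such an evaluation might be structured. It is not taken from the CATArena codebase; all names (`Agent`, `play_match`, `revise`, `run_tournament`) are hypothetical, and the actual benchmark's games, scoring, and feedback format will differ.

```python
# Hypothetical sketch of a tournament-style peer-learning loop, in the spirit
# of CATArena. Names and interfaces are assumptions, not the benchmark's API.
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List

@dataclass
class Agent:
    name: str
    strategy_code: str  # the game strategy the agent has written so far
    # revise(own_code, peer_codes, scores) -> new strategy code
    revise: Callable[[str, List[str], Dict[str, float]], str]

def run_tournament(agents: List[Agent],
                   play_match: Callable[[str, str], float],
                   rounds: int = 3) -> Dict[str, float]:
    """Round-robin play repeated over several learning rounds.

    play_match(code_a, code_b) returns the (open-ended) score the first
    strategy earns against the second, so totals are not capped.
    """
    scores: Dict[str, float] = {a.name: 0.0 for a in agents}
    for _ in range(rounds):
        # 1) Every pair of agents plays; open-ended scores accumulate.
        for a, b in combinations(agents, 2):
            scores[a.name] += play_match(a.strategy_code, b.strategy_code)
            scores[b.name] += play_match(b.strategy_code, a.strategy_code)
        # 2) Peer learning: each agent inspects the others' strategies and the
        #    current standings, then rewrites its own strategy for the next round.
        for agent in agents:
            peer_codes = [o.strategy_code for o in agents if o is not agent]
            agent.strategy_code = agent.revise(agent.strategy_code, peer_codes, scores)
    return scores
```

Under these assumptions, an agent's learning ability shows up as score improvement across rounds rather than performance on a single fixed task.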