CATArena: A Novel Benchmark for Evaluating Learning Abilities of LLM Agents in Competitive Tournaments
Published on November 4, 2025
Researchers from Shanghai Jiao Tong University, AGI-Eval, and Meituan have introduced CATArena, a novel evaluation platform designed to assess the learning abilities of Large Language Model (LLM) agents. Unlike existing benchmarks, which primarily measure end-to-end performance in fixed scenarios, CATArena treats learning ability, including self-improvement and peer learning, as a core driver of agent evolution.
CATArena employs an iterative, competitive peer-learning framework, where agents refine their strategies through repeated interactions and feedback. The platform features four diverse board and card games with open-ended scoring, enabling continuous and dynamic evaluation of rapidly advancing agent capabilities. This tournament-style environment addresses the score saturation issue in current benchmarks by providing tasks without explicit upper score limits.
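To make the mechanism concrete, the minimal Python sketch below illustrates the general shape of an iterative, competitive peer-learning tournament. The names used here (Agent, play_match, revise_strategy) are illustrative assumptions, not CATArena's actual interfaces.

```python
import itertools
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Hypothetical agent wrapper: holds a strategy (e.g., LLM-generated code) it can revise."""
    name: str
    strategy: str = "baseline"            # placeholder for generated strategy code
    history: list = field(default_factory=list)

def play_match(a: Agent, b: Agent) -> tuple[float, float]:
    """Stand-in for one game between two agents, returning each side's score.
    A real harness would execute the agents' strategy code inside a game engine."""
    return random.random(), random.random()

def revise_strategy(agent: Agent, feedback: list) -> str:
    """Stand-in for the LLM rewriting its strategy given the previous round's match feedback."""
    return agent.strategy + f"+rev{len(agent.history)}"

def run_tournament(agents: list, rounds: int = 3) -> None:
    for _ in range(rounds):
        round_feedback = {a.name: [] for a in agents}
        # Round-robin: every pair of agents plays once per round.
        for a, b in itertools.combinations(agents, 2):
            sa, sb = play_match(a, b)
            round_feedback[a.name].append((b.name, sa, sb))
            round_feedback[b.name].append((a.name, sb, sa))
        # Peer-learning step: each agent revises its strategy using this round's feedback.
        for a in agents:
            a.history.append(round_feedback[a.name])
            a.strategy = revise_strategy(a, round_feedback[a.name])

if __name__ == "__main__":
    run_tournament([Agent("agent_a"), Agent("agent_b"), Agent("minimal_baseline")])
```

In a full harness, play_match would run LLM-written strategy code inside one of the game environments, and revise_strategy would prompt the model with the prior round's match logs so the next strategy can counter what opponents did.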
Experiments involving minimal and commercial code agents demonstrate that CATArena provides reliable, stable, and scalable benchmarking for core agent abilities, particularly learning ability and strategy coding. The evaluation metrics include strategy coding ability, global learning improvement, counter-adaptation, self-improvement, and generalizability to novel game rules. The researchers highlight that CATArena's tasks are fundamentally different from traditional LLM reasoning tasks, representing a novel evaluation dimension.
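As a rough illustration of how a round-based metric might be tabulated from tournament scores, the sketch below computes each agent's score change between the first and last round. The metric name and formula here are illustrative assumptions, not the paper's exact definitions.

```python
def global_learning_improvement(scores_by_round: list[dict[str, float]]) -> dict[str, float]:
    """Illustrative only: each agent's change in average score between the first and
    last round of a tournament. The paper's actual metric definitions may differ."""
    first, last = scores_by_round[0], scores_by_round[-1]
    return {name: last[name] - first[name] for name in first}

# Example: three agents' average per-round scores over three rounds.
rounds = [
    {"A": 0.40, "B": 0.55, "C": 0.50},
    {"A": 0.48, "B": 0.53, "C": 0.56},
    {"A": 0.57, "B": 0.52, "C": 0.60},
]
print(global_learning_improvement(rounds))
# roughly {'A': 0.17, 'B': -0.03, 'C': 0.10}, up to floating-point rounding
```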
The iterative peer-learning framework allows agents to continuously revise their strategies based on feedback from previous rounds, so that agent evolution parallels the way humans learn through repeated competition. CATArena extends to diverse, open-ended tasks and supports ongoing evaluation without requiring expert-level human annotation.
The current evaluation is limited by its small set of games, which primarily assess learning ability and strategy coding rather than the full spectrum of potential LLM agent capabilities. Future work plans to incorporate a wider variety of more complex tasks to evaluate agents' learning and other abilities from additional perspectives.