CATArena: A Novel Benchmark for Evaluating Learning Ability in LLM Agents Through Iterative Tournament Competitions
Published on November 3, 2025
Researchers from Shanghai Jiao Tong University, AGI-Eval, and Meituan have introduced CATArena, a novel benchmark for evaluating the learning abilities of Large Language Model (LLM) agents. CATArena utilizes an iterative, competitive peer-learning framework, enabling agents to refine their strategies through repeated interactions and feedback.
Existing benchmarks often focus on end-to-end performance in fixed scenarios, limiting the assessment of an agent’s overall capabilities and leading to score saturation. CATArena addresses these limitations by featuring four diverse board and card games with open-ended scoring, allowing for continuous and dynamic evaluation as agent capabilities advance.
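One way to read "open-ended scoring" is as relative, head-to-head evaluation rather than a capped pass rate: rankings can keep shifting even as every competitor improves. The snippet below is only an illustrative sketch of that idea using a standard Elo update; the paper's actual scoring rules are not reproduced here, and `elo_update` is a hypothetical helper.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one match; score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: a 1200-rated agent beats a 1400-rated one and gains roughly 24 points.
print(elo_update(1200.0, 1400.0, 1.0))  # -> (~1224.3, ~1375.7)
```

Because such ratings are unbounded and defined only relative to the current pool of opponents, they avoid the saturation that fixed, absolute scores eventually hit.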
The framework comprises an initial strategy development phase and an iterative improvement phase, in which agents revise their strategies based on the outcomes and opponent policies observed in previous rounds of competition. The platform assesses strategy coding, learning (including global learning, counter-adaptation, and self-improvement), and generalizability.
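As a rough mental model of these two phases, the sketch below shows an initial strategy-writing pass followed by rounds of round-robin play and revision. The `Agent`, `write_strategy`, `revise_strategy`, and `play` interfaces are hypothetical stand-ins rather than CATArena's actual API; in the benchmark, agents author executable game-playing programs and receive richer match feedback.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, Dict, List, Optional, Tuple

# A "strategy" is treated here as an opaque callable; in CATArena it would be
# the game-playing program an agent writes and later revises.
Strategy = Callable[..., object]


@dataclass
class Agent:
    name: str
    write_strategy: Callable[[str], Strategy]                    # phase 1: author a strategy for a named game
    revise_strategy: Callable[[Strategy, List[dict]], Strategy]  # phase 2: improve it from match feedback
    strategy: Optional[Strategy] = None
    history: List[dict] = field(default_factory=list)            # outcomes and opponent behaviour seen so far


def round_robin(agents: List[Agent],
                play: Callable[[Strategy, Strategy], Tuple[float, float]]) -> Dict[str, float]:
    """Play every pair of agents once and return this round's scores."""
    scores = {a.name: 0.0 for a in agents}
    for a, b in combinations(agents, 2):
        score_a, score_b = play(a.strategy, b.strategy)
        scores[a.name] += score_a
        scores[b.name] += score_b
        # Record feedback each agent can learn from before the next round.
        a.history.append({"opponent": b.name, "my_score": score_a, "opp_score": score_b})
        b.history.append({"opponent": a.name, "my_score": score_b, "opp_score": score_a})
    return scores


def run_tournament(agents: List[Agent], game: str,
                   play: Callable[[Strategy, Strategy], Tuple[float, float]],
                   rounds: int = 3) -> List[Dict[str, float]]:
    # Initial strategy development phase.
    for agent in agents:
        agent.strategy = agent.write_strategy(game)

    # Iterative improvement phase: compete, observe, revise, repeat.
    per_round_scores = []
    for _ in range(rounds):
        per_round_scores.append(round_robin(agents, play))
        for agent in agents:
            agent.strategy = agent.revise_strategy(agent.strategy, agent.history)
    return per_round_scores
```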
Experiments with both minimal and commercial code agents show that CATArena provides a reliable, stable, and scalable benchmarking environment for core agent abilities, particularly learning ability and strategy coding. By having agents improve through repeated competition and feedback, the iterative peer-learning framework parallels how human ability evolves, and it supports extensible evaluation across diverse, open-ended tasks.
While the current evaluation is limited to four games, the researchers plan to expand CATArena with more complex tasks to evaluate a broader range of LLM agent capabilities.