CATArena: A New Tournament-Based Benchmark for Evaluating Learning Abilities of LLM Agents

Researchers from Shanghai Jiao Tong University, AGI-Eval, and Meituan have unveiled CATArena, a tournament-style evaluation platform designed to assess the learning capabilities of Large Language Model (LLM) agents. The benchmark targets a limitation of existing evaluation methods, which often measure end-to-end performance in fixed scenarios and therefore cannot comprehensively evaluate an agent's core abilities, particularly learning.

CATArena employs an iterative peer-learning framework in which agents refine their strategies through repeated interactions and feedback across four diverse board and card games. Because the tasks have no explicit upper score limit, the platform supports continuous, dynamic evaluation of agents as their capabilities evolve. The games include variants intended to encourage strategy generalization and reduce rote memorization. The framework assesses three things: strategy coding, learning ability (global learning, counter-adaptation, and self-improvement), and generalizability.

Experiments with both minimal and commercial code agents indicate that CATArena provides reliable, stable, and scalable benchmarking of core agent abilities. The performance gap among LLMs is more pronounced in minimal agents than in commercial agents, suggesting that the underlying agent framework significantly influences how effectively an LLM's capabilities are used. The collective learning dynamics across tasks show that agents can improve their strategies in simpler environments, while their learning capacity remains limited in more challenging tasks.

The researchers also demonstrated that the iterative peer-learning framework extends to new tasks, such as a Machine Learning (ML) track and a multi-lingual track, highlighting its adaptability for evaluating other fundamental agent abilities. The current evaluation is limited to four games; future work plans to introduce a wider variety of more complex tasks.
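
To make the iterative peer-learning idea concrete, the following Python sketch runs a repeated round-robin tournament in which each agent sees the previous round's results before submitting its next strategy. It is a toy illustration under stated assumptions, not CATArena's implementation: the game is one-shot rock-paper-scissors rather than one of the benchmark's board or card games, a "strategy" is a single move rather than agent-written code, and every name here (`Report`, `round_robin`, `run_iterations`, `always_rock`, `counter_leader`) is hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List, Optional

# Toy stand-in for a game: one-shot rock-paper-scissors.
# BEATS[x] is the move that x defeats.
BEATS = {"R": "S", "P": "R", "S": "P"}


@dataclass
class Report:
    """Feedback handed to every agent after a round (hypothetical format)."""
    moves: Dict[str, str]   # strategy each agent submitted
    scores: Dict[str, int]  # wins each agent collected in the round robin


# An agent maps (its own name, last round's report or None) to a move.
Agent = Callable[[str, Optional[Report]], str]


def round_robin(moves: Dict[str, str]) -> Dict[str, int]:
    """Play every pair of submitted strategies once and count wins."""
    scores = {name: 0 for name in moves}
    for a, b in combinations(moves, 2):
        if BEATS[moves[a]] == moves[b]:
            scores[a] += 1
        elif BEATS[moves[b]] == moves[a]:
            scores[b] += 1
    return scores


def run_iterations(agents: Dict[str, Agent], rounds: int) -> List[Report]:
    """Repeat: collect strategies, run the tournament, feed results back."""
    history: List[Report] = []
    report: Optional[Report] = None
    for _ in range(rounds):
        moves = {name: agent(name, report) for name, agent in agents.items()}
        report = Report(moves=moves, scores=round_robin(moves))
        history.append(report)
    return history


def always_rock(name: str, report: Optional[Report]) -> str:
    """A non-learning baseline that ignores feedback."""
    return "R"


def counter_leader(name: str, report: Optional[Report]) -> str:
    """Counter the best-scoring rival's previous move (a crude learner)."""
    if report is None:
        return "S"  # arbitrary opening move
    rivals = {n: s for n, s in report.scores.items() if n != name}
    leader = max(rivals, key=rivals.get)
    target = report.moves[leader]
    return next(move for move, beaten in BEATS.items() if beaten == target)


if __name__ == "__main__":
    agents = {"rock_a": always_rock, "rock_b": always_rock, "learner": counter_leader}
    for i, rep in enumerate(run_iterations(agents, rounds=3), start=1):
        print(f"round {i}: moves={rep.moves} scores={rep.scores}")
```

In this toy setup the learner loses both of its matches in the first round, then switches to the move that beats its rivals and wins both matches in later rounds. The per-round score trajectory is a rough analogue of the kind of learning signal the benchmark is described as tracking across iterations.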