CATArena: A New Benchmark for Evaluating LLM Agent Learning Through Competitive Tournaments

Published on November 3, 2025 at 05:00 AM
A new study introduces CATArena, a benchmark designed to evaluate the learning capabilities of Large Language Model (LLM) agents. Unlike existing benchmarks, which primarily assess end-to-end performance in fixed scenarios, CATArena focuses on an agent's ability to learn and adapt through iterative peer-learning and competition.

In the CATArena framework, agents refine and optimize their strategies through repeated interaction and feedback in tournament-style competitions built around four diverse board and card games. An open-ended scoring system, with no explicit upper score limit, allows agent capabilities to be evaluated continuously and dynamically. The benchmark includes metrics for strategy coding, learning ability (global learning, counter-adaptation, and self-improvement), and generalizability.

Experiments with both minimal and commercial code agents show that CATArena provides a reliable, stable, and scalable platform for benchmarking core agent abilities. Notably, the performance gap among LLMs is more pronounced with minimal agents than with commercial agents, underscoring the influence of the agent framework itself. With its iterative peer-learning approach and extensible evaluation across diverse tasks, CATArena offers a valuable tool for advancing LLM agent research and development.
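To make the tournament-style peer-learning loop concrete, the sketch below shows one plausible shape for such an evaluation: agents play a round-robin, receive feedback about their opponents, and revise their strategies before the next round. This is a minimal illustration only; the `Agent` class, the toy `play_match` game, and the `revise` heuristic are assumptions made for this example and are not the CATArena API or its games.

```python
"""Illustrative sketch of a tournament-style peer-learning loop.

NOT the CATArena implementation: the Agent interface, the toy game,
and the round-robin scheduling below are assumptions for illustration.
"""

import itertools
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Placeholder agent holding a numeric 'strategy' it can revise."""
    name: str
    strategy: float = 0.5           # toy stand-in for a coded game strategy
    history: list = field(default_factory=list)

    def revise(self, feedback):
        """Nudge the strategy toward the strongest opponent seen this round
        (stands in for LLM-driven strategy rewriting from peer feedback)."""
        if feedback:
            best_rival = max(feedback, key=lambda f: f["opponent_score"])
            self.strategy += 0.3 * (best_rival["opponent_strategy"] - self.strategy)
        self.history.append(self.strategy)


def play_match(a: Agent, b: Agent) -> tuple[float, float]:
    """Toy match: strategies closer to a hidden optimum score higher."""
    optimum = 0.8
    score_a = 1.0 - abs(a.strategy - optimum) + random.uniform(-0.05, 0.05)
    score_b = 1.0 - abs(b.strategy - optimum) + random.uniform(-0.05, 0.05)
    return score_a, score_b


def run_tournament(agents, rounds=5):
    """Round-robin play followed by feedback-driven revision, each round."""
    for rnd in range(rounds):
        feedback = {agent.name: [] for agent in agents}
        totals = {agent.name: 0.0 for agent in agents}
        for a, b in itertools.combinations(agents, 2):
            sa, sb = play_match(a, b)
            totals[a.name] += sa
            totals[b.name] += sb
            feedback[a.name].append({"opponent_strategy": b.strategy, "opponent_score": sb})
            feedback[b.name].append({"opponent_strategy": a.strategy, "opponent_score": sa})
        print(f"round {rnd}: " + ", ".join(f"{n}={s:.2f}" for n, s in totals.items()))
        for agent in agents:        # peer-learning step between rounds
            agent.revise(feedback[agent.name])


if __name__ == "__main__":
    random.seed(0)
    run_tournament([Agent("A", 0.2), Agent("B", 0.5), Agent("C", 0.9)])
```

Because the scores are open-ended totals rather than pass/fail outcomes, a loop of this shape can keep differentiating agents as they improve, which mirrors the benchmark's goal of continuous, dynamic evaluation without an explicit upper score limit.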