CATArena: New AI Tournament Platform Evaluates LLM Agent Learning Through Competitive Games

Published on November 4, 2025 at 05:00 AM
A new evaluation platform, CATArena, has been developed to address the growing need for robust benchmarks that measure the learning capabilities of Large Language Model (LLM) agents. Researchers from Shanghai Jiao Tong University, AGI-Eval, and Meituan introduced the platform, which features an iterative, competitive peer-learning framework designed to assess both self-improvement and peer learning in LLM agents. Existing benchmarks often focus on end-to-end performance in fixed scenarios, which leads to score saturation and a dependence on expert annotation. CATArena aims to overcome these limitations by providing an environment where agents refine their strategies through repeated interactions and feedback in open-ended games.

The platform includes four diverse board and card games with open-ended scoring, allowing evaluation to remain continuous and dynamic as agent capabilities advance. The framework evaluates strategy coding, learning ability, and generalizability, and experiments with both minimal and commercial code agents demonstrate that CATArena offers a reliable, stable, and scalable method for benchmarking core agent abilities.

The iterative peer-learning framework at the heart of CATArena requires agents to revise and update their strategies based on the outcomes and policies observed in previous rounds. After each update, the agents' policy codes compete against each other, generating dynamic performance rankings (see the sketch below). The platform is extensible and adaptable to other types of open-ended tasks, facilitating the assessment of core agent abilities in new domains.

The researchers conducted comparative performance evaluations and data analysis using self-developed minimal code agents as well as state-of-the-art commercial code agents. Results consistently showed that CATArena provides stable and reliable benchmarks for assessing agent capabilities and the agentic potential of the underlying LLMs. In future work, the team plans to introduce a wider variety of more complex tasks to evaluate agents' learning and other abilities from additional perspectives.
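To make the iterative peer-learning loop concrete, here is a minimal Python sketch of its general structure: in each round, every agent's policy plays a round-robin against its peers, a ranking is produced from the match outcomes, and each agent then revises its strategy using those outcomes and the peer policies it observed. All names (`Agent`, `play_match`, `run_tournament`) and the placeholder numeric "strategy" are illustrative assumptions, not CATArena's actual interface.

```python
# Hypothetical sketch of an iterative peer-learning tournament loop.
# Names and the toy game are illustrative, not CATArena's actual API.
import itertools
import random
from collections import defaultdict


class Agent:
    """A stand-in for an LLM code agent that maintains a strategy."""

    def __init__(self, name):
        self.name = name
        self.strategy = random.random()  # placeholder for generated strategy code

    def revise_strategy(self, own_wins, observed_strategies):
        """Update the strategy using last round's outcomes and peers' policies.
        In CATArena the agent would rewrite its strategy code; here we simply
        nudge a numeric placeholder toward the best-performing peer's value."""
        best_peer = max(observed_strategies, key=observed_strategies.get)
        self.strategy = 0.5 * self.strategy + 0.5 * observed_strategies[best_peer]


def play_match(a, b):
    """Placeholder game: the higher 'strategy' value wins (ties broken randomly)."""
    if a.strategy == b.strategy:
        return random.choice([a, b])
    return a if a.strategy > b.strategy else b


def run_tournament(agents, rounds=3):
    for rnd in range(1, rounds + 1):
        wins = defaultdict(int)
        # Round-robin: every pair of agent policies competes.
        for a, b in itertools.combinations(agents, 2):
            wins[play_match(a, b).name] += 1
        # Dynamic ranking for this round, derived from match outcomes.
        ranking = sorted(agents, key=lambda ag: wins[ag.name], reverse=True)
        print(f"Round {rnd} ranking:", [(ag.name, wins[ag.name]) for ag in ranking])
        # Peer learning: each agent observes outcomes and peers' policies,
        # then revises its own strategy before the next round.
        observed = {ag.name: ag.strategy for ag in agents}
        for ag in agents:
            ag.revise_strategy(wins[ag.name], observed)


if __name__ == "__main__":
    run_tournament([Agent(f"agent_{i}") for i in range(4)])
```

In the real platform, the "strategy" would be executable game-playing code rewritten by the LLM agent each round, and rankings would come from the games' open-ended scores rather than simple win counts.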