CATArena: New AI Tournament Platform Emerges to Evaluate Learning and Strategy Coding in LLM Agents

Published on November 3, 2025 at 05:00 AM

A new benchmark called CATArena has been launched to address the growing need for comprehensive evaluation of Large Language Model (LLM) agents. Developed by researchers from Shanghai Jiao Tong University, AGI-Eval, and Meituan, CATArena focuses on assessing core skills such as learning ability and strategy coding, moving beyond traditional end-to-end performance metrics.

CATArena uses an iterative peer-learning framework in which agents refine their strategies through repeated interactions and feedback in a tournament setting. The platform features four diverse board and card games with open-ended scoring, allowing continuous, dynamic evaluation without score saturation. This design supports systematic measurement and analysis of fundamental agent sub-abilities; a simplified sketch of such a tournament loop appears after this summary.

Experiments with both minimal and commercial code agents show that CATArena provides a reliable, stable, and scalable way to benchmark core agent abilities. Its extensible architecture can be adapted to a range of open-ended tasks, supporting ongoing evaluation as agent capabilities advance. Key findings highlight performance differences among LLMs, varied ranking orders across different agent capabilities, and diverse performance distributions across tasks, underscoring CATArena's utility in guiding further optimization of LLMs and agent frameworks.
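
The announcement does not include implementation details, but the iterative peer-learning tournament can be pictured roughly as follows. The Python sketch below is a hypothetical stand-in, not CATArena's actual API: the `Agent` class, the `play_match` scoring, and the `refine` update rule are assumptions made purely to illustrate how agents could compete in round-robin rounds, receive relative feedback, and refine their strategies between rounds.

```python
# Illustrative sketch only: class names, scoring, and the learning rule below are
# hypothetical stand-ins for the iterative peer-learning tournament described above.
import random
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Agent:
    name: str
    skill: float                          # stand-in for the quality of a coded strategy
    history: list = field(default_factory=list)

    def refine(self, feedback: float) -> None:
        """Update the strategy from tournament feedback (placeholder learning rule)."""
        self.skill += 0.1 * feedback
        self.history.append(self.skill)


def play_match(a: Agent, b: Agent) -> tuple[float, float]:
    """Open-ended scoring: return unbounded scores rather than a binary win/loss."""
    return a.skill + random.gauss(0, 0.5), b.skill + random.gauss(0, 0.5)


def run_round(agents: list[Agent]) -> dict[str, float]:
    """One round-robin round over all agent pairs; accumulate per-agent scores."""
    totals = {agent.name: 0.0 for agent in agents}
    for a, b in combinations(agents, 2):
        sa, sb = play_match(a, b)
        totals[a.name] += sa
        totals[b.name] += sb
    return totals


def tournament(agents: list[Agent], rounds: int = 5) -> None:
    """Iterative peer learning: after each round, agents refine from relative feedback."""
    for r in range(rounds):
        totals = run_round(agents)
        mean_score = sum(totals.values()) / len(totals)
        for agent in agents:
            agent.refine(totals[agent.name] - mean_score)
        ranking = sorted(totals, key=totals.get, reverse=True)
        print(f"round {r + 1}: {ranking}")


if __name__ == "__main__":
    random.seed(0)
    tournament([Agent("minimal_agent", 0.5), Agent("commercial_agent", 0.8)])
```

In this toy version, the per-round ranking plays the role of the tournament leaderboard, while the feedback-driven `refine` step mirrors the idea that learning ability, not just end-to-end performance, is what the repeated rounds expose.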