CATArena: New AI Benchmark Uses Iterative Tournaments to Evaluate Learning in LLM Agents
Published on November 4, 2025 at 05:00 AM
A new benchmark called CATArena has been developed to evaluate the learning capabilities of Large Language Model (LLM) agents. Proposed by researchers from Shanghai Jiao Tong University, AGI-Eval, and Meituan, CATArena moves beyond traditional fixed-scenario assessments by using iterative, competitive peer learning, in which agents refine their strategies through repeated interactions and feedback.
The CATArena platform features a tournament-style environment with four diverse board and card games that provide open-ended scoring. This approach enables continuous and dynamic evaluation of rapidly advancing agent capabilities, addressing the score saturation issues found in current benchmarks. The core abilities assessed include strategy coding, learning ability (self-improvement and peer-learning), and generalizability.
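To make the iterative tournament structure concrete, the sketch below outlines one plausible evaluation loop: agents submit strategy code, play round-robin matches across the games, and then revise their strategies from match feedback. All names here (Agent, play_match, revise_strategy, run_tournament) are hypothetical illustrations, not the actual CATArena interfaces.

```python
# Minimal sketch of an iterative, tournament-style evaluation loop.
# All names and interfaces are hypothetical; the real CATArena code may differ.
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, List


@dataclass
class Agent:
    name: str
    strategy_code: str                      # strategy authored by the LLM agent
    history: List[dict] = field(default_factory=list)


def play_match(game: str, a: Agent, b: Agent) -> Dict[str, float]:
    """Placeholder: run one game between two strategies and return
    open-ended scores keyed by agent name."""
    raise NotImplementedError("plug in a real game engine here")


def revise_strategy(agent: Agent, feedback: List[dict]) -> str:
    """Placeholder: ask the LLM agent to rewrite its strategy given
    its match logs (peer learning / self-improvement step)."""
    raise NotImplementedError("plug in an LLM call here")


def run_tournament(agents: List[Agent], games: List[str], rounds: int) -> Dict[str, float]:
    totals = {a.name: 0.0 for a in agents}
    for _ in range(rounds):
        # Round-robin: every pair of agents plays every game this round.
        for game in games:
            for a, b in combinations(agents, 2):
                scores = play_match(game, a, b)
                for agent in (a, b):
                    agent.history.append({"game": game, "scores": scores})
                    totals[agent.name] += scores[agent.name]
        # Iteration step: each agent revises its strategy from accumulated feedback.
        for agent in agents:
            agent.strategy_code = revise_strategy(agent, agent.history)
    return totals
```

Because scoring is open-ended rather than capped, repeated rounds of this kind of loop can keep separating agents even as their strategies improve, which is how the benchmark sidesteps score saturation.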
Experimental results involving both minimal and commercial code agents demonstrate CATArena's reliability, stability, and scalability for benchmarking core agent abilities. The framework supports a granular assessment of basic coding skills as well as advanced learning capabilities. The iterative peer-learning design is intended to mirror how humans improve through repeated competition and feedback, offering valuable insights into the learning abilities of LLM agents.
The CATArena benchmark is extensible: new tasks and domains can be incorporated to evaluate other fundamental agent abilities, such as machine learning and multilingual code generation. The researchers plan to expand the variety and complexity of tasks to provide a more comprehensive evaluation of LLM agent potential.