CATArena: A New Benchmark for Evaluating Learning Abilities of LLM Agents Through Iterative Tournament Competitions

Published on November 3, 2025 at 05:00 AM
A team of researchers has unveiled CATArena, a platform designed to evaluate the learning capabilities of Large Language Model (LLM) agents. As LLM agents evolve to autonomously complete complex tasks, existing benchmarks fall short: they primarily assess end-to-end performance in fixed scenarios, which restricts evaluation to specific skills and leads to score saturation.

CATArena introduces an iterative, competitive peer-learning framework in which agents refine and optimize their strategies through repeated interactions and feedback, allowing systematic evaluation of two forms of learning ability: self-improvement and peer-learning. The platform features four board and card games (Gomoku, Texas Hold'em, Chess, and Bridge) with open-ended scoring, addressing the score saturation issue in current benchmarks; each game comes in standard and variant rules to encourage strategy generalization and discourage rote memorization. The tournament runs over multiple rounds in which agents analyze the code and logs from previous rounds to improve their strategies.

Experimental results and analyses, covering both minimal and commercial code agents, show that CATArena provides reliable, stable, and scalable benchmarking of core agent abilities, particularly learning ability and strategy coding. The research highlights that CATArena's strategy coding tasks are fundamentally different from traditional LLM reasoning tasks, representing a novel evaluation dimension. The framework also facilitates assessment of core agent abilities in new domains: the iterative peer-learning setup is extensible to new tasks for evaluating other fundamental agent abilities, as demonstrated by the introduction of Machine Learning (ML) and multi-lingual tracks.
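To make the tournament format more concrete, the sketch below outlines one way such an iterative peer-learning loop could be organized. It is a minimal illustration under stated assumptions, not the CATArena implementation: the `Agent` class, its `revise_strategy` method, and the `play_match` function are hypothetical placeholders standing in for an LLM agent that writes game-playing code and a game engine that runs two submitted strategies against each other.

```python
"""Minimal sketch of an iterative peer-learning tournament loop.

Assumptions (not the CATArena API): Agent, revise_strategy, and play_match
are hypothetical stand-ins for an LLM code agent and a game engine.
"""

import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    strategy_code: str                            # strategy submitted this round
    history: list = field(default_factory=list)   # (opponent_code, log, result) tuples

    def revise_strategy(self) -> None:
        """Placeholder learning step: a real LLM agent would read the opponents'
        code and match logs in self.history and rewrite its own strategy."""
        self.strategy_code += f"  # revised after {len(self.history)} observations"


def play_match(code_a: str, code_b: str) -> tuple[int, str]:
    """Placeholder game engine: returns (result, log), where result is +1 if A
    wins, -1 if B wins, 0 for a draw. A real engine would execute both strategies."""
    result = random.choice([1, -1, 0])
    log = f"match({len(code_a)} vs {len(code_b)} chars) -> {result}"
    return result, log


def run_tournament(agents: list[Agent], rounds: int) -> dict[str, int]:
    """Round-robin play repeated over several rounds. After each round, every
    agent inspects the opponents' submitted code and the match logs, then
    revises its strategy before the next round (the peer-learning step)."""
    scores = {a.name: 0 for a in agents}
    for _ in range(rounds):
        for i, a in enumerate(agents):
            for b in agents[i + 1:]:
                result, log = play_match(a.strategy_code, b.strategy_code)
                scores[a.name] += max(result, 0)
                scores[b.name] += max(-result, 0)
                # Both sides keep the opponent's code and the log for learning.
                a.history.append((b.strategy_code, log, result))
                b.history.append((a.strategy_code, log, -result))
        for agent in agents:
            agent.revise_strategy()
    return scores


if __name__ == "__main__":
    players = [Agent(f"agent_{k}", strategy_code="pass") for k in range(4)]
    print(run_tournament(players, rounds=3))
```

In this sketch the learning signal is exactly what the article describes: the code and logs from previous rounds, made visible to every agent before it revises its strategy for the next round.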