CATArena: A New Benchmark for Evaluating LLM Agent Learning Through Iterative Tournament Competitions

Published on November 4, 2025 at 05:00 AM
A team of researchers has introduced CATArena (Code Agent Tournament Arena), a novel benchmark designed to evaluate the learning abilities of Large Language Model (LLM) agents through iterative tournament competitions. CATArena addresses the limitations of current benchmarks, which primarily focus on end-to-end performance in fixed scenarios and often suffer from score saturation and reliance on expert annotation. The core of CATArena is an iterative, competitive peer-learning framework that allows agents to refine and optimize their strategies through repeated interactions and feedback. The platform features four diverse board and card games with open-ended scoring, providing tasks without explicit upper score limits and enabling continuous, dynamic evaluation of rapidly advancing agent capabilities. Key features of CATArena include:
  • Iterative Peer-Learning: Agents continuously revise their strategies based on feedback and outcomes from previous rounds (see the sketch following this list).
  • Open-Ended Games: The board and card games use open-ended scoring with no fixed upper bound, leaving continual room for agent improvement.
  • Comprehensive Evaluation Metrics: General, systematic metrics assess strategy coding, learning ability, and generalizability.
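
To make the iterative peer-learning loop concrete, here is a minimal Python sketch of a repeated round-robin tournament in which agents receive match feedback and revise their strategy code between rounds. The names `Agent`, `play_match`, and `revise_strategy` are hypothetical placeholders for illustration and are not the actual CATArena API.

```python
# Minimal sketch of an iterative peer-learning tournament loop (illustrative only).
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable

@dataclass
class Agent:
    name: str
    strategy_code: str                             # the strategy the agent has written so far
    history: list = field(default_factory=list)    # feedback accumulated from previous rounds

def run_tournament(agents: list[Agent],
                   play_match: Callable[[Agent, Agent], dict],
                   revise_strategy: Callable[[Agent, list], str],
                   rounds: int = 5) -> dict[str, list[int]]:
    """Repeat a round-robin tournament; between rounds, each agent rewrites
    its strategy using the feedback gathered so far."""
    wins_per_round: dict[str, list[int]] = {a.name: [] for a in agents}
    for _ in range(rounds):
        wins = {a.name: 0 for a in agents}
        for a, b in combinations(agents, 2):       # every pair of agents plays once
            result = play_match(a, b)              # e.g. {"winner": "A", "log": ...}
            if result.get("winner") is not None:
                wins[result["winner"]] += 1
            a.history.append(result)               # both sides observe the outcome
            b.history.append(result)
        for a in agents:
            wins_per_round[a.name].append(wins[a.name])
            # Peer learning step: revise the strategy using this round's feedback.
            a.strategy_code = revise_strategy(a, a.history)
    return wins_per_round
```

The per-round win counts returned here are what makes the evaluation open-ended and rankable: there is no score ceiling, only relative standing against peers in each round.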
Experiments involving both minimal and commercial code agents demonstrate that CATArena offers reliable, stable, and scalable benchmarking of core agent abilities, particularly learning ability and strategy coding. The researchers emphasize that strategy coding tasks in CATArena differ fundamentally from traditional LLM reasoning tasks, representing a novel evaluation dimension.

CATArena's extensible architecture can be adapted to other types of open-ended, rankable tasks, enabling assessment of core agent abilities in new domains as agent capabilities continue to advance. The framework also supports the incorporation of tasks with greater complexity and discriminative power, sustaining ongoing evaluation without the need for expert-level human annotation.
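
As one illustration of how open-ended, rankable results could feed a learning-ability signal, the sketch below turns the per-round win counts from the loop above into a simple round-over-round improvement trend. This particular metric is an assumption made for illustration; it is not the paper's actual evaluation metric.

```python
def learning_trend(wins_per_round: dict[str, list[int]]) -> dict[str, float]:
    """Illustrative only: mean round-over-round change in each agent's win count.
    A positive value suggests the agent improved against its peers over the rounds."""
    trend: dict[str, float] = {}
    for name, wins in wins_per_round.items():
        deltas = [later - earlier for earlier, later in zip(wins, wins[1:])]
        trend[name] = sum(deltas) / len(deltas) if deltas else 0.0
    return trend

# Example: an agent whose wins rise each round shows a positive trend.
print(learning_trend({"A": [1, 2, 3], "B": [3, 2, 1]}))  # {'A': 1.0, 'B': -1.0}
```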