CATArena: A New Benchmark for Evaluating LLM Agent Learning Through Iterative Tournament Competitions
Published on November 4, 2025 at 05:00 AM
A team of researchers has introduced CATArena (Code Agent Tournament Arena), a novel benchmark designed to evaluate the learning abilities of Large Language Model (LLM) agents through iterative tournament competitions. CATArena addresses the limitations of current benchmarks, which primarily focus on end-to-end performance in fixed scenarios and often suffer from score saturation and reliance on expert annotation.
The core of CATArena is an iterative, competitive peer-learning framework that allows agents to refine and optimize their strategies through repeated interactions and feedback. The platform features four diverse board and card games with open-ended scoring, providing tasks without explicit upper score limits and enabling continuous, dynamic evaluation of rapidly advancing agent capabilities.
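The announcement does not include the benchmark's harness code, but the structure of such an iterative peer-learning tournament can be sketched. The snippet below is a minimal, hypothetical illustration only; the `Agent`, `play_match`, and `run_tournament` names are assumptions for this sketch and are not CATArena's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """An LLM agent that submits a game-playing strategy as code."""
    name: str
    strategy: str = "initial strategy"
    history: list = field(default_factory=list)  # feedback from past rounds

    def revise(self, feedback: dict) -> None:
        # In a CATArena-style setup the agent would rewrite its strategy code
        # based on match outcomes; here we only record the feedback.
        self.history.append(feedback)
        self.strategy = f"revised strategy (round {len(self.history)})"


def play_match(a: Agent, b: Agent) -> dict:
    """Placeholder for running one game between two submitted strategies."""
    # A real harness would execute both strategies inside the game engine
    # and return open-ended scores rather than a bounded win/loss signal.
    return {a.name: 0.0, b.name: 0.0}


def run_tournament(agents: list[Agent], rounds: int) -> None:
    for round_idx in range(rounds):
        scores = {agent.name: 0.0 for agent in agents}
        # Round-robin: every agent plays every other agent once per round.
        for i, a in enumerate(agents):
            for b in agents[i + 1:]:
                for name, score in play_match(a, b).items():
                    scores[name] += score
        # Peer-learning step: each agent sees the round's outcomes and
        # revises its strategy before the next round begins.
        for agent in agents:
            agent.revise({"round": round_idx, "scores": scores})


if __name__ == "__main__":
    run_tournament([Agent("agent_a"), Agent("agent_b"), Agent("agent_c")], rounds=3)
```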
Key features of CATArena include:
- Iterative Peer-Learning: Agents continuously revise their strategies based on feedback and outcomes from previous rounds.
- Open-Ended Games: CATArena includes board and card games with no explicit upper score limit, leaving unbounded headroom for agent improvement.
- Comprehensive Evaluation Metrics: General, systematic metrics assess strategy coding, learning ability, and generalizability (a toy illustration follows below).
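The announcement does not spell out the metric formulas, so the following is only a hedged guess at how learning ability might be quantified in a tournament setting: tracking round-over-round improvement in an agent's open-ended score. The `learning_curve` function is a hypothetical illustration, not a metric defined by the CATArena authors.

```python
def learning_curve(scores_by_round: list[float]) -> list[float]:
    """Round-over-round score deltas: a rough proxy for learning ability.

    `scores_by_round` holds one open-ended score per tournament round;
    positive deltas suggest the agent improved after revising its strategy.
    """
    return [
        later - earlier
        for earlier, later in zip(scores_by_round, scores_by_round[1:])
    ]


# Example: an agent whose score keeps rising across rounds shows
# consistently positive deltas, i.e. it appears to learn from peer feedback.
print(learning_curve([10.0, 14.5, 21.0, 30.5]))  # [4.5, 6.5, 9.5]
```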