CATArena: New AI Tournament Platform Evaluates Learning and Strategy Coding in LLM Agents
Published on November 3, 2025 at 05:00 AM
A team of researchers has unveiled CATArena, a new evaluation platform designed to systematically measure the learning ability and strategy coding skills of Large Language Model (LLM) agents. Unlike existing benchmarks that focus on end-to-end performance in fixed scenarios, CATArena emphasizes continuous learning and adaptation through iterative tournament competitions.
The platform features a competitive peer-learning framework where agents refine their strategies through repeated interactions and feedback. CATArena incorporates four diverse board and card games with open-ended scoring, eliminating the score saturation issues found in traditional benchmarks. This design allows for ongoing, dynamic evaluation of rapidly advancing agent capabilities without requiring constant expert annotation.
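To make the iterative peer-learning loop concrete, the sketch below shows one way such a tournament could be structured: every agent plays every other agent across a set of games, receives its win rate as feedback, and revises its strategy before the next round. This is a minimal illustrative sketch only; the Agent class, play_match, revise_strategy, and the listed game names are assumptions for the example and do not reflect the actual CATArena implementation or game roster.

```python
# Toy sketch of an iterative peer-learning tournament loop.
# All names here (Agent, play_match, revise_strategy, GAMES) are
# illustrative assumptions, not the actual CATArena API.
from dataclasses import dataclass, field
from itertools import combinations
import random

GAMES = ["board_game_1", "board_game_2", "card_game_1", "card_game_2"]  # placeholder game set


@dataclass
class Agent:
    name: str
    skill: float = 0.5                      # stand-in for a coded strategy
    history: list = field(default_factory=list)

    def revise_strategy(self, feedback: float) -> None:
        """Nudge the strategy using feedback (win rate) from the last round."""
        self.skill += 0.1 * (feedback - 0.5)
        self.history.append(feedback)


def play_match(a: Agent, b: Agent, game: str) -> Agent:
    """Toy match: the agent with the stronger 'strategy' wins more often."""
    p_a = a.skill / (a.skill + b.skill)
    return a if random.random() < p_a else b


def run_round(agents: list[Agent]) -> dict[str, float]:
    """Round-robin over all games; return each agent's win rate as feedback."""
    wins = {ag.name: 0 for ag in agents}
    matches = 0
    for a, b in combinations(agents, 2):
        for game in GAMES:
            winner = play_match(a, b, game)
            wins[winner.name] += 1
            matches += 1
    games_per_agent = matches * 2 / len(agents)
    return {name: w / games_per_agent for name, w in wins.items()}


def tournament(agents: list[Agent], rounds: int = 5) -> None:
    """Iterate: compete, collect feedback, revise strategies, repeat."""
    for r in range(rounds):
        feedback = run_round(agents)
        for ag in agents:
            ag.revise_strategy(feedback[ag.name])
        print(f"round {r + 1}: " + ", ".join(f"{n}={s:.2f}" for n, s in feedback.items()))


if __name__ == "__main__":
    tournament([Agent("agent_a"), Agent("agent_b", skill=0.6), Agent("agent_c", skill=0.4)])
```

In CATArena itself, the "strategy" each agent revises is code it writes for the game, and the open-ended scoring leaves headroom for continued improvement across rounds rather than saturating at a fixed ceiling.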
Experiments with both minimal and commercial code agents demonstrate that CATArena provides reliable, stable, and scalable benchmarking of core agent abilities, particularly learning and strategy coding. The platform's extensible architecture can also adapt to other open-ended tasks, enabling assessment of these abilities in new domains; to that end, the researchers also introduced Machine Learning (ML) and multilingual tracks.
The key contributions of CATArena include:
- An iterative peer-learning framework where agents revise strategies based on feedback from previous rounds.
- A tournament-style benchmark using diverse open-ended games, including board and card games, with unlimited potential for agent improvement.
- Comprehensive evaluation metrics and comparative experiments demonstrating the reliability, stability, and extensibility of CATArena.