CATArena: New AI Tournament Platform Evaluates Learning and Strategic Coding in LLM Agents
Published on November 3, 2025 at 05:00 AM
A team of researchers has unveiled CATArena (Code Agent Tournament Arena), a new evaluation platform designed to assess the learning and strategic coding abilities of Large Language Model (LLM) agents. This framework addresses the limitations of existing benchmarks, which often focus on end-to-end performance in fixed scenarios, leading to score saturation and increased reliance on expert annotation.
CATArena uses an iterative peer-learning approach in which agents refine their strategies through repeated interactions and feedback within a competitive environment. The platform features four diverse board and card games with open-ended scoring, which removes explicit upper score limits and allows agent capabilities to be evaluated continuously. The competitive arena is inherently extensible and can be adapted to other types of open-ended tasks.
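To make the peer-learning loop more concrete, here is a minimal Python sketch of how such a tournament could be structured. The names used (`Agent`, `play_round_robin`, `run_tournament`, `revise_strategy`) are illustrative assumptions for exposition and do not reflect CATArena's actual implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical stand-in for an LLM code agent competing in the arena."""
    name: str
    strategy_code: str                 # code the agent submits for the game
    history: list = field(default_factory=list)

    def revise_strategy(self, feedback: dict) -> None:
        # In a real setting, an LLM agent would rewrite its strategy code here,
        # conditioned on the round's outcomes (and possibly opponents' strategies).
        self.history.append(feedback)


def play_round_robin(agents: list[Agent]) -> dict[str, float]:
    """Placeholder for one tournament round of a board/card game.

    A real implementation would execute each agent's strategy code in a
    sandboxed game engine and return open-ended scores (no fixed maximum).
    """
    return {a.name: float(len(a.history)) for a in agents}  # dummy scores


def run_tournament(agents: list[Agent], iterations: int = 3) -> list[dict]:
    """Iterative peer learning: play, observe feedback, revise, repeat."""
    results = []
    for _ in range(iterations):
        scores = play_round_robin(agents)
        results.append(scores)
        for agent in agents:
            agent.revise_strategy({"scores": scores})
    return results


if __name__ == "__main__":
    agents = [Agent("A", "strategy_a.py"), Agent("B", "strategy_b.py")]
    print(run_tournament(agents, iterations=3))
```

Because scoring is open-ended rather than capped, a loop like this can keep differentiating agents as they improve instead of saturating at a fixed ceiling.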
Experiments involving both minimal and commercial code agents show that CATArena provides reliable, stable, and scalable benchmarking of core agent abilities, particularly learning and strategy coding. The iterative peer-learning framework enables granular assessment of both basic coding skills and advanced learning capabilities.
The researchers designed general scoring metrics to systematically assess fundamental agent abilities, including strategy coding, learning, and generalizability. Comparative performance evaluations and data analysis were conducted using a self-developed minimal code agent and state-of-the-art commercial code agents.
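As a rough illustration of how such metrics might be derived from round-by-round tournament scores, the sketch below defines a learning score (improvement from the first to the last round) and a strategy-coding score (mean score across rounds). These particular definitions are assumptions made for exposition, not the paper's exact formulas.

```python
def learning_score(round_scores: list[dict[str, float]], agent: str) -> float:
    """Assumed learning metric: score gain from the first to the last round."""
    return round_scores[-1][agent] - round_scores[0][agent]


def strategy_coding_score(round_scores: list[dict[str, float]], agent: str) -> float:
    """Assumed strategy-coding metric: mean score across all rounds."""
    return sum(r[agent] for r in round_scores) / len(round_scores)
```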
Key findings include:
- The performance gap among LLMs is more pronounced in minimal agents compared to commercial agents, suggesting that the underlying agent framework significantly influences the effective utilization of an LLM's capabilities.
- Participating agents display different ranking orders across various core abilities, offering insights for further optimization of both LLMs and agent frameworks.
- Agents exhibit varied performance distributions across different tasks, indicating that the distinct nature and difficulty of the tasks impact agent performance.