CATArena: New AI Tournament Benchmark Focuses on Learning and Strategy Coding in LLM Agents
Published on November 4, 2025 at 05:00 AM
A new benchmark, CATArena, has been developed to evaluate the learning abilities of Large Language Model (LLM) agents through iterative tournament competitions. The framework treats learning ability, encompassing both self-improvement and peer learning, as a core driver of agent evolution toward human-level intelligence.
CATArena addresses the limitations of current benchmarks, which primarily assess end-to-end performance in fixed scenarios, leading to score saturation and growing dependence on expert annotation.
The tournament-style evaluation platform features four diverse board and card games with open-ended scoring. Because the tasks have no explicit upper score limit, the platform supports continuous, dynamic evaluation of rapidly advancing agent capabilities. At its core is an iterative, competitive peer-learning loop: agents refine their strategies through repeated matches and feedback, and the framework systematically measures how well they learn from that process.
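To make that loop concrete, here is a minimal, hypothetical sketch of an iterative round-robin tournament in Python. The `Agent`, `play_game`, and `revise_strategy` names are illustrative assumptions, not CATArena's actual API; in the real system the revision step would be performed by an LLM rewriting its own strategy code.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Assumption: a "strategy" is code an agent writes to play a game,
# modeled here as a callable that maps a game state to a move.
Strategy = Callable[[dict], str]


@dataclass
class Agent:
    name: str
    strategy: Strategy
    history: List[dict] = field(default_factory=list)

    def revise_strategy(self, feedback: List[dict]) -> None:
        """Placeholder for the LLM-driven peer-learning step: the agent
        inspects match feedback (results, opponent behavior) and rewrites
        its strategy. Here it only records the feedback."""
        self.history.extend(feedback)


def play_game(a: Strategy, b: Strategy) -> Tuple[float, float]:
    """Placeholder game engine returning open-ended scores for both players."""
    return 1.0, 0.0  # stub result


def run_tournament(agents: List[Agent], rounds: int = 3) -> Dict[str, float]:
    """Iterative round-robin: play all pairings, share feedback,
    let agents revise their strategies, then repeat."""
    scores: Dict[str, float] = {a.name: 0.0 for a in agents}
    for _ in range(rounds):
        feedback: Dict[str, List[dict]] = {a.name: [] for a in agents}
        for a, b in combinations(agents, 2):
            sa, sb = play_game(a.strategy, b.strategy)
            scores[a.name] += sa
            scores[b.name] += sb
            feedback[a.name].append({"opponent": b.name, "score": sa})
            feedback[b.name].append({"opponent": a.name, "score": sb})
        # Peer-learning step between rounds: each agent sees its own results.
        for agent in agents:
            agent.revise_strategy(feedback[agent.name])
    return scores
```

The ranking induced by the accumulated scores is what makes the task open-ended and rankable: there is no fixed maximum, so stronger strategies can keep pulling ahead across rounds.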
Experimental results with both minimal and commercial code agents demonstrate that CATArena provides reliable, stable, and scalable benchmarking of core agent abilities, particularly learning ability and strategy coding. The framework is extensible and can be adapted to other open-ended, rankable tasks, enabling the assessment of these abilities in new domains.