CATArena: New Framework Evaluates LLM Agent Learning Through Competitive Tournaments

Published on November 3, 2025 at 05:00 AM
Researchers from Shanghai Jiao Tong University, AGI-Eval, and Meituan have unveiled CATArena, a new benchmark designed to evaluate the learning abilities of Large Language Model (LLM) agents. The iterative, competitive peer-learning framework addresses limitations of traditional benchmarks, which often focus on end-to-end performance in fixed scenarios and suffer from score saturation as agent capabilities improve.

CATArena treats learning ability, encompassing both self-improvement and peer learning, as a core driver of agent evolution toward human-level intelligence. Through repeated interactions and feedback, agents can systematically refine their strategies. To avoid score saturation, the framework features four diverse board and card games with open-ended scoring, enabling continuous and dynamic evaluation. Its tournament-style format and scoring system are designed to quantify core agent abilities, particularly learning ability and strategy coding.

Experiments with both minimal and commercial code agents show that CATArena provides reliable, stable, and scalable benchmarking. The framework is inherently extensible and can be adapted to other open-ended, rankable tasks, facilitating the assessment of core agent abilities in new domains. The CATArena benchmark is available on GitHub.
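The article gives no implementation details, but the evaluation loop it describes (agents competing in round-robin games, receiving ranking feedback, and iteratively refining their strategies under an open-ended score) can be sketched roughly as follows. Everything in this sketch is a hypothetical illustration: the names run_tournament and revise_strategy and the toy game are assumptions, not the CATArena codebase.

```python
# Illustrative sketch only -- not the actual CATArena API. It mimics the loop the
# article describes: agents play round-robin matches, accumulate an open-ended
# score, and then revise their strategies based on the tournament feedback.
import itertools
import random
from typing import Callable, Dict

Strategy = Callable[[], int]  # hypothetical: a "strategy" is just a move generator here


def play_match(strategy_a: Strategy, strategy_b: Strategy) -> int:
    """Toy stand-in for a single game: higher move wins (+1 A wins, -1 B wins, 0 draw)."""
    a, b = strategy_a(), strategy_b()
    return (a > b) - (a < b)


def run_tournament(strategies: Dict[str, Strategy], games_per_pair: int = 10) -> Dict[str, int]:
    """Round-robin tournament; scores are open-ended and accumulate across all matches."""
    scores = {name: 0 for name in strategies}
    for (name_a, strat_a), (name_b, strat_b) in itertools.combinations(strategies.items(), 2):
        for _ in range(games_per_pair):
            result = play_match(strat_a, strat_b)
            scores[name_a] += result
            scores[name_b] -= result
    return scores


def revise_strategy(name: str, old: Strategy, scores: Dict[str, int]) -> Strategy:
    """Placeholder for peer learning: lower-ranked agents explore more aggressively."""
    rank = sorted(scores, key=scores.get, reverse=True).index(name)
    return lambda: old() + random.randint(0, rank)


# Seed the arena with simple baseline strategies (purely illustrative).
agents: Dict[str, Strategy] = {
    "agent_a": lambda: random.randint(0, 3),
    "agent_b": lambda: random.randint(0, 5),
    "agent_c": lambda: random.randint(0, 7),
}

for round_idx in range(3):  # iterative rounds: play, observe feedback, refine
    round_scores = run_tournament(agents)
    print(f"round {round_idx}: {round_scores}")
    agents = {name: revise_strategy(name, strat, round_scores)
              for name, strat in agents.items()}
```

The property mirrored here is that scores are unbounded rather than capped, so evaluation does not saturate as agents improve, which is the motivation the article gives for CATArena's open-ended scoring.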