OpenAI's o3 tops AI science rankings
Source: nature.com
AI Model o3 Ranks Best in Science Question Answering
o3, an artificial intelligence (AI) model from OpenAI, the firm behind ChatGPT, has been ranked the top AI tool for answering science questions across a range of fields, according to a recently launched benchmarking platform.
SciArena, created by the Allen Institute for Artificial Intelligence (Ai2) in Seattle, Washington, assessed 23 large language models (LLMs) on their responses to scientific questions, with the quality of the answers judged by 102 researchers.
Rankings and Performance
o3, developed by OpenAI in San Francisco, California, was rated best at answering questions in the natural sciences, health care, engineering, and the humanities and social sciences, on the basis of more than 13,000 votes. DeepSeek-R1, developed by DeepSeek in Hangzhou, China, ranked second in the natural sciences and fourth in engineering. Google’s Gemini-2.5-Pro came third in the natural sciences and fifth in both engineering and health care.
Arman Cohan, a research scientist at Ai2, suggests that users might prefer o3 because it tends to provide detailed information from the literature it cites and to offer technically nuanced answers. He also notes that it is hard to explain variations in model performance because most of the models are proprietary, although differences in training data and optimization goals could be factors.
SciArena Platform
SciArena is a platform for evaluating AI models on specific tasks and is among the first to rank performance on scientific tasks using crowdsourced feedback. Rahul Shome, a robotics and AI researcher at the Australian National University in Canberra, describes SciArena as a positive step that encourages careful evaluation of LLM-assisted literature tasks.
To rank the LLMs, SciArena asked researchers to submit scientific questions and returned answers from two randomly selected models, each including references drawn from Semantic Scholar, an AI-powered research tool also developed by Ai2. Users then voted on whether one model gave the better answer, whether the two were comparable, or whether both performed poorly. The platform is publicly accessible: anyone can ask research questions and vote on model performance, but only votes from verified users who agree to its terms count towards the frequently updated leaderboard.
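The article does not say how SciArena turns these pairwise votes into rankings, but crowdsourced comparison leaderboards of this kind are commonly maintained with Elo-style ratings. The sketch below is a minimal illustration of that general approach, not SciArena's actual implementation; the model names, the starting rating of 1000, and the update constant K = 32 are all illustrative assumptions.

# Minimal Elo-style leaderboard sketch for pairwise model votes.
# Assumptions (not from the article): standard Elo updates with K = 32,
# ties ("comparable" or "both poor") scored as 0.5, hypothetical model names.

from collections import defaultdict

K = 32  # update step size; larger values react faster to new votes

ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(model_a: str, model_b: str, outcome: str) -> None:
    """Update both ratings from one vote: 'a', 'b', or 'tie'."""
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] -= K * (score_a - e_a)  # B's update mirrors A's

# Example: three votes between two hypothetical models
record_vote("model-x", "model-y", "a")
record_vote("model-x", "model-y", "tie")
record_vote("model-y", "model-x", "b")

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for model, rating in leaderboard:
    print(f"{model}: {rating:.1f}")

One property of this kind of scheme, which suits a platform with rolling public participation, is that the leaderboard can be updated incrementally after every vote rather than recomputed from scratch.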
Jonathan Kummerfeld, an AI researcher at the University of Sydney in Australia, believes that the ability to question LLMs on science topics and trust the answers will help researchers stay informed about the latest literature in their fields. He adds, “This will help researchers find work they may have otherwise missed.”